CN116049459A - Cross-modal mutual retrieval method, device, server and storage medium - Google Patents


Publication number
CN116049459A
CN116049459A (application CN202310324164.9A)
Authority
CN
China
Prior art keywords
text
feature
image
layer
recoding
Prior art date
Legal status
Granted
Application number
CN202310324164.9A
Other languages
Chinese (zh)
Other versions
CN116049459B (en)
Inventor
赵坤
王立
李仁刚
赵雅倩
范宝余
鲁璐
郭振华
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202310324164.9A priority Critical patent/CN116049459B/en
Publication of CN116049459A publication Critical patent/CN116049459A/en
Application granted granted Critical
Publication of CN116049459B publication Critical patent/CN116049459B/en

Classifications

    • G06F16/53 Querying (information retrieval of still image data)
    • G06F16/55 Clustering; Classification (still image data)
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5846 Retrieval using metadata automatically derived from the content, using extracted text
    • G06N3/084 Backpropagation, e.g. using gradient descent (neural network learning methods)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a cross-modal mutual retrieval method, device, server and storage medium, and relates to the technical field of data processing. The training method comprises the following steps: constructing a text information feature encoder and an image sequence feature encoder; constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network; constructing an alignment loss function based on the positive and negative sample sets of each sample; and training the initial image text retrieval network based on the alignment loss function and the training data to obtain a multimodal image text retrieval network, so as to improve the accuracy of mutual retrieval between multi-structure text data and image data.

Description

Cross-modal mutual retrieval method, device, server and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a training method for a cross-modal mutual retrieval neural network, another training method, three cross-modal mutual retrieval methods, a training device, an image text retrieval device, a server, and a computer-readable storage medium.
Background
With the continuous development of information technology, artificial intelligence technology is applied in more and more fields to improve the efficiency and effect of data processing. In the field of recognition of text data and image data, a corresponding model can be used for recognition to obtain a regression result or a classification result.
In the related art, the multimodal field requires a mutual retrieval task between the text and image-sequence components of multimodal data. Here, multimodal means that the data comprises both text and image sequences, and multi-structure text means text that can be divided into multiple structural categories according to its semantics. The commonly adopted retrieval networks cannot effectively process multi-structure text, which degrades the retrieval effect on multimodal data and leads to lower reasoning accuracy.
Therefore, how to improve the effect of processing multimodal data and the accuracy of reasoning is an important issue for those skilled in the art.
Disclosure of Invention
The invention aims to provide a training method for a cross-modal mutual retrieval neural network, cross-modal mutual retrieval methods, a training device, an image text retrieval device, a server and a computer-readable storage medium, so as to realize the processing of multi-structure text data and improve the accuracy of mutual retrieval between multi-structure text data and image data.
In order to solve the above technical problems, the present application provides a training method for cross-modal mutual search neural network, including:
constructing a text information feature encoder and an image sequence feature encoder; wherein the text information feature encoder comprises: a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
constructing an alignment loss function based on the positive and negative sample sets of each sample;
training the initial image text retrieval network based on the alignment loss function and training data to obtain a multimodal image text retrieval network.
Optionally, the text encoding layer is configured to perform feature encoding on the input multi-structure text data to obtain the feature vector of each word, process all feature vectors of the multi-structure text data through the attention network, and obtain and output the feature codes of the multi-structure text data to the attribute path establishment layer;
the attribute path establishment layer is used for carrying out attribute connection on feature codes of the multi-structure text data of all samples based on attribute information of all samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to the recoding layer;
The recoding layer is used for aggregating the feature codes of the sub-samples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to the recoding feature secondary aggregation layer;
and the recoding feature secondary aggregation layer is used for carrying out secondary aggregation on all recoding features based on the weight of each recoding feature to obtain text coding features of corresponding samples.
Optionally, the text information feature encoder further includes: a sample traversal unit, configured to traverse all samples to obtain the text encoding features corresponding to each sample.
Optionally, the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit and an image sequence integral feature extraction unit.
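For illustration only (not part of the patent disclosure), the following is a minimal structural sketch of the text information feature encoder described above, assuming PyTorch; the Transformer text layer, the dimensions, the simple averaging used in the recoding step, and the trainable vector q / matrix W of the secondary aggregation are illustrative assumptions, not the exact construction of the application.

```python
import torch
import torch.nn as nn

class TextInfoEncoder(nn.Module):
    """Sketch: text encoding -> recoding per neighbor graph -> secondary aggregation."""

    def __init__(self, vocab: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Text encoding layer: an attention (Transformer) encoder over characters.
        self.text_layer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), 2)
        # Secondary aggregation parameters: trainable matrix W and vector q (assumed names).
        self.W = nn.Linear(dim, dim)
        self.q = nn.Parameter(torch.randn(dim))

    def forward(self, tokens: torch.Tensor, neighbor_feats: list) -> torch.Tensor:
        # tokens: (1, T) character ids of one sample's main text.
        g = self.text_layer(self.embed(tokens)).mean(dim=1).squeeze(0)  # node feature
        # Recoding layer: one recoded feature per neighbor relation graph
        # (neighbor weighting simplified to a plain mean here).
        recoded = torch.stack([0.5 * (g + nb.mean(dim=0)) for nb in neighbor_feats])
        # Recoding-feature secondary aggregation layer: weight and sum the graphs.
        beta = torch.softmax(torch.relu(self.W(recoded)) @ self.q, dim=0)
        return (beta.unsqueeze(1) * recoded).sum(dim=0)  # text encoding feature
```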
Optionally, constructing the alignment loss function based on the positive and negative sample sets of each sample includes:
determining a positive sample set and a negative sample set for each sample based on attribute connections between the corresponding sample and other samples;
an alignment loss function is constructed based on the positive and negative sample sets, and this alignment loss function is used as the training loss function.
Optionally, constructing the alignment loss function based on the positive and negative sample sets of each sample includes:
determining a positive sample set and a negative sample set for each sample based on the attribute connections between the corresponding sample and other samples;
constructing a text feature to image feature alignment loss function and an image feature to text feature alignment loss function based on the positive and negative sample sets of each sample; wherein, any one alignment loss function includes: an alignment loss function of the corresponding feature, a contrast loss function of the positive sample group expanded with the attribute path constraint, and a contrast loss function of the negative sample expanded with the attribute path constraint;
and taking the sum of the text-feature-to-image-feature alignment loss function and the image-feature-to-text-feature alignment loss function as the overall alignment loss function.
The application also provides a cross-modal mutual retrieval method, which comprises the following steps:
when multi-structure text data is input, performing feature encoding on the text information by a text information feature encoder of a multimodal image text retrieval network to obtain corresponding text encoding features; wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
when image data is input, performing feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features;
and reasoning on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain retrieval results.
Optionally, when multi-structure text data is input, performing feature encoding on the text information by the text information feature encoder of the multimodal image text retrieval network to obtain corresponding text encoding features includes:
performing feature coding on the multi-structure text data to obtain feature vectors of each word, processing all the feature vectors of the multi-structure text data through an attention network to obtain and output the feature codes of the multi-structure text data to an attribute path establishment layer;
performing attribute connection on feature codes of the multi-structure text data of all samples based on the attribute information of all samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to a recoding layer;
aggregating the feature codes of the subsamples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to a recoding feature secondary aggregation layer;
and performing secondary aggregation on all recoding features based on the weight of each recoding feature to obtain text coding features of corresponding samples.
Optionally, the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit and an image sequence integral feature extraction unit.
The application also provides a training method of the cross-modal mutual retrieval neural network, which comprises the following steps:
the client sends a network training instruction to the server, so that the server constructs a text information feature encoder and an image sequence feature encoder, wherein the text information feature encoder comprises: a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; constructs a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network; constructs an alignment loss function based on the positive and negative sample sets of each sample; trains the initial image text retrieval network based on the alignment loss function and training data to obtain a multimodal image text retrieval network; and transmits the multimodal image text retrieval network;
the client receives the multimodal image text retrieval network and displays a training completion message.
The application also provides a cross-modal mutual retrieval method, which comprises the following steps:
the client inputs data to be retrieved to the server, so that, when multi-structure text data is input, the server performs feature encoding on the text information by a text information feature encoder of a multimodal image text retrieval network to obtain corresponding text encoding features, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; when image data is input, performs feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features; and reasons on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain and send retrieval results;
and the client receives the search result and displays the search result.
The application also provides a cross-modal mutual retrieval method, which comprises the following steps:
the server receives data to be retrieved input by the client;
when multi-structure text data is input, performing feature encoding on the text information by a text information feature encoder of a multimodal image text retrieval network to obtain corresponding text encoding features; wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
when image data is input, performing feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features;
reasoning on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain retrieval results;
and sending the search result to the client so that the client displays the search result.
The application also provides a training device of the cross-modal mutual retrieval neural network, which comprises:
the text encoding module is used for constructing a text information feature encoder and an image sequence feature encoder; wherein the text information feature encoder comprises: a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
the image coding module is used for constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
a loss function construction module for constructing an aligned loss function based on the positive and negative sample sets of each sample;
and the model training module is used for training the initial image text retrieval network based on the alignment loss function and training data to obtain a multimodal image text retrieval network.
The application also provides an image text retrieval device, comprising:
the text data processing module is used for performing feature encoding on the text information through a text information feature encoder of a multimodal image text retrieval network when multi-structure text data is input, to obtain corresponding text encoding features; wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
the image data processing module is used for performing feature encoding on the image data through an image sequence feature encoder of the multimodal image text retrieval network when image data is input, to obtain corresponding image encoding features;
and the feature reasoning module is used for reasoning on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain retrieval results.
The application also provides a server comprising:
a memory for storing a computer program;
a processor for implementing, when executing the computer program, the steps of the training method of the cross-modal mutual retrieval neural network and/or the steps of the cross-modal mutual retrieval method described above.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method of the cross-modal mutual retrieval neural network and/or the steps of the cross-modal mutual retrieval method described above.
The training method of the cross-modal mutual retrieval neural network provided by the application comprises the following steps: constructing a text information feature encoder and an image sequence feature encoder, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network; constructing an alignment loss function based on the positive and negative sample sets of each sample; and training the initial image text retrieval network based on the alignment loss function and training data to obtain a multimodal image text retrieval network.
The advantage is that the constructed text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; an initial image text retrieval network capable of processing multi-structure text data is then constructed and trained, yielding an image text retrieval network that can process multi-structure text data. This realizes the processing of multi-structure text data, improves the retrieval effect on multimodal data, and improves the reasoning accuracy.
The application further provides another training method of the cross-modal mutual retrieval neural network, three further cross-modal mutual retrieval methods, a training device of the cross-modal mutual retrieval neural network, an image text retrieval device, a server and a computer-readable storage medium, which have the same advantages and are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a method for training a cross-modal cross-detection cable neural network according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an attention network of a method for training a cross-modal cross-detection neural network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an attribute path according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a neighbor relation diagram according to an embodiment of the present application;
FIG. 5 is a schematic diagram of secondary aggregation provided in the examples herein;
FIG. 6 is a schematic diagram of an image encoder according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training device of a cross-modal mutual retrieval neural network according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image text retrieval device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a training method for a cross-modal mutual retrieval neural network, cross-modal mutual retrieval methods, a training device, an image text retrieval device, a server and a computer-readable storage medium, so as to realize the processing of multi-structure text data and improve the accuracy of mutual retrieval between multi-structure text data and image data.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the related art, the multimodal field requires a mutual retrieval task between the text and image-sequence components of multimodal data. Here, multimodal means that the data comprises both text and image sequences, and multi-structure text means text that can be divided into multiple structural categories according to its semantics. The commonly adopted retrieval networks cannot effectively process multi-structure text, which degrades the retrieval effect on multimodal data and leads to lower reasoning accuracy.
Therefore, the application provides a training method of a cross-modal mutual retrieval neural network: the constructed text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; an initial image text retrieval network capable of processing multi-structure text data is then constructed and trained to obtain an image text retrieval network that can process multi-structure text data, realizing the processing of multi-structure text data, improving the retrieval effect on multimodal data, and improving the reasoning accuracy.
The following describes, by way of an embodiment, a training method of a cross-modal mutual retrieval neural network provided by the present application.
Referring to fig. 1, fig. 1 is a flowchart of a training method of a cross-modal mutual retrieval neural network according to an embodiment of the present application.
In this embodiment, the method may include:
s101, constructing a text information feature encoder and an image sequence feature encoder;
It can be seen that this step aims at constructing a text information feature encoder and an image sequence feature encoder, wherein the text information feature encoder comprises: a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer.
The multi-structure text data can be processed through the text encoding layer, the attribute path establishment layer, the recoding layer and the recoding feature secondary aggregation layer, improving the accuracy and precision of processing multi-structure text data and, in turn, the processing effect on multimodal data.
Further, the text coding layer is used for carrying out feature coding on the input multi-structure text data to obtain feature vectors of each word, processing all the feature vectors of the multi-structure text data through the attention network to obtain and output the feature codes of the multi-structure text data to the attribute path establishment layer;
The attribute path establishment layer is used for carrying out attribute connection on the feature codes of the multi-structure text data of all the samples based on the attribute information of all the samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to the recoding layer;
the recoding layer is used for aggregating the feature codes of the subsamples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to the recoding feature secondary aggregation layer;
and the recoding feature secondary aggregation layer is used for carrying out secondary aggregation on all recoding features based on the weight of each recoding feature to obtain the text coding features of the corresponding sample.
Further, the text information feature encoder further includes: a sample traversal unit, configured to traverse all samples to obtain the text encoding features corresponding to each sample.
Further, the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit and an image sequence integral feature extraction unit.
S102, constructing a retrieval network based on a text information feature encoder and an image sequence feature encoder to obtain an initial image text retrieval network;
on the basis of S101, this step aims at performing a search network construction based on the text information feature encoder and the image sequence feature encoder, to obtain an initial image text search network.
The method for constructing the search network may be any search method provided in the prior art, and is not specifically limited herein.
S103, constructing an alignment loss function based on the positive sample group and the negative sample group of each sample; here, each sample consists of the multi-structure text encoding feature output by the text information feature encoder and the image sequence encoding feature output by the image sequence feature encoder;
on the basis of S102, this step aims at constructing an alignment loss function based on the positive and negative sample groups of each sample.
Further, the step may include:
step 1, determining a positive sample group and a negative sample group of corresponding samples based on attribute connection between each sample and other samples;
Step 2, constructing an alignment loss function based on the positive sample group and the negative sample group, and using this alignment loss function as the training loss function.
It can be seen that this alternative is mainly illustrative of how the loss function is constructed. In this alternative, positive and negative sample sets for each sample are determined based on the attribute connections between the sample and other samples; an alignment loss function is constructed based on the positive and negative sample sets, with the alignment loss function being the loss function.
Further, the step may include:
Step 1, determining a positive sample group and a negative sample group of the corresponding sample based on the attribute connections between each sample and the other samples;
Step 2, constructing a text-feature-to-image-feature alignment loss function and an image-feature-to-text-feature alignment loss function based on the positive sample group and the negative sample group of each sample; wherein either alignment loss function includes: an alignment loss term for the corresponding feature, a contrast loss term for the positive sample group expanded with the attribute path constraint, and a contrast loss term for the negative samples expanded with the attribute path constraint;
Step 3, taking the sum of the text-feature-to-image-feature alignment loss function and the image-feature-to-text-feature alignment loss function as the overall alignment loss function.
S104, training the initial image text retrieval network based on the alignment loss function and the training data to obtain the multimodal image text retrieval network.
On the basis of S103, this step aims at training the initial image text retrieval network based on the alignment loss function and the training data, resulting in a multimodal image text retrieval network.
The training method for the initial image text retrieval network may be any training method provided in the prior art, and is not specifically limited herein.
In summary, the constructed text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; an initial image text retrieval network capable of processing multi-structure text data is then constructed and trained, yielding an image text retrieval network that processes multi-structure text data. This realizes the processing of multi-structure text data, improves the retrieval effect on multimodal data, and improves the reasoning accuracy.
A cross-modal mutual retrieval method provided by the present application is described below by way of one embodiment.
In this embodiment, the method may include:
s201, when multi-structure text data are input, a text information feature encoder based on a multi-mode image text retrieval network performs feature encoding on the text information to obtain corresponding text encoding features; the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
further, the step may include:
step 1, feature coding is carried out on multi-structure text data to obtain feature vectors of each word, all the feature vectors of the multi-structure text data are processed through an attention network to obtain and output the feature codes of the multi-structure text data to an attribute path establishment layer;
Step 2, performing attribute connection on feature codes of the multi-structure text data of all samples based on the attribute information of all samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to a recoding layer;
step 3, aggregating the feature codes of the subsamples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to a recoding feature secondary aggregation layer;
Step 4, performing secondary aggregation on all recoding features based on the weight of each recoding feature to obtain the text encoding features of the corresponding sample.
S202, when image data is input, performing feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features;
further, the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit and an image sequence integral feature extraction unit.
S203, reasoning on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain a retrieval result.
It can be seen that the constructed text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; an initial image text retrieval network capable of processing multi-structure text data is then constructed and trained, yielding an image text retrieval network that processes multi-structure text data. This realizes the processing of multi-structure text data, improves the retrieval effect on multimodal data, and improves the reasoning accuracy.
The following describes, by way of an embodiment, a training method of a cross-modal mutual retrieval neural network provided by the present application.
In this embodiment, the method may include:
step 1, the client sends a network training instruction to the server, so that the server constructs a text information feature encoder and an image sequence feature encoder, wherein the text information feature encoder comprises: a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; constructs a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network; constructs an alignment loss function based on the positive and negative sample sets of each sample; trains the initial image text retrieval network based on the alignment loss function and training data to obtain a multimodal image text retrieval network; and transmits the multimodal image text retrieval network;
step 2, the client receives the multimodal image text retrieval network and displays a training completion message.
A cross-modal mutual retrieval method provided by the present application is described below by way of one embodiment.
In this embodiment, the method may include:
Step 1, the client inputs data to be retrieved to the server, so that, when multi-structure text data is input, the server performs feature encoding on the text information by a text information feature encoder of a multimodal image text retrieval network to obtain corresponding text encoding features, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer; when image data is input, performs feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features; and reasons on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain and send retrieval results;
Step 2, the client receives the retrieval result and displays it.
A cross-modal mutual retrieval method provided by the present application is described below by way of one embodiment.
In this embodiment, the method may include:
step 1, a server receives data to be retrieved input by a client;
step 2, when multi-structure text data is input, performing feature encoding on the text information by a text information feature encoder of a multimodal image text retrieval network to obtain corresponding text encoding features; wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
step 3, when image data is input, performing feature encoding on the image data by an image sequence feature encoder of the multimodal image text retrieval network to obtain corresponding image encoding features;
step 4, reasoning on the text encoding features or the image encoding features through an output layer of the multimodal image text retrieval network to obtain a retrieval result;
and step 5, sending the search result to the client so that the client displays the search result.
The following further describes, through another specific embodiment, the training method of the cross-modal mutual retrieval neural network provided in the present application.
This embodiment realizes the mutual retrieval task between multi-structure text and image sequences, where either the multi-structure text or the image sequence may exhibit missing modalities.
The present embodiment may include: text encoder, image encoder, loss function.
Wherein, the text encoder:
in this embodiment, the menu multi-modal data is taken as an example. Text data contains 3 classes: 1) menu operation step text (composed of multiple sentences), 2) process and 3) main material. The following is shown:
sample 1: parching ovum gallus Domesticus with fructus Lycopersici Esculenti.
The menu operation steps are as follows: 1) Frying eggs by hot oil; 2) Adding sugar salt into the hot oil stir-fried tomatoes to obtain juice; 3) Adding the fried eggs, adding seasonings, and stir-frying uniformly; 4) And (5) taking out of the pot.
The process comprises the following steps: parching.
Main material: tomato, egg.
From the above multi-structure text, recipe operation steps can be defined as main text, also called search text. The process and the master are called attribute text.
The main text, which contains the recipe step text information, is first feature-encoded. After the step text information of each sample is obtained, each word is converted into a feature vector using the word2vec method.
The feature vectors of all texts are input into a Transformer network to obtain the final feature expression of all texts. The feature of a node is the feature encoding of all characters of one sample.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an attention network of a training method of a cross-modal mutual retrieval neural network according to an embodiment of the present application.
As shown in fig. 2, text 1 represents the step text.
Each word is converted into a feature vector emb by the word2vec method. The text type is then obtained; in this embodiment, step text is represented by text type 1, a predefined number.
Next, the text position information is obtained, i.e., the position of each character within its text. For example, for the phrase "cut the tomatoes" in text 1, the first character has position information 1, the second character position information 2, the third character position information 3, and so on; the corresponding position information of all characters in the text is obtained in turn.
The emb feature of the text, its position information feature and its type information feature (3 items) are added to obtain the final input feature vector of the text, which is input into the Transformer network.
The Transformer network then produces the output feature vectors of all characters, each character corresponding to one feature vector output by the network. The network comprises: a masked multi-head attention with its corresponding normalization, and forward propagation (a feed-forward layer) with its corresponding normalization. The input feature vector is first processed by the masked multi-head attention and its normalization, and the resulting features are then processed by the feed-forward layer and its normalization. The specific processing procedure of the Transformer network may be any processing method provided in the prior art, which is not specifically limited herein.
In this embodiment, the average of the output feature vectors of all characters is taken as the node feature of one sample. Traversing all samples gives the node feature of each sample, recorded as $g_i$, where $i$ represents the $i$-th sample and $g_i$ represents the feature encoding of the multi-structure text of sample $i$.
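For illustration only, the following sketch shows the input-feature construction described above (emb feature + position feature + type feature, fed into a Transformer encoder, then mean-pooled into the node feature), assuming PyTorch; all sizes and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, vocab, max_len, n_types = 512, 30000, 256, 4

tok_emb = nn.Embedding(vocab, dim)     # emb feature of each character
pos_emb = nn.Embedding(max_len, dim)   # text position information feature
typ_emb = nn.Embedding(n_types, dim)   # text type feature (step text -> type 1)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6,
)

token_ids = torch.randint(0, vocab, (1, 12))   # one sample, 12 characters
positions = torch.arange(12).unsqueeze(0)      # position 1, 2, 3, ... per character
types = torch.full((1, 12), 1)                 # step text => predefined type 1

x = tok_emb(token_ids) + pos_emb(positions) + typ_emb(types)  # final input feature
out = encoder(x)         # one output feature vector per character
g_i = out.mean(dim=1)    # node feature of the sample: mean over all characters
```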
2) Establishing the attribute paths.
Referring to fig. 3, fig. 3 is a schematic diagram of an attribute path according to an embodiment of the present application.
Based on the attribute information of all samples, an attribute connection is constructed as shown in fig. 3.
As can be seen, this embodiment establishes 2 attribute paths: dish-main material-dish and dish-process-dish.
As shown in fig. 3: sample 1 is stewed beef with tomatoes, and the neighbor relation of the sample is determined according to dish-main material-dish, namely the main-material path. Samples 1, 2, 3 are neighbor samples because they can be connected through the main-material path. Similarly, the neighbor relation of the samples is determined according to the process path, dish-process-dish; samples 1, 4 and 5 are neighbor samples because their processes are the same.
Referring to fig. 4, fig. 4 is a schematic diagram of a neighbor relation diagram according to an embodiment of the present application.
Based on the attribute information, an attribute connection is constructed as shown in fig. 4.
Wherein, different neighbor relation diagrams are established according to different attribute paths of the sample 1, as shown in fig. 4.
Sample 1 may have a plurality of neighbor relation graphs, and may be established according to a plurality of different attribute paths.
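As an informal illustration of the attribute connections just described, the sketch below builds one neighbor relation graph per attribute path (main material, process) from toy recipe records; the data and field names are hypothetical, not from the patent.

```python
from collections import defaultdict

# Toy samples: each has a set of main materials and a process (illustrative).
samples = {
    1: {"main": {"tomato", "beef"}, "process": "stew"},
    2: {"main": {"tomato", "egg"},  "process": "stir-fry"},
    3: {"main": {"beef", "potato"}, "process": "stew"},
}

def neighbor_graphs(samples):
    """One neighbor relation graph per attribute path."""
    graphs = {"main": defaultdict(set), "process": defaultdict(set)}
    ids = list(samples)
    for i in ids:
        for j in ids:
            if i == j:
                continue
            if samples[i]["main"] & samples[j]["main"]:         # dish-main material-dish
                graphs["main"][i].add(j)
            if samples[i]["process"] == samples[j]["process"]:  # dish-process-dish
                graphs["process"][i].add(j)
    return graphs

print(neighbor_graphs(samples))  # e.g. samples 1 and 3 are neighbors on both paths
```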
3) Compute the recoding features of all samples based on the neighbor relation graphs.
Traverse each sample and, for all of its neighbor relation graphs, compute the recoding feature of each neighbor relation graph in turn. As shown in fig. 4, the features of sample 2 and sample 3 are first aggregated into sample 1 according to neighbor relation graph 1.
Wherein the features of samples 1, 2, 3 represent the feature encodings of the recipe operation step text of samples 1, 2, 3, which have already been encoded above. What follows is the operation of computing the aggregation weights and summing the features. The formula is as follows:

$$\alpha_{ij} = \sigma\big(d(g_i,\, g_j)\big)$$

wherein $\alpha_{ij}$ represents the weight relation between sample feature $i$ (i.e. $g_i$) and sample feature $j$ (i.e. $g_j$), shown as weight 1 in fig. 4; $d(\cdot,\cdot)$ represents the Euclidean distance; $\sigma$ represents a nonlinear activation function such as a sigmoid function. In the above example, $i$ equals 1 and $j$ equals 2, 3; $k$ is any integer from 1 to $n_i^k$, where $n_i^k$ represents the number of neighbor samples of the $k$-th neighbor relation graph of sample $i$.
The final sample features are calculated as follows:

$$g_i^k = \sum_{j=1}^{n_i^k} \alpha_{ij}\, g_j$$

wherein $g_i^k$ represents the recoded feature of the $i$-th sample feature under the $k$-th neighbor relation graph.

Thus, according to the different neighbor relation graphs, one sample feature $i$ can be recoded into a plurality of features, recorded as $g_i^1, g_i^2, \dots, g_i^N$, where $N$ represents the $N$ neighbor relation graphs.
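A minimal sketch of this recoding step, assuming PyTorch and the weight form reconstructed above (sigmoid of the Euclidean distance); the normalization over neighbors is an added assumption, and all names are illustrative.

```python
import torch

def recode(g_i: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Aggregate neighbor features of one neighbor relation graph into sample i."""
    # neighbors: (n_k, dim) feature encodings of the neighbor samples in the
    # k-th neighbor relation graph of sample i; g_i: (dim,) node feature.
    dist = torch.norm(neighbors - g_i, dim=1)        # Euclidean distance d(g_i, g_j)
    alpha = torch.sigmoid(dist)                      # nonlinear activation -> weights
    alpha = alpha / alpha.sum()                      # normalize over neighbors (assumed)
    return (alpha.unsqueeze(1) * neighbors).sum(0)   # recoded feature g_i^k
```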
4) Secondarily aggregate the recoded features of each sample.
Referring to fig. 5, fig. 5 is a schematic diagram of secondary aggregation according to an embodiment of the present application.
Wherein sample 1 can be recoded into a plurality of features according to the neighbor relation graphs, recorded as $g_1^1, g_1^2, \dots, g_1^N$, where $N$ represents the $N$ neighbor relation graphs. This step reflects that not all neighbor relation graphs are equally important for sample 1, so the recoded features must be screened by secondary aggregation. The formulas are as follows:

First, the weights of the recoded features of sample $i$ are calculated; as shown in fig. 5, $i$ represents the $i$-th sample and $N$ represents the total of $N$ neighbor relation graphs of sample $i$:

$$\beta_i^j = \mathrm{softmax}_j\Big(q^{\top}\,\mathrm{ReLU}\big(W g_i^j\big)\Big)$$

wherein $\beta_i^j$ represents the weight of the $j$-th recoded feature of sample $i$; $q$ is a trainable vector whose values are randomly initialized during training; $W$ is a trainable matrix whose values are randomly initialized during training; ReLU is a nonlinear activation function layer; $g_i^j$ is the $j$-th neighbor (recoded) feature of sample $i$.

The secondary aggregation feature of the final sample $i$ is:

$$t_i = \sum_{j=1}^{N} \beta_i^j\, g_i^j$$

wherein $t_i$ is the multi-structure text encoding feature.
5) Traverse all samples in sequence; the secondary aggregation features of all samples are the final text encoding features of the samples.
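A tiny numeric walk-through of the two secondary-aggregation formulas above, assuming PyTorch; $W$ and $q$ are the randomly initialized trainable parameters described, and the softmax normalization follows the reconstruction above.

```python
import torch

N, dim = 3, 8                      # 3 neighbor relation graphs, toy dimension
recoded = torch.randn(N, dim)      # g_i^1 ... g_i^N from the recoding layer
W = torch.randn(dim, dim, requires_grad=True)   # trainable matrix
q = torch.randn(dim, requires_grad=True)        # trainable vector

beta = torch.softmax(torch.relu(recoded @ W.T) @ q, dim=0)  # weights beta_i^j
t_i = (beta.unsqueeze(1) * recoded).sum(dim=0)              # final text feature t_i
```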
Referring to fig. 6, fig. 6 is a schematic structural diagram of an image encoder according to an embodiment of the present application.
Further, the image encoder may extract the features of each sequentially input image using a ResNet backbone network, taking the features of the ResNet layer immediately before the classification layer as the feature of each image, recorded as $e_i$, where $i$ denotes sequence image $i$.
Then, this embodiment also provides a novel attention structure. Because the images are sequential images whose importance differs, this embodiment designs the attention structure to screen the importance of the image sequence, so that the BiLSTM can concentrate on useful information.

The architecture of the attention network comprises two fully connected layers (FC) and one ReLU layer. In this embodiment, the image feature passes through the backbone network to obtain an embedded feature, and the embedded feature passes through a fully connected layer to obtain the final embedded feature $e$ of each image. From the final embedded feature $e$, the attention structure calculates the weight of each feature; the weight is a scalar, normalized by a sigmoid layer.

The weights of the features of all image sequences are then jointly normalized by a softmax (normalized exponential function) layer to determine which images are important. Finally, the feature weight of each image after the softmax layer is multiplied by the corresponding embedded feature.

Meanwhile, the idea of a residual network is introduced; the output of the attention structure for each image is shown in the following formula:

$$x_i = e_i + \mathrm{Attention}(e_i)\cdot e_i$$

wherein Attention is a multi-layer attention structure.
Finally, the image features are input into a BiLSTM network to obtain the overall features of the sequence image group. The formulas are as follows:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}\big(x_i,\ \overrightarrow{h_{i-1}}\big), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}\big(x_i,\ \overleftarrow{h_{i+1}}\big)$$

The image sequence is encoded in both forward and reverse order, which implies temporal semantic information; it is encoded with the above formulas. Wherein $\overrightarrow{\mathrm{LSTM}}$ denotes the forward long short-term memory network computation, $\overleftarrow{\mathrm{LSTM}}$ denotes the backward long short-term memory network computation, and $I$ is the number of units of the long short-term memory network. BiLSTM (bidirectional long short-term memory network computation) represents each cell of the BiLSTM network; $h_i$ represents the output of the $i$-th BiLSTM cell; $x_i$ represents the input feature of the $i$-th image after the attention network of this embodiment.

The average of the feature encoding outputs of the BiLSTM units is taken as the feature output of the whole sequence image group, as follows:

$$v = \frac{1}{I}\sum_{i=1}^{I} h_i$$

wherein $v$ represents the feature output of the sequence image group, used for the subsequent retrieval, and $I$ is the number of units of the long short-term memory network.
All samples are traversed in turn to obtain the sequence image group feature output $v_i$ of every sample, where $i$ represents the $i$-th sample.
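For illustration only, the following sketch of the sequence-image encoder follows the description above (ResNet backbone, embedded feature, two-FC-plus-ReLU attention with sigmoid weight, softmax across the sequence, residual multiply, BiLSTM, mean pooling), assuming PyTorch/torchvision; the exact sizes and the choice of ResNet-50 are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SequenceImageEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()        # take features before the classification layer
        self.backbone = backbone
        self.embed = nn.Linear(2048, dim)  # final embedded feature e of each image
        self.attn = nn.Sequential(         # two fully connected layers and one ReLU layer
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1),
        )
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (I, 3, H, W) -- one sequence image group.
        e = self.embed(self.backbone(images))   # (I, dim) embedded features
        w = torch.sigmoid(self.attn(e))         # scalar weight per image (sigmoid)
        w = torch.softmax(w, dim=0)             # softmax across the sequence
        x = e + w * e                           # residual attention output x_i
        h, _ = self.bilstm(x.unsqueeze(0))      # (1, I, dim) BiLSTM cell outputs
        return h.mean(dim=1).squeeze(0)         # sequence group feature v (mean over cells)
```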
As above, each sample comprises one multi-structure text encoding feature $t_i$ and one image sequence encoding feature $v_i$, where $i$ represents the $i$-th sample, $i \in \{1, \dots, M\}$, and $M$ represents the total of $M$ training samples in the dataset.
A new alignment loss function is constructed as follows.
The method comprises the following steps:
step 1, sampling a batch of data from input samples for calculation, and supposing that H samples are selected, each sample contains corresponding multi-structure text coding features and image sequence coding features.
Step 2, establish the positive sample group and the negative sample group corresponding to each sample; suppose the positive and negative sample groups are established for the i-th sample.
Step 3, traverse all M samples; using all attribute connections of the i-th sample (according to the connection relations), obtain the neighbor nodes connected to the sample; all neighbor sample features connected to the node form the positive sample group. A feature of this embodiment is that a plurality of neighbor relation graphs can be established through different attribute paths, and all neighbor samples form the positive sample group $T_i^{+}$ (for text features; $V_i^{+}$ for image sequence features). Each sample has its own positive sample group.
Step 4, the negative sample group is established by collecting all neighbor samples of the remaining H-1 nodes in the training batch to form a feature set $T_i^{-}$ (respectively $V_i^{-}$), recorded as the negative sample group of sample i.
Step 5, traverse each node feature in turn, H times in total, and obtain the positive sample group and the negative sample group of each node.
Step 6, calculating the loss using the following function:

$$L = \sum_{h=1}^{H}\Big[\,d(v_h, t_h) + d\big(v_h, \overline{T_h^{+}}\big) + \max\big(0,\ \varepsilon - d(v_h, T_h^{-})\big)\Big] + \sum_{h=1}^{H}\Big[\,d(t_h, v_h) + d\big(t_h, \overline{V_h^{+}}\big) + \max\big(0,\ \varepsilon - d(t_h, V_h^{-})\big)\Big]$$

where $d(\cdot,\cdot)$ represents the distance; $T_h^{+}$ represents all text encoding features in the positive sample group of the $h$-th sample and $T_h^{-}$ all text encoding features in its negative sample group (an overline denotes the group average); $L$ is the loss of the multi-structure graph-text mutual retrieval; $H$ is the total number of samples input in this training step and $h$ represents the serial number of each sample (in the formula, $i$ equals $h$); $t_h$ represents the multi-structure text encoding feature of the $h$-th sample and $v_h$ its image sequence encoding feature; $\varepsilon$ represents a preset minimum constant (margin). $V_h^{+}$ and $V_h^{-}$ represent all image sequence encoding features in the positive and negative sample groups of the $h$-th sample, respectively.
It can be seen that the above loss comprises a text-to-image alignment loss function and an image-to-text alignment loss function, and each of the two alignment loss functions contains 3 terms.

Taking the image-to-text alignment loss function as an example:

The first term is the ordinary alignment loss: each image encoding feature is traversed and associated with its absolutely corresponding text feature through the distance constraint $d(\cdot,\cdot)$, so that the image feature within each sample is closest to its corresponding text encoding feature.

The second term is the positive-sample contrast loss expanded by the attribute path constraint. In the first step, the absolutely corresponding text feature of each image encoding feature is found. In the second step, the positive sample group is constructed from the neighbor node features connected to that absolutely corresponding text feature. A plurality of neighbor relation graphs can be established through different attribute paths, and all neighbor samples form the positive sample group $T^{+}$; each sample has its own positive sample group. For the features of the positive sample group, 2 constraint modes are designed in this embodiment: a) calculate the average of all features of the positive sample group, then constrain the distance between the image encoding feature and this average feature; b) constrain the maximum of the distances between the image encoding feature and all features of the positive sample group.

The third term is the negative-sample contrast loss expanded by the attribute path constraint. Suppose H samples, called a batch, are input for each training step. For each image encoding feature, the text encoding features of the other H-1 nodes that do not correspond to it, together with all their neighbor samples, form a feature set $T^{-}$, recorded as the negative sample group of sample i. The distance between the features of the negative sample group and the image encoding feature is constrained to be no smaller than the margin, pushing them apart.

The above 3 terms are added together to form the image-to-text alignment loss function; the text-to-image alignment loss function is constructed symmetrically.
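A minimal sketch of the image-to-text half of the loss under the assumed form reconstructed above (term 1: absolute pair; term 2: positive-group average, constraint mode a; term 3: margin hinge on negatives); the symmetric text-to-image half is analogous. PyTorch; the margin value and all names are illustrative assumptions.

```python
import torch

def img_to_text_loss(v, t, pos_sets, neg_sets, eps: float = 0.3):
    """v, t: (H, dim) image / text encoding features of one batch.
    pos_sets[h]: (P_h, dim) positive text features of sample h;
    neg_sets[h]: (N_h, dim) negative text features of sample h."""
    H = v.size(0)
    loss = v.new_zeros(())
    for h in range(H):
        d_pair = torch.norm(v[h] - t[h])                    # term 1: corresponding pair
        d_pos = torch.norm(v[h] - pos_sets[h].mean(dim=0))  # term 2: positive group mean
        d_neg = torch.norm(v[h].unsqueeze(0) - neg_sets[h], dim=1)
        hinge = torch.clamp(eps - d_neg, min=0).mean()      # term 3: push negatives apart
        loss = loss + d_pair + d_pos + hinge
    return loss / H
```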
Wherein the training process may include:
step 1, construct a novel multi-structure image text retrieval network, comprising a text information feature encoder and an image sequence feature encoder.
And 2, establishing a loss function.
And step 3, training the network according to the loss function so as to enable the network to be converged.
Further, the network training process is as follows: the training process of the neural network is divided into two phases. The first phase is the phase in which data propagates from the low level to the high level, i.e., the forward propagation phase. The other phase is the phase in which the error propagates from the high level back to the bottom layer, i.e., the back propagation phase, entered when the result of forward propagation does not match the expected result. The training process is as follows:
1. Initialize all network layer weights, generally by random initialization.
2. Propagate the input image and text data forward through each layer of the neural network (convolution layers, downsampling layers, fully connected layers, and so on) to obtain an output value.
3. Compute the error between the network output value and the target value according to the loss function formula.
4. Propagate the error back through the network, obtaining in turn the back propagation error of each layer: the neural network layers, fully connected layers, convolution layers, and so on.
5. Adjust all weight coefficients in the network according to the back propagation errors of the layers, i.e., update the weights.
6. Randomly select a new batch of image-text data and return to step 2; forward propagation through the network yields a new output value.
7. Iterate repeatedly; training ends when the error between the network output value and the target value (label) is smaller than a certain threshold, or when the number of iterations exceeds a certain threshold.
8. Save the trained network parameters of all layers.
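A minimal sketch of steps 1 to 8, assuming PyTorch; the Adam optimizer is swapped in for the unspecified weight-update rule, and the names (train, loss_fn, data_loader, tol) are illustrative assumptions rather than the embodiment's own.

```python
import torch

def train(network, loss_fn, data_loader, epochs=10, lr=1e-4, tol=1e-3):
    # Step 1: layer weights are randomly initialized when `network` is constructed.
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)

    for epoch in range(epochs):                          # step 7: iterate until a stop condition
        for images, texts in data_loader:                # step 6: a fresh random batch each time
            img_feat, txt_feat = network(images, texts)  # step 2: forward propagation
            loss = loss_fn(img_feat, txt_feat)           # step 3: error from the loss formula
            optimizer.zero_grad()
            loss.backward()                              # step 4: back-propagate errors layer by layer
            optimizer.step()                             # step 5: update all weight coefficients
            if loss.item() < tol:                        # step 7: error below threshold -> stop
                torch.save(network.state_dict(), "retrieval_net.pt")  # step 8: save parameters
                return network
    torch.save(network.state_dict(), "retrieval_net.pt")               # step 8: save parameters
    return network
```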
The following briefly describes the network reasoning process, i.e., the retrieval matching process:
This embodiment provides a novel multi-structure text-image mutual retrieval method. In the reasoning process, the weight coefficients trained by the network are preloaded, and features are extracted from the recipe texts or sequence images.
The extracted features are stored in the data set to be retrieved.
The user supplies arbitrary recipe text data or sequence image data, called the query data.
Features are extracted from the recipe text data or sequence image data of the query data using the novel image text retrieval network.
The features of the query data are distance-matched against all sample features in the data set to be retrieved, i.e., vector distances are computed; this embodiment uses the Euclidean distance.
For example, if the query data is recipe text data, the distance is computed against all recipe video features in the data set to be retrieved, since the counterpart modality data is recipe video data.
The Euclidean distance to all recipe video features in the data set to be retrieved is computed; the sample with the minimum distance is the recommended sample and is output.
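A minimal sketch of this distance-matching step, assuming the query and gallery features have already been extracted by the trained encoders; the function and argument names are illustrative.

```python
import torch

def retrieve(query_feature, gallery_features):
    """Nearest-neighbor matching by Euclidean distance, as described above.

    query_feature:    (D,) feature of the query recipe text (or image sequence).
    gallery_features: (M, D) features of all samples of the other modality
                      in the data set to be retrieved.
    Returns the index of the recommended (minimum-distance) sample.
    """
    dists = torch.norm(gallery_features - query_feature, dim=-1)  # Euclidean distances
    return torch.argmin(dists).item()
```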
It can be seen that the text information feature encoder includes a text encoding layer, an attribute path establishment layer, a recoding layer, and a recoding feature secondary aggregation layer; an initial image text retrieval network capable of processing multi-structure text data is constructed on this basis and then trained, yielding an image text retrieval network that can process multi-structure text data, which improves the retrieval effect on multi-modal data and the reasoning accuracy.
The following describes a training device for a cross-modal mutual retrieval neural network provided by an embodiment of the present application; the training device described below and the training method for a cross-modal mutual retrieval neural network described above may be referred to correspondingly.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a training device for a cross-modal mutual retrieval neural network according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a text encoding module 11 for constructing a text information feature encoder and an image sequence feature encoder; wherein the text information feature encoder comprises: the device comprises a text coding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer;
the image coding module 12 is used for constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
a loss function construction module 13 for constructing an aligned loss function based on the positive and negative sample groups of each sample;
the model training module 14 is configured to train the initial image text retrieval network based on the alignment loss function and the training data to obtain a multi-modal image text retrieval network.
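As a rough illustration of how these four modules might be composed, the following sketch treats each module as a callable; every name here is an assumption, since the embodiment defines the modules only functionally.

```python
class CrossModalTrainingDevice:
    """Illustrative composition of modules 11-14; all names are assumptions."""

    def __init__(self, build_encoders, build_retrieval_network, build_loss, train):
        self.build_encoders = build_encoders                    # module 11: text + image encoders
        self.build_retrieval_network = build_retrieval_network  # module 12: initial retrieval network
        self.build_loss = build_loss                            # module 13: alignment loss function
        self.train = train                                      # module 14: model training

    def run(self, training_data):
        text_encoder, image_encoder = self.build_encoders()
        network = self.build_retrieval_network(text_encoder, image_encoder)
        loss_fn = self.build_loss()
        return self.train(network, loss_fn, training_data)
```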
The image text retrieval device provided by the embodiment of the application is introduced below, and the image text retrieval device described below and the cross-modal mutual retrieval method described above can be referred to correspondingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an image text retrieval device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
the text data processing module 21 is configured to perform feature encoding on the text information based on the text information feature encoder of the multi-mode image text retrieval network to obtain corresponding text encoding features when multi-structure text data is input; the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding feature secondary aggregation layer;
the image data processing module 22 is configured to perform feature encoding on the image data based on the image sequence feature encoder of the multi-mode image text retrieval network when the image data is input, so as to obtain corresponding image encoding features;
the feature reasoning module 23 is configured to infer the text coding feature or the image coding feature through an output layer of the multi-mode image text retrieval network, so as to obtain a retrieval result.
The present application further provides a server, please refer to fig. 9, fig. 9 is a schematic structural diagram of a server provided in an embodiment of the present application, and the server may include:
a memory for storing a computer program;
and a processor for implementing the steps of any of the above methods for training a cross-modal mutual retrieval neural network when executing the computer program.
As shown in fig. 9, which is a schematic structural diagram of a server, the server may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 all complete communication with each other through a communication bus 13.
In the present embodiment, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA), or another programmable logic device, etc.
Processor 10 may call a program stored in memory 11; in particular, processor 10 may perform the operations in an embodiment of the cross-modal mutual retrieval neural network training method.
The memory 11 is used for storing one or more programs, and the programs may include program codes, where the program codes include computer operation instructions, and in this embodiment, at least the programs for implementing the following functions are stored in the memory 11:
constructing a text information feature encoder and an image sequence feature encoder; wherein the text information feature encoder comprises: the device comprises a text coding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer;
constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
constructing an alignment loss function based on the positive and negative sample sets of each sample;
training the initial image text retrieval network based on the alignment loss function and the training data to obtain the multi-mode image text retrieval network.
In one possible implementation, the memory 11 may include a storage program area and a storage data area, where the storage program area may store an operating system, and at least one application program required for functions, etc.; the storage data area may store data created during use.
In addition, the memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for interfacing with other devices or systems.
Of course, it should be noted that the structure shown in fig. 9 does not limit the server in the embodiment of the present application, and the server may include more or fewer components than those shown in fig. 9 or may combine some components in practical applications.
The application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above methods for training a cross-modal mutual retrieval neural network and/or the steps of any of the above cross-modal mutual retrieval methods.
The computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
For the description of the computer-readable storage medium provided in the present application, reference is made to the above method embodiments, and the description is omitted herein.
In the description, each embodiment is described in a progressive manner, and each embodiment focuses on its differences from the other embodiments, so identical or similar parts among the embodiments may be referred to each other. The device disclosed in an embodiment corresponds to the method disclosed in that embodiment, so its description is relatively brief; for relevant details, refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above describes in detail a method for training a cross-modal mutual retrieval neural network, a cross-modal mutual retrieval method, another method for training a cross-modal mutual retrieval neural network, two further cross-modal mutual retrieval methods, a training device for a cross-modal mutual retrieval neural network, an image text retrieval device, a server, and a computer-readable storage medium. Specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the description of the examples above is only intended to assist in understanding the methods of the present application and their core ideas. It should be noted that various improvements and modifications can be made to the present application by those skilled in the art without departing from the principles of the present application, and such improvements and modifications fall within the scope of the claims of the present application.

Claims (21)

1. A method for training a cross-modal mutual retrieval neural network, characterized by comprising the following steps:
constructing a text information feature encoder and an image sequence feature encoder;
constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
constructing an alignment loss function based on the positive and negative sample sets of each sample; the sample is a multi-structure text coding feature processed by a text information feature encoder and an image sequence coding feature processed by an image sequence feature encoder;
training the initial image text retrieval network based on the alignment loss function and training data to obtain a multi-mode image text retrieval network.
2. The training method of claim 1, wherein the text information feature encoder comprises: the device comprises a text coding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer.
3. The training method according to claim 2, wherein the text encoding layer is configured to perform feature encoding on the input multi-structure text data to obtain feature vectors of each word, and process all feature vectors of the multi-structure text data through an attention network to obtain and output feature codes of the multi-structure text data to the attribute path establishment layer;
The attribute path establishment layer is used for carrying out attribute connection on feature codes of the multi-structure text data of all samples based on attribute information of all samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to the recoding layer;
the recoding layer is used for aggregating the feature codes of the sub-samples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to the recoding feature secondary aggregation layer;
and the recoding feature secondary aggregation layer is used for carrying out secondary aggregation on all recoding features based on the weight of each recoding feature to obtain text coding features of corresponding samples.
4. The training method of claim 2, wherein the text information feature encoder further comprises: and the sample traversing unit is used for traversing all the samples to obtain text coding features corresponding to each sample.
5. The training method according to claim 1, wherein the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit, and an image sequence overall feature extraction unit.
6. The training method of claim 1, wherein constructing an alignment loss function based on the positive and negative sample sets for each sample comprises:
Determining a positive sample set and a negative sample set for each sample based on attribute connections between the corresponding sample and other samples;
an alignment loss function is constructed based on the positive and negative sample sets.
7. The training method of claim 1, wherein constructing an alignment loss function based on the positive and negative sample sets for each sample comprises:
determining a positive sample set and a negative sample set for each sample based on attribute connections between the corresponding sample and other samples;
constructing a text feature to image feature alignment loss function and an image feature to text feature alignment loss function based on the positive and negative sample sets of each sample; wherein, any one alignment loss function includes: an alignment loss function of the corresponding feature, a contrast loss function of the positive sample group expanded with the attribute path constraint, and a contrast loss function of the negative sample expanded with the attribute path constraint;
and taking the sum of the alignment loss function from the text feature to the image feature and the loss function from the image feature to the text feature as the alignment loss function.
8. A cross-modal mutual retrieval method, comprising:
When multi-structure text data are input, a text information feature encoder based on a multi-mode image text retrieval network performs feature encoding on the text information to obtain corresponding text encoding features;
when the image data is input, an image sequence feature encoder based on a multi-mode image text retrieval network performs feature encoding on the image data to obtain corresponding image encoding features;
and reasoning the text coding features or the image coding features through an output layer of the multi-mode image text retrieval network to obtain retrieval results.
9. The cross-modal mutual retrieval method of claim 8, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer, and a recoding feature secondary aggregation layer.
10. The cross-modal mutual retrieval method of claim 9, wherein, when multi-structure text data is input, performing feature encoding on the text information by the text information feature encoder of the multi-mode image text retrieval network to obtain corresponding text encoding features comprises:
performing feature coding on the multi-structure text data to obtain feature vectors of each word, processing all the feature vectors of the multi-structure text data through an attention network to obtain and output the feature codes of the multi-structure text data to an attribute path establishment layer;
Performing attribute connection on feature codes of the multi-structure text data of all samples based on the attribute information of all samples to obtain and output a plurality of neighbor relation graphs of the corresponding samples to a recoding layer;
aggregating the feature codes of the subsamples into the feature codes of the main samples based on each neighbor relation graph to obtain and output recoding features of each neighbor relation graph to a recoding feature secondary aggregation layer;
and performing secondary aggregation on all recoding features based on the weight of each recoding feature to obtain text coding features of corresponding samples.
11. The cross-modal mutual retrieval method as claimed in claim 8, wherein the image sequence feature encoder comprises a feature extraction unit, an image sequence screening unit, and an image sequence overall feature extraction unit.
12. A method for training a cross-modal mutual retrieval neural network, characterized by comprising the following steps:
the client sends a network training instruction to the server so that the server constructs a text information feature encoder and an image sequence feature encoder; constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network; constructing an alignment loss function based on the positive and negative sample sets of each sample; training the initial image text retrieval network based on the alignment loss function and training data to obtain a multi-mode image text retrieval network; transmitting the multi-modal image text retrieval network;
The client receives the multi-mode image text retrieval network and displays a training completion message.
13. The training method of claim 12, wherein the text information feature encoder comprises: the device comprises a text coding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer.
14. A cross-modal mutual retrieval method, comprising:
the method comprises the steps that a client inputs data to be retrieved to a server, so that when multi-structure text data are input by the server, a text information feature encoder based on a multi-mode image text retrieval network performs feature encoding on the text information to obtain corresponding text encoding features; when the image data is input, an image sequence feature encoder based on a multi-mode image text retrieval network performs feature encoding on the image data to obtain corresponding image encoding features; reasoning the text coding features or the image coding features through an output layer of the multi-mode image text retrieval network to obtain and send retrieval results;
and the client receives the search result and displays the search result.
15. The cross-modal mutual retrieval method of claim 14, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer, and a recoding feature secondary aggregation layer.
16. A cross-modal mutual retrieval method, comprising:
the server receives data to be retrieved input by the client;
when multi-structure text data are input, a text information feature encoder based on a multi-mode image text retrieval network performs feature encoding on the text information to obtain corresponding text encoding features;
when the image data is input, an image sequence feature encoder based on a multi-mode image text retrieval network performs feature encoding on the image data to obtain corresponding image encoding features;
and reasoning the text coding features or the image coding features through an output layer of the multi-mode image text retrieval network to obtain retrieval results.
17. The cross-modal mutual retrieval method as claimed in claim 16, wherein the text information feature encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer, and a recoding feature secondary aggregation layer;
and sending the search result to the client so that the client displays the search result.
18. A cross-modal mutual retrieval neural network training device, comprising:
the text coding module is used for constructing a text information feature coder and an image sequence feature coder; wherein the text information feature encoder comprises: the device comprises a text coding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer;
The image coding module is used for constructing a retrieval network based on the text information feature encoder and the image sequence feature encoder to obtain an initial image text retrieval network;
a loss function construction module for constructing an aligned loss function based on the positive and negative sample sets of each sample;
and the model training module is used for training the initial image text retrieval network based on the alignment loss function and training data to obtain a multi-mode image text retrieval network.
19. An image text retrieval apparatus, comprising:
the text data processing module is used for carrying out feature coding on the text information based on a text information feature coder of the multi-mode image text retrieval network when the multi-structure text data is input, so as to obtain corresponding text coding features; the text information characteristic encoder comprises a text encoding layer, an attribute path establishment layer, a recoding layer and a recoding characteristic secondary aggregation layer;
the image data processing module is used for carrying out feature coding on the image data based on an image sequence feature coder of the multi-mode image text retrieval network when the image data is input, so as to obtain corresponding image coding features;
And the characteristic reasoning module is used for reasoning the text coding characteristics or the image coding characteristics through an output layer of the multi-mode image text retrieval network to obtain a retrieval result.
20. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for training a cross-modal mutual retrieval neural network as claimed in any one of claims 1 to 7, 12, 13, or the steps of the cross-modal mutual retrieval method as claimed in any one of claims 8 to 11, 14 to 17, when executing the computer program.
21. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the method for training a cross-modal mutual retrieval neural network according to any one of claims 1 to 7, 12, 13, or the steps of the cross-modal mutual retrieval method according to any one of claims 8 to 11, 14 to 17.
CN202310324164.9A 2023-03-30 2023-03-30 Cross-modal mutual retrieval method, device, server and storage medium Active CN116049459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310324164.9A CN116049459B (en) 2023-03-30 2023-03-30 Cross-modal mutual retrieval method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310324164.9A CN116049459B (en) 2023-03-30 2023-03-30 Cross-modal mutual retrieval method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN116049459A true CN116049459A (en) 2023-05-02
CN116049459B CN116049459B (en) 2023-07-14

Family

ID=86129886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310324164.9A Active CN116049459B (en) 2023-03-30 2023-03-30 Cross-modal mutual retrieval method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN116049459B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299216A (en) * 2018-10-29 2019-02-01 Shandong Normal University Cross-modal hash retrieval method and system fusing supervision information
WO2022068195A1 (en) * 2020-09-30 2022-04-07 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN114707007A (en) * 2022-06-07 2022-07-05 苏州大学 Image text retrieval method and device and computer storage medium
CN114896373A (en) * 2022-07-15 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual inspection model training method and device, image-text mutual inspection method and equipment
CN114969405A (en) * 2022-04-30 2022-08-30 苏州浪潮智能科技有限公司 Cross-modal image-text mutual inspection method
WO2022250745A1 (en) * 2021-05-26 2022-12-01 Salesforce.Com, Inc. Systems and methods for vision-and-language representation learning
CN115438215A (en) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Image-text bidirectional search and matching model training method, device, equipment and medium
CN115438225A (en) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Video text mutual inspection method and model training method, device, equipment and medium thereof
CN115455228A (en) * 2022-11-08 2022-12-09 苏州浪潮智能科技有限公司 Multi-mode data mutual detection method, device, equipment and readable storage medium
CN115658955A (en) * 2022-11-08 2023-01-31 苏州浪潮智能科技有限公司 Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN115759183A (en) * 2023-01-06 2023-03-07 浪潮电子信息产业股份有限公司 Related method and related device for multi-structure text graph neural network
CN115858848A (en) * 2023-02-27 2023-03-28 浪潮电子信息产业股份有限公司 Image-text mutual inspection method and device, training method and device, server and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENG CUI: "Cross-modal alignment with graph reasoning for image-text retrieval", SPRINGER LINK, pages 23615 - 23632 *
XU FENG; MA XIAOPING; LIU LIBO: "Cross-modal retrieval method for thyroid ultrasound image and text based on generative adversarial network", JOURNAL OF BIOMEDICAL ENGINEERING, no. 04, pages 1 - 11 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226434A (en) * 2023-05-04 2023-06-06 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116226319A (en) * 2023-05-10 2023-06-06 浪潮电子信息产业股份有限公司 Hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116226319B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 Hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning

Also Published As

Publication number Publication date
CN116049459B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN116049459B (en) Cross-modal mutual retrieval method, device, server and storage medium
CN112784092B (en) Cross-modal image text retrieval method of hybrid fusion model
CN112035672B (en) Knowledge graph completion method, device, equipment and storage medium
CN111079532B (en) Video content description method based on text self-encoder
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN109614471B (en) Open type problem automatic generation method based on generation type countermeasure network
CN112119411A (en) System and method for integrating statistical models of different data modalities
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN114896429B (en) Image-text mutual inspection method, system, equipment and computer readable storage medium
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN114911958B (en) Semantic preference-based rapid image retrieval method
WO2022126448A1 (en) Neural architecture search method and system based on evolutionary learning
Chen et al. Binarized neural architecture search for efficient object recognition
CN117475038B (en) Image generation method, device, equipment and computer readable storage medium
CN115455228A (en) Multi-mode data mutual detection method, device, equipment and readable storage medium
CN115858848A (en) Image-text mutual inspection method and device, training method and device, server and medium
CN115438169A (en) Text and video mutual inspection method, device, equipment and storage medium
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN110347853B (en) Image hash code generation method based on recurrent neural network
CN112528062A (en) Cross-modal weapon retrieval method and system
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN114298290A (en) Neural network coding method and coder based on self-supervision learning
CN110659962B (en) Commodity information output method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant