CN115858848B - Image-text mutual inspection method and device, training method and device, server and medium - Google Patents


Info

Publication number
CN115858848B
Authority
CN
China
Prior art keywords
image
text
sample
network
constructing
Prior art date
Legal status
Active
Application number
CN202310166849.5A
Other languages
Chinese (zh)
Other versions
CN115858848A (en)
Inventor
赵坤
王立
李仁刚
赵雅倩
范宝余
鲁璐
郭振华
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202310166849.5A
Publication of CN115858848A
Application granted
Publication of CN115858848B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an image-text mutual inspection method and device, a training method and device, a server and a medium, relating to the technical field of data processing. The training method comprises the following steps: constructing an image multi-connection feature encoder and a text feature encoder, wherein the image multi-connection feature encoder comprises an image classification network, an image detection network and a graph structure construction network; performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network; constructing an image-text retrieval loss function; and training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain the image-text mutual inspection network. Both the effect of processing multi-modal data and the reasoning accuracy are improved.

Description

Image-text mutual inspection method and device, training method and device, server and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a training method for an image-text mutual inspection network, an image-text mutual inspection method, a training device for an image-text mutual inspection network, an image-text mutual inspection device for an image-text mutual inspection network, a server, and a computer readable storage medium.
Background
With the continuous development of information technology, artificial intelligence technology can be applied in more and more fields to improve the efficiency and effect of processing data. In the field of recognition of text data and image data, a corresponding model can be adopted for recognition to obtain a regression result or a classification result.
In the related art, the multi-modal field requires mutual retrieval tasks over multi-modal data consisting of text and image sequences, where multi-modal refers to data comprising both text and image sequences. Commonly adopted retrieval networks cannot effectively process the image sequences in multi-modal data, which reduces the retrieval effect on multi-modal data and leads to low reasoning accuracy.
Therefore, how to improve the effect of processing multi-modal data and to improve the accuracy of reasoning are important issues for those skilled in the art.
Disclosure of Invention
The application aims to provide a training method for an image-text mutual inspection network, an image-text mutual inspection method, another training method for an image-text mutual inspection network, another two image-text mutual inspection methods, a training device for an image-text mutual inspection network, an image-text mutual inspection device, a server and a computer readable storage medium, so as to improve the effect of processing multi-modal data and improve the reasoning accuracy.
In order to solve the technical problems, the application provides a training method of an image-text mutual inspection network, which comprises the following steps:
constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
constructing an image-text retrieval loss function;
and training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain an image-text mutual inspection network.
Optionally, the image classification network is configured to extract features of an input image as the overall features of the image;
the image detection network is configured to extract the main component detection frames and component information of the image, and to take the component information as the component features;
the graph structure construction network is configured to construct a graph structure based on the overall features and the component features of each image, obtaining graph structure features that link the image features to the component information.
Optionally, the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer.
Optionally, the node feature extraction layer is configured to perform feature encoding on the text information of the multi-structure text to obtain the feature encoding corresponding to each sample;
the connection relation construction layer is configured to take each sample as a node and to construct connection relations between nodes based on the semantic information of each node;
the graph construction layer is configured to construct a graph neural network over the nodes based on the connection relations between them;
and the neighbor relation construction layer is configured to perform a weighted calculation on the edges of the graph neural network based on the number of connections between nodes, obtaining the corresponding node features.
Optionally, constructing the image-text retrieval loss function includes:
constructing a first loss function with the goal of making the distance between a sample and its positive sample smaller and smaller;
constructing a second loss function with the goal of making the distance between a sample and its negative sample larger and larger;
constructing a third loss function with the goal of making the distance between a text sample and its corresponding text most-similar sample larger and larger;
and combining the first loss function, the second loss function and the third loss function into the image-text retrieval loss function.
The application also provides an image-text mutual detection method of the image-text mutual detection network, which comprises the following steps:
when an image is input, an image multi-connection feature encoder based on an image-text mutual detection network performs feature extraction on the image to obtain an image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
when text information is input, performing feature coding on the text information based on a text feature coder of the image-text mutual inspection network to obtain corresponding text coding features;
and searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain a search result.
Optionally, when an image is input, an image multi-connection feature encoder based on an image-text mutual inspection network performs feature extraction on the image to obtain an image multi-connection feature, including:
when the input is an image, extracting the features of the input image as the overall features of the image;
extracting the main component detection frames and component information of the image, and taking the component information as the component features;
and constructing a graph structure based on the overall features and the component features of each image to obtain graph structure features linking the image features to the component information.
Optionally, the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer.
The application also provides a training method of the image-text mutual inspection network, which comprises the following steps:
the client sends a network training instruction to the server, so that the server constructs an image multi-connection feature encoder and a text feature encoder, wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network; performs network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network; constructs an image-text retrieval loss function; trains the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain the image-text mutual inspection network; and sends the trained image-text mutual inspection network to the client;
and the client receives the image-text mutual inspection network and displays a training completion message.
The application also provides an image-text mutual detection method of the image-text mutual detection network, which comprises the following steps:
the method comprises the steps that a client inputs data to be retrieved to a server, so that when an image is input by the server, an image multi-connection feature encoder based on an image-text mutual detection network performs feature extraction on the image to obtain an image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images; when text information is input, performing feature coding on the text information based on a text feature coder of the image-text mutual inspection network to obtain corresponding text coding features; searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain and send a search result;
And the client receives the search result and displays the search result.
The application also provides an image-text mutual detection method of the image-text mutual detection network, which comprises the following steps:
the server receives data to be retrieved input by the client;
when an image is input, an image multi-connection feature encoder based on an image-text mutual detection network performs feature extraction on the image to obtain an image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
when text information is input, performing feature coding on the text information based on a text feature coder of the image-text mutual inspection network to obtain corresponding text coding features;
searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain a search result;
and sending the search result to the client so that the client displays the search result.
The application also provides a training device of the image-text mutual inspection network, which comprises:
the encoder construction module is used for constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
The network construction module is used for carrying out network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
the loss function construction module is used for constructing an image-text retrieval loss function;
and the network training module is used for training the initial image-text mutual detection network based on the image-text retrieval loss function and training data to obtain an image-text mutual detection network.
The application also provides an image-text mutual detection device of the image-text mutual detection network, which comprises:
the image characteristic processing module is used for extracting the characteristics of the image based on the image multi-connection characteristic encoder of the image-text mutual detection network when the image is input, so as to obtain the image multi-connection characteristics; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
the text feature processing module is used for carrying out feature coding on the text information based on a text feature coder of the image-text mutual inspection network when the text information is input, so as to obtain corresponding text coding features;
and the reasoning module is used for searching the image multi-connection features or the text coding features through the output layer of the image-text mutual detection network to obtain a search result.
The application also provides a server, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the training method of the image-text mutual detection network and/or the steps of the image-text mutual detection method of the image-text mutual detection network when executing the computer program.
The application also provides a computer readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the training method of the image-text mutual inspection network and/or the steps of the image-text mutual inspection method of the image-text mutual inspection network as described above.
The application provides a training method of an image-text mutual inspection network, which comprises the following steps: constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images; performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network; constructing an image-text retrieval loss function; and training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain an image-text mutual inspection network.
The image multi-connection feature encoder constructed in this way comprises an image classification network, an image detection network and a graph structure construction network; on this basis an initial image-text mutual inspection network capable of processing multi-structure text data is constructed, and it is finally trained to obtain an image-text retrieval network that processes images more efficiently. This realizes the processing of image data, improves the retrieval effect on multi-modal data, and improves the reasoning accuracy.
The application also provides an image-text mutual inspection method of the image-text mutual inspection network, a training method of another image-text mutual inspection network, image-text mutual inspection methods of another two image-text mutual inspection networks, a training device of the image-text mutual inspection network, an image-text mutual inspection device of the image-text mutual inspection network, a server and a computer readable storage medium, which have the above advantages and are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 2 is a schematic diagram of a sample of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of image encoding of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 4 is a schematic diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 5 is a schematic path diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 6 is a schematic diagram of text encoding of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 7 is a schematic diagram of node connection of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application;
fig. 8 is a schematic diagram of a positive sample of a graph-text mutual detection method of a graph-text mutual detection network according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device of an image-text mutual inspection network according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image-text mutual inspection device of an image-text mutual inspection network according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a training method for an image-text mutual inspection network, an image-text mutual inspection method, another training method for an image-text mutual inspection network, another two image-text mutual inspection methods, a training device for the image-text mutual inspection network, an image-text mutual inspection device for the image-text mutual inspection network, a server and a computer readable storage medium, thereby improving the effect of processing multi-modal data and improving the reasoning accuracy.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the related art, the multi-modal field requires mutual retrieval tasks over multi-modal data consisting of text and image sequences, where multi-modal refers to data comprising both text and image sequences. Commonly adopted retrieval networks cannot effectively process the image sequences in multi-modal data, which reduces the retrieval effect on multi-modal data and leads to low reasoning accuracy.
Therefore, the application provides a training method for an image-text mutual inspection network, which constructs an image multi-connection feature encoder comprising an image classification network, an image detection network and a graph structure construction network, then constructs an initial image-text mutual inspection network capable of processing multi-structure text data, and finally trains it to obtain an image-text retrieval network that processes images more efficiently, thereby realizing the processing of image data, improving the retrieval effect on multi-modal data and improving the reasoning accuracy.
The following describes a training method of an image-text mutual inspection network provided by the application through an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a training method of an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, the method may include:
s101, constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
the method aims at constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images.
Therefore, in the technical scheme of the application, a two-layer image encoder is implemented to extract image features. In general, image features are extracted only through image classification or only through image detection, which easily leaves the feature extraction incomplete and reduces the accuracy of image detection. In this embodiment, the first layer therefore processes the image data with the image classification network and the image detection network in parallel, and the second layer fuses the resulting features through the graph structure construction network, improving the accuracy of image feature extraction.
Further, the image classification network is used for extracting the features of the input image as the overall features of the image;
the image detection network is used for extracting the main component detection frames and component information of the image, and taking the component information as the component features;
and the graph structure construction network is used for constructing a graph structure based on the overall features and the component features of each image, obtaining graph structure features that link the image features to the component information.
Therefore, the technical scheme effectively processes image sequences through the image multi-connection feature encoding, ultimately improving the effect and efficiency of image-text mutual inspection.
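As a minimal illustration of this two-branch design (a sketch only; the class and module names are hypothetical and not taken from the patent), the image multi-connection feature encoder can be assembled as follows in Python/PyTorch:

```python
import torch.nn as nn

class ImageMultiConnectionEncoder(nn.Module):
    """Layer 1 runs a classification network and a detection network in
    parallel; layer 2 fuses their outputs through a graph structure
    construction network. All submodules are assumed to be provided."""
    def __init__(self, cls_net, det_net, graph_net):
        super().__init__()
        self.cls_net = cls_net      # e.g. a ResNet trunk -> overall image feature
        self.det_net = det_net      # e.g. a YOLO-style head -> (component feats, info)
        self.graph_net = graph_net  # builds image-feature-to-component graph features

    def forward(self, image):
        overall = self.cls_net(image)                # overall feature of the image
        comp_feats, comp_info = self.det_net(image)  # per-component features and info
        return self.graph_net(overall, comp_feats, comp_info)
```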
Further, the text feature encoder in this embodiment includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer. A two-layer text feature encoder is built from these four layers: the first layer performs feature extraction on the text information, and the second layer constructs neighbor relations over the extracted features, achieving two-level text feature extraction.
Specifically, the node feature extraction layer performs feature encoding on the text information of the multi-structure text to obtain the feature encoding corresponding to each sample;
the connection relation construction layer takes each sample as a node and constructs connection relations between nodes based on the semantic information of each node;
the graph construction layer builds a graph neural network over the nodes based on the connection relations between them;
and the neighbor relation construction layer performs a weighted calculation on the edges of the graph neural network based on the number of connections between nodes to obtain the corresponding node features.
S102, constructing a network based on an image multi-connection feature encoder and a text feature encoder to obtain an initial image-text mutual inspection network;
On the basis of S101, the method aims at constructing a network based on an image multi-connection feature encoder and a text feature encoder to obtain an initial image-text mutual inspection network.
Further, the network may be constructed by any construction method provided in the prior art, which is not specifically limited herein.
S103, constructing an image-text retrieval loss function;
on the basis of S102, this step aims at constructing a teletext retrieval loss function.
In order to improve the training effect, the process of constructing the image-text retrieval loss function may include:
step 1, for any one image sample, its corresponding multi-structure sample is a positive sample, and a first loss function is constructed with the goal of bringing the image sample and the positive sample closer and closer;
step 2, for any one image sample, an image similarity sample group is established, a most similar sample is selected from the similarity sample group, the most similar sample is called an image most similar sample, and a loss function is constructed by using the image most similar sample, and the steps are as follows:
for any one image sample, there is a positive sample for its corresponding multi-structure sample, which is defined as an anchor sample. All neighbor samples of the anchor sample are obtained, and all samples connected with the anchor sample through a defined path (main material path and process path) are neighbor samples of the anchor sample. And traversing all paths, calculating samples with path links with anchor point samples, and constructing a similar sample group.
And traversing all samples in the similar sample group, calculating the multi-structure text with the maximum connection number with the anchor point samples, marking the multi-structure text as the image most similar sample, and taking the average value of the characteristics of the most similar samples as the image most similar sample if a plurality of most similar samples with the same connection number exist.
Finally, a second loss function is constructed with the goal of making the feature distance between the image sample and the image most-similar sample as large as possible.
Step 3, for any text sample, a similar sample group is established, a most similar sample is selected from the similar sample group, the most similar sample is called a text most similar sample, and a loss function is constructed by using the text most similar sample, and the steps are as follows:
for any one text sample, there is a positive sample for its corresponding image sample, which is defined as an anchor sample. All neighbor samples of the anchor sample are obtained, and all samples connected with the anchor sample through the defined path are neighbor samples of the anchor sample. And traversing all paths, calculating samples with path links with anchor point samples, and constructing a similar sample group.
And traversing all samples in the similar sample group, calculating the image node with the maximum connection number with the anchor point sample, marking the image node as the text most similar sample, and taking the average value of the characteristics of the most similar samples as the text most similar sample if a plurality of most similar samples with the same connection number exist.
Finally, a third loss function is constructed with the aim of having as large a feature distance as possible between the text sample and the most similar text sample.
And step 4, the sum of the first loss function, the second loss function and the third loss function is taken as the image-text retrieval loss function, with the goal of minimizing this sum. Further, the step may include:
step 1, constructing a first loss function with the goal of making the distance between a sample and its positive sample smaller and smaller;
step 2, constructing a second loss function with the goal of making the distance between a sample and its negative sample larger and larger;
step 3, constructing a third loss function with the goal of making the distance between a text sample and its corresponding text most-similar sample larger and larger;
and step 4, combining the first loss function, the second loss function and the third loss function into the image-text retrieval loss function.
It can be seen that this alternative mainly illustrates how the loss function is constructed: a first loss function targets a decreasing distance between a sample and its positive sample; a second loss function targets an increasing distance between a sample and its negative sample; a third loss function targets an increasing distance between a text sample and its corresponding text most-similar sample; and the first, second and third loss functions are combined into the image-text retrieval loss function.
S104, training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain the image-text mutual inspection network.
On the basis of S103, the step aims at training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain the image-text mutual inspection network.
Based on the constructed image-text mutual inspection loss function, images and texts can be retrieved against each other effectively, improving the efficiency and effect of mutual retrieval.
In summary, the constructed image multi-connection feature encoder comprises an image classification network, an image detection network and a graph structure construction network; an initial image-text mutual inspection network capable of processing multi-structure text data is then constructed and trained, yielding an image-text retrieval network that processes images more efficiently. This realizes the processing of image data, improves the retrieval effect on multi-modal data and improves the reasoning accuracy.
The following describes an image-text mutual inspection method of the image-text mutual inspection network provided by the application through another embodiment.
In this embodiment, the method may include:
s201, when an image is input, an image multi-connection feature encoder based on an image-text mutual detection network performs feature extraction on the image to obtain an image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
The method aims at extracting the characteristics of the image by an image multi-connection characteristic encoder based on an image-text mutual inspection network when the image is input, so as to obtain the image multi-connection characteristics; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images.
Further, the step may include:
step 1, when the input is an image, extracting the features of the input image as the overall features of the image;
step 2, extracting a main component detection frame and component information of the image, and taking the component information as component characteristics;
and step 3, constructing a graph structure based on the overall features and the component features of each image to obtain graph structure features linking the image features to the component information.
S202, when text information is input, a text feature encoder based on a graph-text mutual inspection network performs feature encoding on the text information to obtain corresponding text encoding features;
the text feature encoder based on the image-text mutual inspection network performs feature encoding on the text information when the text information is input, so as to obtain corresponding text encoding features.
Wherein the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer.
S203, searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain a search result.
On the basis of S201 and S202, the step aims at searching the image multi-connection features or the text coding features through the output layer of the image-text mutual detection network to obtain a search result.
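As a hedged sketch of this output-layer retrieval step (the function and variable names are illustrative assumptions, not the patent's API), the encoded query can be ranked against the encoded candidates by cosine similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5):
    """Rank gallery features (N, D) against a single query feature (D,) by
    cosine similarity; return the indices of the top_k best matches."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    return sims.topk(top_k).indices

# Image query against a text gallery (or vice versa):
# best = retrieve(image_multi_connection_feat, all_text_encoding_feats)
```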
Therefore, the constructed image multi-connection feature encoder comprises an image classification network, an image detection network and a graph structure construction network; an initial image-text mutual inspection network capable of processing multi-structure text data is then constructed and trained, yielding an image-text retrieval network that processes images more efficiently. This realizes the processing of image data, improves the retrieval effect on multi-modal data and improves the reasoning accuracy.
The following describes a training method of the image-text mutual inspection network provided by the application through another embodiment.
In this embodiment, the method may include:
s301, a client sends a network training instruction to a server so that the server can construct an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images; constructing a network based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network; constructing an image-text retrieval loss function; training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain an image-text mutual inspection network; sending an image-text mutual detection network;
S302, the client receives the image-text mutual inspection network and displays a training completion message.
Therefore, through this embodiment, an image-text mutual inspection network with better performance can be trained on the server, improving the reasoning accuracy.
The following describes an image-text mutual inspection method of the image-text mutual inspection network provided by the application through another embodiment.
In this embodiment, the method may include:
s401, the client inputs data to be retrieved to the server, so that when the server inputs an image, the image multi-connection feature encoder based on the image-text mutual detection network performs feature extraction on the image to obtain the image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images; when the text information is input, a text feature encoder based on an image-text mutual check network carries out feature encoding on the text information to obtain corresponding text encoding features; searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain and send a search result;
s402, the client receives the search result and displays the search result.
Therefore, the embodiment can improve the reasoning effect through the newly trained image-text mutual detection network.
The following describes an image-text mutual inspection method of the image-text mutual inspection network provided by the application through another embodiment.
In this embodiment, the method may include:
s501, a server receives data to be retrieved input by a client;
s502, when an image is input, an image multi-connection feature encoder based on an image-text mutual detection network performs feature extraction on the image to obtain an image multi-connection feature; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
s503, when the text information is input, a text feature encoder based on a graph-text mutual inspection network performs feature encoding on the text information to obtain corresponding text encoding features;
s504, searching the image multi-connection features or the text coding features through an output layer of the image-text mutual detection network to obtain a search result;
s505, the search result is sent to the client so that the client displays the search result.
Therefore, the embodiment can improve the reasoning effect through the newly trained image-text mutual detection network.
The following describes an image-text mutual inspection method of the image-text mutual inspection network provided by the application through another embodiment.
In this embodiment, a new neural network structure and training method are provided for the task of searching multi-structure text and images.
The embodiment mainly comprises the following steps: 1. constructing a neural network of the multi-structure text graph; 2. constructing a multi-structure text connection relation; 3. judging a neighbor relation; 4. extracting image features; 5. an image neighbor relation construction and judgment method; 6. and constructing a loss function.
And in the first part, extracting image features and constructing image neighbor relations.
This embodiment takes recipe text and recipe images as an example to describe the specific implementation process, but it can also be applied to other fields.
Referring to fig. 2, fig. 2 is a schematic diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, each sample consists of 1 image and 1 multi-structured text, as shown in fig. 2.
In the training process, 1 image corresponds to 1 multi-structured text.
Referring to fig. 3, fig. 3 is an image encoding schematic diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
For image encoding, an image classification network, such as a ResNet, is first adopted to extract the features of the whole image as its overall feature representation. Then an image detection network, such as a YOLO network, is used to extract the main component detection frames and component information of the whole image.
In this embodiment, a main component detection frame in an image is obtained through an image detection network, and a detection network feature corresponding to the detection frame is extracted as the component feature, and all the detection frames are traversed to complete extraction of all the component features and component information.
Wherein the combined image global features and the component information features are used as final feature representations of the image.
The joint features are mapped to a unified dimension through a fully connected layer, forming a new feature vector that can be denoted as feature p.
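A minimal sketch of this fusion step, assuming illustrative feature dimensions (2048 for the ResNet-style global feature, 512 for component features; both are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

class JointFeature(nn.Module):
    """Concatenate the overall image feature with aggregated component
    features and map the result to a unified dimension (feature p)."""
    def __init__(self, global_dim=2048, comp_dim=512, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(global_dim + comp_dim, out_dim)

    def forward(self, global_feat, comp_feats):
        # global_feat: (B, global_dim); comp_feats: (B, K, comp_dim)
        pooled = comp_feats.mean(dim=1)              # aggregate the K detected components
        joint = torch.cat([global_feat, pooled], dim=1)
        return self.fc(joint)                        # new feature vector: feature p
```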
Wherein an image-based graph structure is established.
Referring to fig. 4, fig. 4 is a schematic diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
By the above operation, each image sample contains a final feature vector representation and its corresponding component information.
All image samples are traversed to construct a graph structure of image features and component information, as shown in fig. 4.
Each image sample corresponds to one or more main materials, and connection relations are established between the image features and the main-material features.
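A sketch of this graph construction under an assumed sample format (the dict layout is illustrative): each image sample node is connected to the node of every main material it contains.

```python
from collections import defaultdict

def build_image_graph(samples):
    """samples: list of dicts like {"feature": p, "main_materials": [...]}.
    Returns an edge list linking each image node to its main-material nodes."""
    edges = defaultdict(list)
    for idx, sample in enumerate(samples):
        for material in sample["main_materials"]:
            edges[("image", idx)].append(("material", material))
    return edges
```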
And the second part is a multi-structure text graph neural network construction.
The multi-structure text of recipes is used as the example throughout, but other text application fields are equally applicable.
1) Selecting the data and its multi-structure semantic information.
For each dish, the data is composed of several types; three types are used in this embodiment: main materials, process, and cooking-step text. Each dish contains all three items of information.
2) Establishing reasonable multi-node paths according to the screened semantic information, with at least 2 such paths.
Referring to fig. 5, fig. 5 is a schematic path diagram of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
As shown in fig. 5, two types of paths are constructed in this embodiment: dish name-main material-dish name, and dish name-process-dish name.
The construction rule is as follows: as long as main-material information appears in the dish name or the cooking-step text, the dish is connected with that main-material node; and as long as a cooking-mode keyword, such as deep-frying, stir-frying, boiling or pan-frying, appears in the dish name or the cooking-step text, the dish is connected with that process node. All samples are traversed to complete the establishment of the multi-node paths.
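The keyword-matching rule above can be sketched as follows (the keyword lists are assumed examples; the patent does not fix a vocabulary):

```python
MAIN_MATERIALS = ["tomato", "cucumber", "fish", "egg"]            # assumed vocabulary
PROCESS_KEYWORDS = ["deep-frying", "stir-frying", "boiling", "pan-frying"]

def connect_dish(dish_name: str, step_text: str):
    """Connect a dish node to every main-material node and process node whose
    keyword appears in the dish name or the cooking-step text."""
    text = dish_name + " " + step_text
    material_edges = [m for m in MAIN_MATERIALS if m in text]
    process_edges = [p for p in PROCESS_KEYWORDS if p in text]
    return material_edges, process_edges
```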
3) And constructing a graph neural network.
Wherein, the construction of the graph neural network comprises: and constructing a graph neural network node and characteristics thereof, and constructing connection relations among the nodes.
First, the graph neural network nodes and their features are constructed. Text features are extracted by acquiring the text information of each recipe, which in this embodiment comprises the dish name and the step text information.
In this embodiment, each dish is referred to as a sample and includes the dish name and step text information. After obtaining the text information of each sample, each word is converted into a feature vector using the word2vec method.
The feature vectors of all the texts are input into a transformer network, and the final feature expression of all the texts is obtained and is called node feature in the embodiment. The feature of a node is the feature code of all characters of a sample.
Referring to fig. 6, fig. 6 is a schematic diagram of text encoding of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
As shown in fig. 6, text 1 represents the dish name, text 2 represents the step text, and text 3 is not used in this embodiment.
Each character is converted into a feature vector Emb by the word2vec method. The text type is then obtained: in this embodiment the dish name corresponds to text type 1, as in [1] of fig. 6, and the step text corresponds to text type 2, as in [2] of fig. 6.
Text position information is then acquired: for each type of text, the position of each character within its text is obtained. For example, if text 1 is the dish name 西红柿炒鸡蛋 (stir-fried egg with tomato), the position information of 西 is 1, of 红 is 2, and of 蛋 is 6. The position information of all texts (text 1 and text 2) and their characters is obtained in turn.
The Emb feature of each character is summed with its position information feature and its type information feature (three items) to obtain the final input feature vector of the character, which is input into the transformer network.
The transformer network then yields the output feature vectors of all the characters, each character corresponding to one feature vector output by the transformer network.
In this embodiment, the average value of the output feature vectors of all the characters is obtained and used as the node feature of one sample. Traversing all samples, and respectively solving the node characteristics of each sample.
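A sketch of this node-feature extraction, with assumed hyperparameters (embedding dimension, layer count and vocabulary size are all illustrative); a learned embedding stands in for the word2vec features:

```python
import torch.nn as nn

class TextNodeEncoder(nn.Module):
    """Sum character, position and type embeddings, run a transformer encoder,
    and mean-pool the outputs into one node feature per sample."""
    def __init__(self, vocab_size=30000, dim=256, max_len=512, n_types=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # stands in for word2vec
        self.pos_emb = nn.Embedding(max_len, dim)       # character position feature
        self.type_emb = nn.Embedding(n_types, dim)      # text type (1: dish name, 2: steps)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens, positions, types):        # all: (B, L) long tensors
        x = self.word_emb(tokens) + self.pos_emb(positions) + self.type_emb(types)
        out = self.encoder(x)                            # (B, L, dim), one vector per character
        return out.mean(dim=1)                           # node feature of the sample
```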
Then, connection relations between the respective nodes are constructed.
Through the above steps, a graph neural network with each dish as one node has been constructed; the closeness relations between nodes are characterized below.
First, the path nodes are established; in this embodiment, the main-material nodes and the process nodes are established in the same way.
For example, the main-material nodes include: tomato, cucumber, fish, meat, and the like.
The process nodes include: deep-frying, stir-frying, boiling, pan-frying, and the like.
Two types of paths can be constructed: dish name-main material-dish name and dish name-process-dish name.
Each dish name corresponds to one dish, one sample and one node.
Each dish (each node) is traversed, and connections with path nodes are established for each node, i.e. each node is connected with its main-material nodes and its process nodes.
Referring to fig. 7, fig. 7 is a schematic diagram of node connection of an image-text mutual inspection method of an image-text mutual inspection network according to an embodiment of the present application.
As shown in fig. 7, all tomato dishes are connected with the main-material node for tomato, and all stir-fried dishes are connected with the corresponding process node.
And carrying out graph neural network calculation according to the graph neural network and the close relation thereof.
As shown in fig. 7, the middle (green) nodes are sample nodes, and the closeness between sample nodes needs to be determined; it is represented by a connection coefficient. If there is a path connection (over any path) between two nodes, the two nodes are said to be neighbors. The number of connections between two nodes over all connection relations is called the connection number.
For example: stir-fried egg with tomato and tomato-egg soup are neighbors, and their connection number is 2 (tomato, egg). Some samples have many main materials, so the connection number is often greater than 2.
The graph neural network calculation is performed as follows. With the graph structure constructed above, the basic graph is defined as $G=(V,E)$. Here $V$ denotes the set of graph neural network nodes, $V=\{v_1, v_2, \dots, v_N\}$, where $v_i$ denotes the feature of the $i$-th node; $E$ denotes the set of connection relations of the graph neural network, $E=\{e_{ij}\}$ (a connection relation exists between nodes, and the connection number represents the connection strength, i.e. the degree of neighborhood), where $e_{ij}$ denotes the connection strength, i.e. the number of connections between the $i$-th node and the $j$-th node; and $A$ denotes the adjacency matrix, where each element $A_{ij}$ denotes the connection relation between nodes $v_i$ and $v_j$.

Each node is traversed in turn. For node $v_i$, all other nodes are sorted in descending order of connection number, and the top $K$ most similar nodes (those with the largest connection numbers) are kept as the set $S_i$, called the neighbors of the node. Considering that different neighbor nodes differ in importance, weight information is assigned to each connected edge of the node; consistent with this description, the calculation can be reconstructed as

$$A_{ij} = \frac{e_{ij}}{\sum_{k \in S_i} e_{ik}}, \quad j \in S_i.$$

The constructed graph reflects the relationship between any two nodes through the adjacency matrix $A$: if $A_{ij}=0$, there is no connection between nodes $v_i$ and $v_j$.
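Under the reconstruction above, the top-K neighbor selection and edge weighting can be sketched as follows (it assumes the diagonal of the connection-count matrix is zero, i.e. no self-counts):

```python
import torch

def neighbor_weights(conn_counts: torch.Tensor, k: int) -> torch.Tensor:
    """conn_counts: (N, N) matrix of connection numbers e_ij (zero diagonal).
    Keep each node's top-k neighbors and weight every kept edge by its share
    of the node's kept connection total."""
    a = torch.zeros_like(conn_counts, dtype=torch.float)
    for i in range(conn_counts.size(0)):
        vals, idx = conn_counts[i].float().topk(k)
        if vals.sum() > 0:
            a[i, idx] = vals / vals.sum()
    return a  # weighted adjacency; a[i, j] == 0 means no connection
```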
The calculation process of the graph neural network mainly illustrates how to obtain complementary information among neighbor nodes by using the graph neural network, so as to obtain more robust node characteristic representation.
The calculation over graph-structured data is a process of weighted summation over a vertex and its neighbor nodes. The graph neural network calculation process may be defined as

$$Z = f(V, A),$$

where $V$ denotes the set of graph neural network nodes, $V=\{v_1, \dots, v_N\}$, with $v_i$ the node features; $A$ is the adjacency matrix, indicating whether edges exist between two nodes and how strongly they are connected; and $Z$ denotes the new features after the graph neural network calculation.

Each layer of the graph neural network is written $H^{(l)}$, where the superscript $l$ denotes the $l$-th layer. The layer-wise calculation formula is

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad \tilde{A} = A + I, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij},$$

where $H^{(l)}$ denotes the features of the nodes at the $l$-th layer (with $H^{(0)} = V$); $\tilde{D}$ is a diagonal matrix whose diagonal elements are computed as above; $W^{(l)}$ denotes the network parameters to be trained at this layer; and $H^{(l+1)}$ is the node feature after the update by this layer of the graph neural network.
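A minimal PyTorch sketch of one such layer, implementing $H^{(l+1)} = \sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)})$ as reconstructed above (ReLU is an assumed choice of $\sigma$):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: normalize the self-looped adjacency and
    apply the trainable weight W^(l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h, adj):
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)  # A~ = A + I
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                  # D~^(-1/2) diagonal
        norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.weight(norm @ h))                   # H^(l+1)
```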
And thirdly, designing a loss function, namely constructing an image-text mutual inspection loss function.
As described above, the image of each sample and its corresponding multi-structure text are characterized by the image feature encoding process and the multi-structure text feature extraction process.
Retrieval is performed next. The aim is to make mutually corresponding sample features as close as possible and sample features of different categories as far apart as possible, which improves retrieval accuracy. In practice, the samples most prone to retrieval errors are similar samples: for example, steamed perch and boiled perch use similar materials and the finished dishes look very alike, so errors occur easily. In view of this, this embodiment proposes a new image-text mutual inspection loss function that addresses the problem of similar samples being easily mis-retrieved.
The method comprises the following steps:
step 1, for any image sample, its corresponding multi-structure sample is a positive sample, and this pair of samples needs to be as close as possible, so the loss function is constructed as follows.
Assuming that N sample pairs are drawn per training iteration, the formula may take the form

$$L_1 = \frac{1}{N}\sum_{n=1}^{N} \left\| v_n - t_n \right\|_2,$$

where $t_n$ denotes the multi-structure text feature of the $n$-th sample and $v_n$ denotes the image feature of the $n$-th sample.
Step 2, for any one image sample, an image similarity sample group is established, a most similar sample is selected from the similarity sample group, the most similar sample is called an image most similar sample, and a loss function is constructed by using the image most similar sample, and the steps are as follows:
for any one image sample, there is a positive sample for its corresponding multi-structure sample, which is defined as an anchor sample. All neighbor samples of the anchor sample are obtained, and all samples connected with the anchor sample through a defined path (main material path and process path) are neighbor samples of the anchor sample. As shown in the following figure, all paths are traversed, and samples with path links to anchor samples are calculated to construct a similar sample group.
And traversing all samples in the similar sample group, calculating the multi-structure text with the maximum connection number with the anchor point samples, marking the multi-structure text as the image most similar sample, and taking the average value of the characteristics of the most similar samples as the image most similar sample if a plurality of most similar samples with the same connection number exist.
Referring to fig. 8, fig. 8 is a schematic diagram of a positive sample of a graph-text mutual inspection method of a graph-text mutual inspection network according to an embodiment of the application.
As shown in fig. 8, multi-structure sample 2 has 2 connection relations with the anchor sample (here process-process and main material-main material; there may be more), so multi-structure sample 2 is the image most-similar sample of the anchor sample, denoted $t_n^{\mathrm{sim}}$. This embodiment requires the feature distance between the image sample and the image most-similar sample to be as large as possible, so the loss function is constructed as

$$L_2 = -\frac{1}{N}\sum_{n=1}^{N} \left\| v_n - t_n^{\mathrm{sim}} \right\|_2.$$
And step 3, similarly, for any text sample, establishing a similar sample group, selecting a most similar sample from the similar sample group, namely a text most similar sample, and constructing a loss function by using the text most similar sample, wherein the process is as follows:
for any one text sample, there is a positive sample for its corresponding image sample, which is defined as an anchor sample. All neighbor samples of the anchor sample are obtained, and all samples connected with the anchor sample through a defined path (the image detection network obtains the principal material information, and in the above steps, the image graph network is built according to the principal material) are all neighbor samples of the anchor sample. As shown in the following figure, all paths are traversed, and samples with path links to anchor samples are calculated to construct a similar sample group.
All samples in the similar sample group are traversed, and the image node with the largest connection number to the anchor sample is computed and marked as the text most-similar sample; if several most-similar samples share the same connection number, the average of their features is taken as the text most-similar sample, denoted $v_n^{\mathrm{sim}}$. This embodiment requires the feature distance between the text sample and the text most-similar sample to be as large as possible, so the loss function is constructed as

$$L_3 = -\frac{1}{N}\sum_{n=1}^{N} \left\| t_n - v_n^{\mathrm{sim}} \right\|_2.$$
The final loss function of this embodiment combines the loss terms constructed above into a single image-text retrieval loss; a hedged sketch of the combination follows.
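The text-side term mirrors the image-side sketch above, and the combination is shown as an equal-weight sum; both the symmetric form and the weighting are assumptions rather than the patent's stated formula:

```python
import torch.nn.functional as F

def text_hard_negative_loss(txt_feat, most_similar_feat, margin=0.2):
    # Symmetric to the image-side term: push the text feature away from
    # its text most-similar sample.
    d = 1.0 - F.cosine_similarity(txt_feat, most_similar_feat, dim=-1)
    return F.relu(margin - d).mean()

def retrieval_loss(l_pos, l_neg, l_img_sim, l_txt_sim):
    # Equal-weight sum of the positive-pair, negative-pair, and the two
    # most-similar-sample terms; the weighting is an assumption.
    return l_pos + l_neg + l_img_sim + l_txt_sim
```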
Finally, the training of this network is explained. During training, the loss function is used for gradient back-propagation, and the parameters of the graph neural network are updated accordingly.
The training process may include:
Step 1: construct the graph-based neural network, which includes extracting image and text node features, constructing the graphs, constructing the neighbors, and constructing the connection relations.
Step 2: establish the loss function.
Step 3: train the network according to the loss function until the network converges.
Concretely, the training of the graph neural network is divided into two phases. The first is the phase in which data propagates from the lower layers to the higher layers, i.e., the forward-propagation phase. The other is the phase in which, when the forward-propagation result does not match the expected result, the error is propagated from the higher layers back to the bottom layers, i.e., the back-propagation phase. The training process is as follows (a compact code sketch follows the numbered steps):
1. All network-layer weights are initialized, generally by random initialization;
2. the input image and text data are propagated forward through the graph neural network, the fully connected layer, and other layers to obtain the output value;
3. the output value of the network is obtained, and the loss value of the network is computed according to the loss-function formula;
4. the error is propagated back into the network, and the back-propagation error of each layer (graph neural network layer, fully connected layer, and so on) is obtained in turn;
5. all weight coefficients in the network are adjusted according to each layer's back-propagation error, i.e., the weights are updated;
6. a new batch of image-text data is randomly selected, and the process returns to step 2 to perform forward propagation and obtain a new output value;
7. the iteration is repeated, and training ends when the error between the network's output value and the target value (label) is smaller than a certain threshold, or when the number of iterations exceeds a certain threshold;
8. the trained network parameters of all layers are saved.
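A minimal PyTorch-style rendering of steps 1-8, assuming a model that wraps the image multi-connection feature encoder and the text feature encoder, and a loss_fn implementing the retrieval loss above; all names, the optimizer choice, and the checkpoint path are illustrative assumptions:

```python
import torch

def train(model, loss_fn, loader, epochs=10, lr=1e-4, tol=1e-3):
    # Step 1: weights are randomly initialized when the model is constructed.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):                         # step 7: iteration bound
        for images, texts in loader:                    # step 6: next random batch
            img_feat, txt_feat = model(images, texts)   # step 2: forward pass
            loss = loss_fn(img_feat, txt_feat)          # step 3: loss value
            opt.zero_grad()
            loss.backward()                             # step 4: back-propagate errors
            opt.step()                                  # step 5: update the weights
            if loss.item() < tol:                       # step 7: error threshold
                torch.save(model.state_dict(), "itmr.pt")  # step 8: save parameters
                return model
    torch.save(model.state_dict(), "itmr.pt")           # step 8
    return model
```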
In summary, an image multi-connection feature encoder comprising an image classification network, an image detection network, and a graph structure construction network of images is built; an initial image-text mutual inspection network capable of processing multi-structure text data is then constructed; and finally the network is trained, yielding an image-text retrieval network that processes images more efficiently, which improves the retrieval effect on multi-modal data and the reasoning accuracy.
The following describes the training device of the image-text mutual inspection network provided by the embodiment of the present application; the training device described below and the training method described above may be referred to in correspondence with each other.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training device for an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, the apparatus may include:
an encoder construction module 110 for constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
the network construction module 120 is configured to perform network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
a loss function construction module 130, configured to construct an image-text retrieval loss function;
the network training module 140 is configured to train the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data, so as to obtain the image-text mutual inspection network.
The following introduces the image-text mutual inspection device of the image-text mutual inspection network provided by the embodiment of the present application; the image-text mutual inspection device described below and the image-text mutual inspection method described above may be referred to in correspondence with each other.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image-text mutual inspection device of an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, the apparatus may include:
the image feature processing module 210 is configured to, when the input is an image, perform feature extraction on the image using the image multi-connection feature encoder of the image-text mutual inspection network, so as to obtain image multi-connection features; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
the text feature processing module 220 is configured to, when the input is text information, perform feature encoding on the text information using the text feature encoder of the image-text mutual inspection network, so as to obtain the corresponding text encoding features;
the reasoning module 230 is configured to retrieve the image multi-connection feature or the text coding feature through an output layer of the image-text mutual inspection network, so as to obtain a retrieval result.
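As a sketch of how these three modules could cooperate at inference time (class, method, and attribute names are hypothetical; the patent does not prescribe an API, and the image/text dispatch test is a simplification):

```python
import torch

class ImageTextMutualRetrieval:
    def __init__(self, image_encoder, text_encoder, output_layer, gallery):
        self.image_encoder = image_encoder  # image multi-connection feature encoder
        self.text_encoder = text_encoder    # text feature encoder
        self.output_layer = output_layer    # retrieval head of the network
        self.gallery = gallery              # pre-encoded candidates of the other modality

    def retrieve(self, query):
        if isinstance(query, torch.Tensor) and query.dim() >= 3:  # image input
            feat = self.image_encoder(query)      # image multi-connection features
        else:                                     # text input
            feat = self.text_encoder(query)       # text encoding features
        scores = self.output_layer(feat, self.gallery)
        return scores.argsort(descending=True)    # ranked retrieval result
```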
The present application also provides a server, please refer to fig. 11, fig. 11 is a schematic structural diagram of a server provided in an embodiment of the present application, and the server may include:
a memory for storing a computer program;
a processor, configured to implement the steps of any of the above training methods of the image-text mutual inspection network when executing the computer program.
As shown in fig. 11, which is a schematic structural diagram of a server, the server may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 all complete communication with each other through a communication bus 13.
In an embodiment of the present application, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device, etc.
The processor 10 may call a program stored in the memory 11; in particular, the processor 10 may perform the operations in an embodiment of the training method or the image-text mutual inspection method of the image-text mutual inspection network.
The memory 11 is used for storing one or more programs, and the programs may include program codes including computer operation instructions, and in the embodiment of the present application, at least the programs for implementing the following functions are stored in the memory 11:
constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network of images;
performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
constructing an image-text retrieval loss function;
and training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain an image-text mutual inspection network.
In one possible implementation, the memory 11 may include a storage program area and a storage data area, where the storage program area may store an operating system, and at least one application program required for functions, etc.; the storage data area may store data created during use.
In addition, the memory 11 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for interfacing with other devices or systems.
Of course, it should be noted that the structure shown in fig. 11 does not limit the server in the embodiment of the present application; in practical applications, the server may include more or fewer components than those shown in fig. 11, or some components may be combined.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can implement the steps of any of the above training methods of the image-text mutual inspection network or of the image-text mutual inspection method of the image-text mutual inspection network.
The computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
For the description of the computer-readable storage medium provided by the present application, refer to the above method embodiments, and the disclosure is not repeated here.
In this description, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments, so the identical or similar parts among the embodiments may be referred to one another. For the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively brief, and the relevant points can be found in the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a training method for an image-text mutual inspection network, an image-text mutual inspection method of an image-text mutual inspection network, a training device for an image-text mutual inspection network, an image-text mutual inspection device of an image-text mutual inspection network, a server, and a computer-readable storage medium. The principles and embodiments of the present application have been described herein with reference to specific examples; the description of these examples is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that those skilled in the art may make various modifications and adaptations of the application without departing from the principles of the application, and these modifications and adaptations are intended to fall within the scope of the application as defined by the following claims.

Claims (12)

1. A training method of an image-text mutual inspection network, characterized by comprising the following steps:
constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network, and a graph structure construction network of images; the graph structure construction network is used for constructing a graph structure based on the integral features and the component features of each image, so as to obtain graph structure features linking the image features to the component information; the text feature encoder comprises a connection relation construction layer, which is used for taking each sample as a node and constructing connection relations among the nodes based on the semantic information of each node;
performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
constructing an image-text retrieval loss function; the image-text retrieval loss function comprises a loss function that makes the distance between a text sample and its corresponding text most-similar sample increasingly large; wherein a multi-structure sample corresponding to any image sample is taken as a positive sample and defined as an anchor sample; all neighbor samples of the anchor sample are obtained, all samples connected with the anchor sample through a main-material path or a process path being neighbor samples of the anchor sample; all paths are traversed, and the samples with a path link to the anchor sample are computed to form a similar-sample group; all samples in the similar-sample group are traversed, and the multi-structure text with the largest number of connections to the anchor sample is computed and taken as the text most-similar sample; if there are several text most-similar samples with the same connection count, the feature average value is taken as the text most-similar sample;
And training the initial image-text mutual inspection network based on the image-text retrieval loss function and training data to obtain an image-text mutual inspection network.
2. The training method according to claim 1, characterized in that the image classification network is configured to extract the features of the input image and use them as the integral features of the image;
the image detection network is configured to extract the main component detection frame and the component information of the image, and to take the component information as the component features.
3. The training method according to claim 1, wherein the text feature encoder comprises: a node feature extraction layer, a graph construction layer, and a neighbor relation construction layer.
4. The training method according to claim 3, wherein the node feature extraction layer is configured to perform feature encoding on the text information of the multi-structure text to obtain the feature code corresponding to each sample;
the graph construction layer is configured to construct the graph neural network corresponding to each node based on the connection relations between the nodes;
the neighbor relation construction layer is configured to weight the edges of the graph neural network of the corresponding nodes based on the number of connections between the nodes, so as to obtain the corresponding node features.
5. The training method according to claim 1, wherein constructing the image-text retrieval loss function comprises:
constructing a first loss function with the goal of making the distance between a sample and its positive sample smaller and smaller;
constructing a second loss function with the goal of making the distance between a sample and its negative sample larger and larger;
constructing a third loss function with the goal of making the distance between a text sample and its corresponding text most-similar sample larger and larger;
and combining the first loss function, the second loss function and the third loss function into the image-text retrieval loss function.
6. An image-text mutual inspection method of an image-text mutual inspection network, characterized by comprising:
when the input is an image, performing feature extraction on the image by the image multi-connection feature encoder of the image-text mutual inspection network to obtain image multi-connection features; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network, and a graph structure construction network of images; the graph structure construction network is used for constructing a graph structure based on the integral features and the component features of each image, so as to obtain graph structure features linking the image features to the component information;
when the input is text information, performing feature encoding on the text information by the text feature encoder of the image-text mutual inspection network to obtain the corresponding text encoding features; the text feature encoder comprises a connection relation construction layer, which is used for taking each sample as a node and constructing connection relations among the nodes based on the semantic information of each node;
retrieving with the image multi-connection features or the text encoding features through the output layer of the image-text mutual inspection network to obtain a retrieval result; wherein the image-text retrieval loss function used in training comprises a loss function that makes the distance between a text sample and its corresponding text most-similar sample increasingly large; a multi-structure sample corresponding to any image sample is taken as a positive sample and defined as an anchor sample; all neighbor samples of the anchor sample are obtained, all samples connected with the anchor sample through a main-material path or a process path being neighbor samples of the anchor sample; all paths are traversed, and the samples with a path link to the anchor sample are computed to form a similar-sample group; all samples in the similar-sample group are traversed, and the multi-structure text with the largest number of connections to the anchor sample is computed and taken as the text most-similar sample; if there are several text most-similar samples with the same connection count, the feature average value is taken as the text most-similar sample.
7. The method according to claim 6, wherein, when the input is an image, performing feature extraction on the image by the image multi-connection feature encoder of the image-text mutual inspection network to obtain the image multi-connection features comprises:
when the input is an image, extracting the characteristics of the input image and taking the characteristics as the integral characteristics of the image;
extracting a main component detection frame and component information of the image, and taking the component information as component characteristics;
and constructing a graph structure based on the integral features and the component features of each image, so as to obtain graph structure features linking the image features to the component information.
8. The method according to claim 6, wherein the text feature encoder comprises: a node feature extraction layer, a connection relation construction layer, a graph construction layer, and a neighbor relation construction layer.
9. A training device of an image-text mutual inspection network, characterized by comprising:
an encoder construction module, configured to construct an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network, and a graph structure construction network of images; the graph structure construction network is used for constructing a graph structure based on the integral features and the component features of each image, so as to obtain graph structure features linking the image features to the component information; the text feature encoder comprises a connection relation construction layer, which is used for taking each sample as a node and constructing connection relations among the nodes based on the semantic information of each node;
The network construction module is used for carrying out network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
a loss function construction module, configured to construct an image-text retrieval loss function; the image-text retrieval loss function is used for increasing the feature distance between different image samples, and comprises a loss function that makes the distance between a text sample and its corresponding text most-similar sample increasingly large; wherein a multi-structure sample corresponding to any image sample is taken as a positive sample and defined as an anchor sample; all neighbor samples of the anchor sample are obtained, all samples connected with the anchor sample through a main-material path or a process path being neighbor samples of the anchor sample; all paths are traversed, and the samples with a path link to the anchor sample are computed to form a similar-sample group; all samples in the similar-sample group are traversed, and the multi-structure text with the largest number of connections to the anchor sample is computed and taken as the text most-similar sample; if there are several text most-similar samples with the same connection count, the feature average value is taken as the text most-similar sample;
and a network training module, configured to train the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data, so as to obtain the image-text mutual inspection network.
10. An image-text mutual inspection device of an image-text mutual inspection network, comprising:
an image feature processing module, configured to, when the input is an image, perform feature extraction on the image by the image multi-connection feature encoder of the image-text mutual inspection network to obtain image multi-connection features; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network, and a graph structure construction network of images; the graph structure construction network is used for constructing a graph structure based on the integral features and the component features of each image, so as to obtain graph structure features linking the image features to the component information;
a text feature processing module, configured to, when the input is text information, perform feature encoding on the text information by the text feature encoder of the image-text mutual inspection network to obtain the corresponding text encoding features; the text feature encoder comprises a connection relation construction layer, which is used for taking each sample as a node and constructing connection relations among the nodes based on the semantic information of each node;
and a reasoning module, configured to retrieve with the image multi-connection features or the text encoding features through the output layer of the image-text mutual inspection network to obtain a retrieval result; wherein the image-text retrieval loss function used in training comprises a loss function that makes the distance between a text sample and its corresponding text most-similar sample increasingly large; a multi-structure sample corresponding to any image sample is taken as a positive sample and defined as an anchor sample; all neighbor samples of the anchor sample are obtained, all samples connected with the anchor sample through a main-material path or a process path being neighbor samples of the anchor sample; all paths are traversed, and the samples with a path link to the anchor sample are computed to form a similar-sample group; all samples in the similar-sample group are traversed, and the multi-structure text with the largest number of connections to the anchor sample is computed and taken as the text most-similar sample; if there are several text most-similar samples with the same connection count, the feature average value is taken as the text most-similar sample.
11. A server, comprising:
a memory for storing a computer program;
processor for implementing the steps of the training method of the mutual-text detection network according to any one of claims 1 to 5 and/or the steps of the mutual-text detection method of the mutual-text detection network according to any one of claims 6 to 8 when executing the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the training method of the image-text mutual inspection network according to any one of claims 1 to 5 and/or the steps of the image-text mutual inspection method of the image-text mutual inspection network according to any one of claims 6 to 8.