CN115858848A - Image-text mutual retrieval method and device, training method and device, server and medium - Google Patents

Image-text mutual retrieval method and device, training method and device, server and medium

Info

Publication number
CN115858848A
Authority
CN
China
Prior art keywords
image
text
network
constructing
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310166849.5A
Other languages
Chinese (zh)
Other versions
CN115858848B (en)
Inventor
赵坤
王立
李仁刚
赵雅倩
范宝余
鲁璐
郭振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202310166849.5A
Publication of CN115858848A
Application granted
Publication of CN115858848B
Legal status: Active
Anticipated expiration: pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an image-text mutual retrieval method and device, a training method and device, a server and a medium, relating to the technical field of data processing. The training method includes: constructing an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network; constructing an image-text retrieval loss function; and training the initial image-text mutual retrieval network based on the image-text retrieval loss function and training data to obtain the image-text mutual retrieval network. The method improves both the effectiveness of processing multi-modal data and the accuracy of inference.

Description

Image-text mutual retrieval method and device, training method and device, server and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a training method and device for an image-text mutual retrieval network, an image-text mutual retrieval method and device, a server, and a computer-readable storage medium.
Background
With the continuous development of information technology, artificial intelligence techniques are being applied in more and more fields to improve the efficiency and effectiveness of data processing. In the field of recognizing text data and image data, corresponding models can be used to obtain regression or classification results.
In the related art, the multi-modal field involves mutual retrieval tasks between multi-structure text and image sequences, where multi-modal refers to data that contains both text and image sequences. The retrieval networks commonly adopted cannot effectively process the image sequences in multi-modal data, which degrades retrieval performance and results in low inference accuracy.
Therefore, how to improve the processing of multi-modal data and the accuracy of inference is an important issue for those skilled in the art.
Disclosure of Invention
The purpose of the present application is to provide a training method for an image-text mutual retrieval network, an image-text mutual retrieval method, corresponding client-side and server-side training and retrieval methods, a training device, a retrieval device, a server, and a computer-readable storage medium, so as to improve the processing of multi-modal data and the accuracy of inference.
To solve the above technical problem, the present application provides a training method for an image-text mutual retrieval network, including:
constructing an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network;
constructing an image-text retrieval loss function;
and training the initial image-text mutual retrieval network based on the image-text retrieval loss function and training data to obtain the image-text mutual retrieval network.
Optionally, the image classification network is configured to extract features of an input image and use them as the overall features of the image;
the image detection network is configured to extract the principal component detection boxes and component information of the image, using the component information as component features;
and the graph structure construction network for images is configured to build a graph structure from the overall features and component features of each image, yielding graph structure features that link image features to component information.
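For illustration only, the following is a minimal PyTorch sketch of such a two-level image encoder. The backbone choice (ResNet-50), the feature dimensions, and the assumption that per-box detection features are precomputed are illustrative assumptions, not details fixed by this application.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ImageMultiConnectionEncoder(nn.Module):
        """Two-level image encoder sketch: a classification branch for overall
        features, a detection branch for component features, and a fusion step
        producing the node feature used by the image graph structure."""

        def __init__(self, feat_dim: int = 512, det_feat_dim: int = 2048):
            super().__init__()
            backbone = models.resnet50(weights=None)
            # Drop the classification head; keep the 2048-d pooled feature.
            self.classifier = nn.Sequential(*list(backbone.children())[:-1])
            self.global_proj = nn.Linear(2048, feat_dim)
            # Per-box features from a detection network (assumed precomputed).
            self.component_proj = nn.Linear(det_feat_dim, feat_dim)
            # Fully connected layer mapping combined features to one dimension.
            self.fuse = nn.Linear(2 * feat_dim, feat_dim)

        def forward(self, image: torch.Tensor, component_feats: torch.Tensor):
            # Overall feature of the whole image (classification branch).
            overall = self.global_proj(self.classifier(image).flatten(1))
            # Pool the component (detection-box) features into one vector.
            components = self.component_proj(component_feats).mean(dim=1)
            # Combine overall and component features into the final feature.
            return self.fuse(torch.cat([overall, components], dim=1))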
Optionally, the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer, and a neighbor relation construction layer.
Optionally, the node feature extraction layer is configured to perform feature encoding on the text information of the multi-structure text, obtaining a feature code for each sample;
the connection relation construction layer is configured to treat each sample as a node and build connection relations between nodes based on the semantic information of each node;
the graph construction layer is configured to build a graph neural network over the nodes based on the connection relations between them;
and the neighbor relation construction layer is configured to weight the edges of the graph neural network according to the connection numbers between nodes, obtaining the corresponding node features.
Optionally, constructing the image-text retrieval loss function includes:
constructing a first loss function with the goal of decreasing the distance between a sample and its positive sample;
constructing a second loss function with the goal of increasing the distance between a sample and its negative sample;
constructing a third loss function with the goal of increasing the distance between a text sample and the corresponding text's most similar sample;
and combining the first, second, and third loss functions into the image-text retrieval loss function.
The present application also provides an image-text mutual retrieval method using the image-text mutual retrieval network, including:
when an image is input, performing feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
when text information is input, performing feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features;
and retrieving against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain a retrieval result.
Optionally, when an image is input, performing feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features includes:
when an image is input, extracting features of the input image as the overall features of the image;
extracting the principal component detection boxes and component information of the image, using the component information as component features;
and building a graph structure from the overall features and component features of each image to obtain graph structure features linking image features to component information.
Optionally, the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer, and a neighbor relation construction layer.
The present application also provides a training method for the image-text mutual retrieval network, including:
a client sends a network training instruction to a server, so that the server constructs an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; performs network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network; constructs an image-text retrieval loss function; trains the initial image-text mutual retrieval network based on the image-text retrieval loss function and training data to obtain the image-text mutual retrieval network; and sends the image-text mutual retrieval network;
and the client receives the image-text mutual retrieval network and displays a training completion message.
The present application also provides an image-text mutual retrieval method using the image-text mutual retrieval network, including:
a client inputs data to be retrieved to a server, so that when an image is input, the server performs feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; when text information is input, the server performs feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features; and the server retrieves against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain and send a retrieval result;
and the client receives the retrieval result and displays it.
The present application also provides an image-text mutual retrieval method using the image-text mutual retrieval network, including:
a server receives data to be retrieved input by a client;
when an image is input, performing feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
when text information is input, performing feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features;
retrieving against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain a retrieval result;
and sending the retrieval result to the client so that the client can display it.
The present application also provides a training device for the image-text mutual retrieval network, including:
an encoder construction module, configured to construct an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
a network construction module, configured to perform network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network;
a loss function construction module, configured to construct an image-text retrieval loss function;
and a network training module, configured to train the initial image-text mutual retrieval network based on the image-text retrieval loss function and training data to obtain the image-text mutual retrieval network.
The present application also provides an image-text mutual retrieval device for the image-text mutual retrieval network, including:
an image feature processing module, configured to, when an image is input, perform feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
a text feature processing module, configured to, when text information is input, perform feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features;
and an inference module, configured to retrieve against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain a retrieval result.
The present application further provides a server, including:
a memory for storing a computer program;
and a processor, configured to implement, when executing the computer program, the steps of the training method for the image-text mutual retrieval network and/or the steps of the image-text mutual retrieval method.
The present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the training method for the image-text mutual retrieval network and/or the steps of the image-text mutual retrieval method.
The application provides a training method for an image-text mutual retrieval network, including: constructing an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network; constructing an image-text retrieval loss function; and training the initial image-text mutual retrieval network based on the image-text retrieval loss function and training data to obtain the image-text mutual retrieval network.
The constructed image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images. An initial image-text mutual retrieval network capable of processing multi-structure text data is then constructed, and training finally yields an image-text retrieval network that processes images more effectively, improving the retrieval of multi-modal data and the accuracy of inference.
The application further provides corresponding image-text mutual retrieval methods, client-side and server-side training and retrieval methods, a training device, a retrieval device, a server, and a computer-readable storage medium, which have the same beneficial effects and are not described again here.
Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a training method for an image-text mutual retrieval network according to an embodiment of the present application;
fig. 2 is a sample schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 3 is an image encoding schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 4 is a graph structure diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 5 is a path schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 6 is a text encoding schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 7 is a node connection schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 8 is a positive sample schematic diagram of an image-text mutual retrieval method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device for an image-text mutual retrieval network according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image-text mutual retrieval device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The core of the present application is to provide a training method for an image-text mutual retrieval network, an image-text mutual retrieval method, corresponding client-side and server-side methods, a training device, a retrieval device, a server, and a computer-readable storage medium, so as to improve the processing of multi-modal data and the accuracy of inference.
In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
In the related art, the multi-modal field involves mutual retrieval tasks between multi-structure text and image sequences, where multi-modal refers to data that contains both text and image sequences. The retrieval networks commonly adopted cannot effectively process the image sequences in multi-modal data, which degrades retrieval performance and results in low inference accuracy.
Therefore, in the present application, an image multi-connection feature encoder is constructed that includes an image classification network, an image detection network, and a graph structure construction network for images; an initial image-text mutual retrieval network capable of processing multi-structure text data is then constructed; and training finally yields an image-text retrieval network that processes images more effectively, improving the retrieval of multi-modal data and the accuracy of inference.
The following describes a training method for the image-text mutual retrieval network according to an embodiment of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a training method for an image-text mutual retrieval network according to an embodiment of the present application.
In this embodiment, the method may include:
s101, constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder includes: the method comprises the steps of constructing an image classification network, an image detection network and an image structure of an image;
the steps aim at constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder includes: the image classification network, the image detection network and the image structure construction network are adopted.
Therefore, in the technical scheme of the application, in order to extract the features of the image, a double-layer image encoder is realized. Generally, only images are classified or image features are extracted through image detection, but the problem of incomplete image feature extraction easily exists, and the accuracy of image detection is reduced. Therefore, in the embodiment, the image classification network and the image detection network are adopted in the first level to process the image data in a parallel processing mode, and then the multiple features are fused in the second level through the image construction network of the image, so that the accuracy of image feature extraction is improved.
Further, the image classification network is used to extract features of the input image as the overall features of the image;
the image detection network is used to extract the principal component detection boxes and component information of the image, using the component information as component features;
and the graph structure construction network for images is used to build a graph structure from the overall features and component features of each image, obtaining graph structure features linking image features to component information.
In this way, the image multi-connection feature encoding in the technical solution of the present application processes image sequences effectively, improving the effect and efficiency of the final image-text mutual retrieval.
Further, the text feature encoder in this embodiment includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer, and a neighbor relation construction layer. These four layers form a two-level text feature encoder: the first level extracts features from the text information, and the second level builds neighbor relations over the extracted features, realizing two-level text feature extraction.
Specifically, the node feature extraction layer performs feature encoding on the text information of the multi-structure text, obtaining a feature code for each sample;
the connection relation construction layer treats each sample as a node and builds connection relations between nodes based on the semantic information of each node;
the graph construction layer builds a graph neural network over the nodes based on the connection relations between them;
and the neighbor relation construction layer weights the edges of the graph neural network according to the connection numbers between nodes, obtaining the corresponding node features.
S102, performing network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual retrieval network;
on the basis of S101, this step performs network construction based on the image multi-connection feature encoder and the text feature encoder to obtain the initial image-text mutual retrieval network.
Further, any network construction method provided in the prior art may be adopted in this step, and it is not specifically limited here.
S103, constructing an image-text retrieval loss function;
on the basis of S102, this step aims to construct the image-text retrieval loss function.
In order to improve the training effect, the process of constructing the image-text retrieval loss function may include:
Step 1: for any image sample, the corresponding multi-structure sample is its positive sample, and the first loss function is constructed with the goal of bringing the image sample and its positive sample ever closer together.
Step 2: for any image sample, establish an image similar sample group, select the most similar sample from the group (called the image's most similar sample), and use it to construct a loss function, as follows:
For any image sample, the corresponding multi-structure sample is its positive sample, defined as the anchor sample. All neighboring samples of the anchor are then obtained: every sample connected to the anchor through the defined paths (main-ingredient paths and process paths) is a neighboring sample. All paths are traversed, the samples path-linked to the anchor are collected, and they form the similar sample group.
All samples in the similar sample group are traversed, and the multi-structure text with the largest connection number to the anchor sample is taken as the image's most similar sample. If several most similar samples share the same connection number, the mean of their features is used as the image's most similar sample.
Finally, the second loss function is constructed with the goal of making the feature distance between the image sample and the image's most similar sample as large as possible.
Step 3: similarly, for any text sample, establish a similar sample group, select the most similar sample from the group (called the text's most similar sample), and use it to construct a loss function, as follows:
For any text sample, the corresponding image sample is its positive sample, defined as the anchor sample. All neighboring samples of the anchor are obtained: every sample connected to the anchor through the defined paths is a neighboring sample. All paths are traversed, the samples path-linked to the anchor are collected, and they form the similar sample group.
All samples in the similar sample group are traversed, and the image node with the largest connection number to the anchor sample is taken as the text's most similar sample. If several most similar samples share the same connection number, the mean of their features is used as the text's most similar sample.
Finally, the third loss function is constructed with the goal of making the feature distance between the text sample and the text's most similar sample as large as possible.
Step 4: the sum of the first, second, and third loss functions, to be minimized, serves as the image-text retrieval loss function. Further, this step may include:
Step 1: construct the first loss function with the goal of decreasing the distance between a sample and its positive sample;
Step 2: construct the second loss function with the goal of increasing the distance between a sample and its negative sample;
Step 3: construct the third loss function with the goal of increasing the distance between a text sample and the corresponding text's most similar sample;
Step 4: combine the first, second, and third loss functions into the image-text retrieval loss function.
This alternative mainly illustrates how the loss function is constructed: a first loss function pulls a sample toward its positive sample, a second loss function pushes a sample away from its negative sample, a third loss function pushes a text sample away from the corresponding text's most similar sample, and the three are combined into the image-text retrieval loss function. The selection of the most similar samples is sketched below.
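As a sketch of the most-similar-sample mining that steps 2 and 3 rely on, the following hypothetical helper collects the path-linked similar sample group, picks the sample(s) with the largest connection number, and averages features on ties. The dictionary-based data layout is an assumption made purely for illustration.

    import numpy as np

    def most_similar_sample(anchor_paths, candidates, features):
        """Pick the most similar sample for an anchor.

        anchor_paths: {"main": set_of_ingredients, "process": set_of_techniques}
        candidates:   sample_id -> path nodes in the same layout
        features:     sample_id -> feature vector (np.ndarray)
        """
        # Connection number = shared path nodes summed over all path types.
        counts = {}
        for sid, paths in candidates.items():
            n = sum(len(anchor_paths[t] & paths.get(t, set()))
                    for t in anchor_paths)
            if n > 0:            # path-linked -> member of the similar group
                counts[sid] = n
        if not counts:
            return None
        best = max(counts.values())
        # Ties: average the features of all equally most-connected samples.
        tied = [features[sid] for sid, n in counts.items() if n == best]
        return np.mean(tied, axis=0)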
S104, training the initial image-text mutual retrieval network based on the image-text retrieval loss function and the training data to obtain the image-text mutual retrieval network.
On the basis of S103, this step trains the initial image-text mutual retrieval network with the image-text retrieval loss function and the training data, obtaining the final image-text mutual retrieval network.
With the constructed image-text mutual retrieval loss function, images and texts can be retrieved against each other effectively, improving both the efficiency and the effect of mutual retrieval.
In summary, the constructed image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; an initial image-text mutual retrieval network capable of processing multi-structure text data is then constructed; and training finally yields an image-text retrieval network that processes images more effectively, improving the retrieval of multi-modal data and the accuracy of inference.
The following describes an image-text mutual retrieval method using the image-text mutual retrieval network provided by the present application, through another embodiment.
In this embodiment, the method may include:
s201, when an image is input, carrying out feature extraction on the image based on an image multi-connection feature encoder of an image-text mutual inspection network to obtain image multi-connection features; wherein the image multi-connection feature encoder includes: the method comprises the steps of constructing an image classification network, an image detection network and an image structure of an image;
when an image is input, an image multi-connection feature encoder based on an image-text mutual inspection network performs feature extraction on the image to obtain image multi-connection features; wherein the image multi-connection feature encoder includes: the image classification network, the image detection network and the image structure construction network are adopted.
Further, this step may include:
Step 1: when an image is input, extract features of the input image as the overall features of the image;
Step 2: extract the principal component detection boxes and component information of the image, using the component information as component features;
Step 3: build a graph structure from the overall features and component features of each image to obtain graph structure features linking image features to component information.
S202, when text information is input, performing feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features;
that is, when text information is input, the text feature encoder of the image-text mutual retrieval network encodes the text information into the corresponding text encoding features.
Here, the text feature encoder includes: a node feature extraction layer, a connection relation construction layer, a graph construction layer, and a neighbor relation construction layer.
S203, retrieving against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain a retrieval result.
On the basis of S201 and S202, this step retrieves against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network, obtaining the retrieval result.
It can be seen that in this embodiment the constructed image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; an initial image-text mutual retrieval network capable of processing multi-structure text data is constructed; and training yields an image-text retrieval network that processes images more effectively, improving the retrieval of multi-modal data and the accuracy of inference.
The following describes a training method for the image-text mutual retrieval network provided by the present application, through another embodiment.
In this embodiment, the method may include:
S301, a client sends a network training instruction to a server, so that the server constructs an image multi-connection feature encoder and a text feature encoder, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; performs network construction based on the two encoders to obtain an initial image-text mutual retrieval network; constructs an image-text retrieval loss function; trains the initial network based on the loss function and training data to obtain the image-text mutual retrieval network; and sends the image-text mutual retrieval network;
S302, the client receives the image-text mutual retrieval network and displays a training completion message.
In this way, a more effective image-text mutual retrieval network can be trained on the server, improving the accuracy of inference.
The following describes an image-text mutual retrieval method using the image-text mutual retrieval network provided by the present application, through another embodiment.
In this embodiment, the method may include:
S401, a client inputs data to be retrieved to a server, so that when an image is input, the server performs feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images; when text information is input, the server performs feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features; and the server retrieves against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain and send a retrieval result;
S402, the client receives the retrieval result and displays it.
In this way, the newly trained image-text mutual retrieval network improves the inference effect.
The following describes an image-text mutual retrieval method using the image-text mutual retrieval network provided by the present application, through another embodiment.
In this embodiment, the method may include:
S501, a server receives data to be retrieved input by a client;
S502, when an image is input, performing feature extraction on the image with the image multi-connection feature encoder of the image-text mutual retrieval network to obtain image multi-connection features, where the image multi-connection feature encoder includes an image classification network, an image detection network, and a graph structure construction network for images;
S503, when text information is input, performing feature encoding on the text information with the text feature encoder of the image-text mutual retrieval network to obtain the corresponding text encoding features;
S504, retrieving against the image multi-connection features or the text encoding features through the output layer of the image-text mutual retrieval network to obtain a retrieval result;
S505, sending the retrieval result to the client so that the client can display it.
In this way, the newly trained image-text mutual retrieval network improves the inference effect.
The following describes an image-text mutual retrieval method using the image-text mutual retrieval network provided by the present application, through another embodiment.
This embodiment provides a new neural network structure and training method for the retrieval task between multi-structure texts and images.
The embodiment mainly comprises the following parts: 1. constructing a multi-structure text graph neural network; 2. constructing multi-structure text connection relations; 3. determining neighbor relations; 4. extracting image features; 5. constructing and determining image neighbor relations; 6. constructing the loss function.
The first part is image feature extraction and image neighbor relation construction.
This embodiment uses recipe data (dish images and their recipe texts) as an example to explain the specific implementation process, but the approach can also be applied to other fields.
Referring to fig. 2, fig. 2 is a sample schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
In this embodiment, each sample consists of 1 image and 1 multi-structure text, as shown in fig. 2.
During training, 1 image corresponds to 1 multi-structure text.
Referring to fig. 3, fig. 3 is an image encoding schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
In the image encoding process, an image classification network, such as a ResNet, is first adopted to extract features of the whole image as its overall feature representation. An image detection network, such as a YOLO network, is then adopted to extract the principal component detection boxes and component information of the whole image.
In this embodiment, the principal component detection boxes in the image are obtained through the image detection network, the detection features corresponding to each box are extracted as component features, and all boxes are traversed to complete the extraction of all component features and component information.
The overall feature and the component information features of the image are combined as the final feature representation of the image.
The combined features are mapped to a uniform dimension through a fully connected layer, forming a new feature vector denoted as feature p.
Next, an image-based graph structure is established.
Referring to fig. 4, fig. 4 is a graph structure diagram of the image-text mutual retrieval method according to an embodiment of the present application.
The above operations yield, for each image sample, a final feature vector representation and its corresponding component information.
All image samples are traversed to construct a graph structure over image features and component information, as shown in fig. 4.
Each image sample corresponds to one or more main ingredients, and connection relations between the image features and the main-ingredient features are established.
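A minimal sketch of this image-graph construction follows, assuming the final image features and per-image component (main-ingredient) names are already available; the node ordering and binary edges are illustrative choices, not specified by this application.

    import torch

    def build_image_graph(image_feats, components_per_image):
        """Link each image node to the main-ingredient nodes found by detection.

        image_feats:          (N, d) tensor of final image features (feature p)
        components_per_image: list of N lists of ingredient names per image
        Returns the ingredient node order and a joint adjacency matrix over
        image nodes (indices 0..N-1) and ingredient nodes (N..N+M-1).
        """
        ingredients = sorted({c for comps in components_per_image for c in comps})
        n_img, n_ing = image_feats.shape[0], len(ingredients)
        idx = {name: n_img + k for k, name in enumerate(ingredients)}
        adj = torch.zeros(n_img + n_ing, n_img + n_ing)
        for i, comps in enumerate(components_per_image):
            for c in comps:
                adj[i, idx[c]] = adj[idx[c], i] = 1.0  # image <-> ingredient edge
        return ingredients, adj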
The second part constructs the multi-structure text graph neural network.
The multi-structure text of recipes is taken as the example throughout, but other text application fields are also applicable.
1) Selecting the data and its multi-structure semantic information.
Each dish has various types of data; three types are used in this embodiment: main ingredients, process (cooking technique), and cooking steps. Each dish contains all three items of information.
2) Establishing reasonable multi-node paths according to the screened semantic information, with at least 2 types of paths.
Referring to fig. 5, fig. 5 is a path schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
As shown in fig. 5, 2 types of paths are constructed in this embodiment: dish name-main ingredient-dish name, and dish name-process-dish name.
The construction rule is: as long as a main ingredient appears in the dish name or in the cooking step text, the dish is connected to that main-ingredient node; and as long as the keyword of a certain cooking technique, such as deep-frying, stir-frying, boiling, or pan-frying, appears in the dish name or the cooking step text, the dish is connected to that process node. All samples are traversed to complete the establishment of the multi-node paths.
3) And constructing a graph neural network.
Constructing the graph neural network includes: constructing the graph neural network nodes and their features, and constructing the connection relations between all nodes.
First, the graph neural network nodes and their features are constructed. Text features are extracted from the text information of each recipe, which includes the dish name and the step text.
In this embodiment, each dish is called a sample and includes a dish name and step text. After the text information of each sample is obtained, each character is converted into a feature vector using the word2vec method.
The feature vectors of all the text are input into a Transformer network to obtain the final feature expression of the text, referred to in this embodiment as the node feature. The feature of a node is the feature encoding of all characters of a sample.
Referring to fig. 6, fig. 6 is a text encoding schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
As shown in fig. 6, text 1 represents the dish name, text 2 represents the step text, and text 3 is not used in this embodiment.
Each character is converted into a feature vector Emb using the word2vec method. The text type is also acquired: in this embodiment, the dish name is text type 1, marked [1] in fig. 6, and the step text is text type 2, marked [2] in fig. 6.
Text position information is then acquired, i.e., the position of each character within its text. For example, in text 1, 'tomato fried eggs' (西红柿炒蛋), the first character 西 has position 1, the second character 红 has position 2, and the last character 蛋 has position 6. The position information of all characters in all texts (text 1 and text 2) is obtained in turn.
The Emb feature of each character, its position feature, and its type feature are added together to obtain the final input feature vector of the text, which is input into the Transformer network.
The Transformer network produces output feature vectors for all characters, each character corresponding to its own output feature vector.
In this embodiment, the mean of the output feature vectors of all characters is taken as the node feature of one sample. All samples are traversed to obtain the node feature of each sample.
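The node-feature extraction just described can be sketched as follows in PyTorch. The vocabulary size, model width, and the use of a learned nn.Embedding in place of pretrained word2vec vectors are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TextNodeEncoder(nn.Module):
        """Node-feature sketch: token embedding plus position and text-type
        embeddings, a Transformer encoder, then mean pooling over characters."""

        def __init__(self, vocab_size=30000, d_model=256, max_len=512, n_types=3):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)   # word2vec stand-in
            self.pos = nn.Embedding(max_len, d_model)      # character position
            self.typ = nn.Embedding(n_types, d_model)      # 1=dish name, 2=steps
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)

        def forward(self, tokens, positions, types):
            # Sum the three feature vectors to form the Transformer input.
            x = self.tok(tokens) + self.pos(positions) + self.typ(types)
            out = self.encoder(x)              # per-character output vectors
            return out.mean(dim=1)             # node feature = mean over chars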
Next, the connection relations between the nodes are constructed.
Through the above steps, a graph neural network with each dish as one node has been constructed; the neighbor relations of the nodes are built as follows.
First, the path nodes are established: in this embodiment, the main-ingredient nodes and the process nodes.
For example, the main-ingredient nodes include: tomato, cucumber, fish, meat, and so on.
The process nodes include: deep-frying, stir-frying, boiling, and pan-frying.
Two types of paths can then be constructed: dish name-main ingredient-dish name, and dish name-process-dish name.
Here, each dish name corresponds to 1 dish, 1 sample, and 1 node.
Each dish (each node) is traversed and connected to its path nodes, i.e., each node is connected to its main-ingredient nodes and process nodes.
Referring to fig. 7, fig. 7 is a node connection schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
As shown in fig. 7, all dishes containing tomato are connected to the main-ingredient node 'tomato', and all stir-fried dishes are connected to the process node 'stir-fry'.
Graph neural network computation is then carried out on the graph neural network established above and its neighbor relations.
As shown in fig. 7, the middle (green) nodes are sample nodes, and the proximity relations between sample nodes must be determined; they are expressed by connection relation coefficients. If there is a path connection (over any path) between two nodes, the 2 nodes are said to be neighbors. The number of connections between two nodes over all connection relations is called their connection number.
For example: 'tomato fried eggs' and 'tomato egg soup' are neighbors with a connection number of 2 (tomato and egg). Some samples have many main ingredients, so the connection number is often greater than 2.
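A small sketch of this connection construction follows, under the assumption that samples, main-ingredient names, and process keywords are given as plain strings; keyword matching by substring containment follows the rule described above.

    def build_connections(samples, main_ingredients, processes):
        """Connect dish nodes to path nodes by keyword matching, then count
        pairwise connection numbers between samples.

        samples: sample_id -> full text (dish name plus step text)
        """
        links = {}
        for sid, text in samples.items():
            links[sid] = {"main": {m for m in main_ingredients if m in text},
                          "process": {p for p in processes if p in text}}
        # Connection number of two samples = number of shared path nodes.
        conn, ids = {}, list(samples)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                n = sum(len(links[a][t] & links[b][t])
                        for t in ("main", "process"))
                if n:
                    conn[(a, b)] = n
        return links, conn

On the example above, 'tomato fried eggs' and 'tomato egg soup' would share the nodes tomato and egg, giving a connection number of 2.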
The following calculations are performed for the graph neural network. A graph structure has been constructed as above; the basic graph structure is defined as $G = (V, E)$. Here $V$ represents the set of graph neural network nodes, $V = \{v_1, v_2, \ldots, v_N\}$, where $v_i$ denotes the feature of the $i$-th node; $E$ represents the connection relations of the graph, $E = \{e_{ij}\}$ (i.e., whether there is a connection between nodes, the number of connections representing the strength of the connection, that is, the proximity), where $e_{ij}$ is the connection strength, i.e., the number of connections between the $i$-th and $j$-th nodes; and $A$ denotes the adjacency matrix, in which each element $a_{ij}$ represents the connection relation between nodes $v_i$ and $v_j$.
Each node is traversed in turn. For node $v_i$, the other nodes are sorted in descending order of their connection number $e_{ij}$ with $v_i$, and the first $K$ most similar nodes (those with the largest connection numbers) form the set $S$, called the neighbors of the node. Considering that different neighbor nodes have different importance, weight information is assigned to each connected edge of the node; one natural formulation normalizes the connection numbers over the neighbors:
$$a_{ij} = \frac{e_{ij}}{\sum_{v_k \in S} e_{ik}}$$
The constructed graph reflects the relationship between any two nodes through the adjacency matrix $A$: if $a_{ij} = 0$, there is no connection between nodes $v_i$ and $v_j$.
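The neighbor selection and edge weighting can be sketched as follows, assuming the normalized-connection-number weighting written above; K is a hyperparameter.

    import numpy as np

    def neighbor_weights(E, K):
        """Build the weighted adjacency matrix A from connection numbers E.

        For each node, keep its K most-connected neighbors and normalize their
        connection numbers so the node's outgoing edge weights sum to 1.
        """
        E = np.asarray(E, dtype=float)
        A = np.zeros_like(E)
        for i in range(E.shape[0]):
            order = np.argsort(-E[i])                      # descending connections
            top = [j for j in order[:K + 1] if j != i and E[i, j] > 0][:K]
            total = sum(E[i, j] for j in top)
            for j in top:
                A[i, j] = E[i, j] / total                  # a_ij = e_ij / sum e_ik
        return A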
The graph neural network computation obtains complementary information between neighboring nodes, yielding more robust node feature representations.
Computation on graph-structured data is a process of weighted summation over a vertex and its neighbor nodes. The graph neural network computation process may be defined as
$$Z = f(V, A)$$
where $V$ represents the set of graph neural network nodes $V = \{v_1, v_2, \ldots, v_N\}$ with $v_i$ the node features, $A$ is the adjacency matrix, indicating whether an edge exists between two nodes and how strong the connection is, and $Z$ represents the new features after the graph neural network computation.
For each layer of the graph neural network $H^{(l)}$, where the superscript $l$ denotes the $l$-th layer, the calculation formula is:
$$H^{(l+1)} = \sigma\!\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad D_{ii} = \sum_{j} a_{ij}$$
where $H^{(l)}$ represents the node features of the $l$-th layer of the graph neural network (equal to $V$ in the first layer); $D$ is a diagonal matrix whose diagonal elements are calculated as shown above; $W^{(l)}$ represents the trainable network parameters of the layer; $\sigma$ is a nonlinear activation; and $H^{(l+1)}$ is the updated node features after this layer of the graph neural network.
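A minimal PyTorch sketch of one such propagation layer follows; adding self-loops (A + I), common in standard graph convolutional networks, is omitted here, and the ReLU activation is an illustrative choice for the nonlinearity.

    import torch
    import torch.nn as nn

    class GraphConvLayer(nn.Module):
        """One propagation layer: H^{l+1} = sigma(D^{-1/2} A D^{-1/2} H^l W^l)."""

        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, out_dim, bias=False)   # trainable W^l

        def forward(self, H, A):
            deg = A.sum(dim=1).clamp(min=1e-12)               # diagonal of D
            d_inv_sqrt = deg.pow(-0.5)
            # Symmetric normalization: D^{-1/2} A D^{-1/2}
            A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
            return torch.relu(self.W(A_norm @ H))             # sigma = ReLU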
The third part designs the loss function, i.e., constructs the image-text mutual retrieval loss function.
As described above, through the image feature encoding process and the multi-structure text feature extraction process, the features of each sample's image and of its corresponding multi-structure text are obtained.
Retrieval is performed next. Its goal is to make the distance between corresponding sample features as small as possible and the features of different samples as far apart as possible, thereby improving retrieval accuracy. In practice, the samples most prone to confusion are similar ones: for example, steamed perch and boiled perch use similar ingredients and their finished-dish images look very alike, so they are easily retrieved incorrectly. In view of this, this embodiment provides a novel image-text mutual retrieval loss function to address the incorrect retrieval of similar samples.
The method comprises the following steps:
Step 1: for any image sample, the corresponding multi-structure sample is its positive sample, and this pair needs to be as close as possible. Assuming each training batch contains $N$ samples, the loss function can be constructed, for example, as the squared feature distance over the batch:
$$L_1 = \sum_{n=1}^{N} \left\| t_n - v_n \right\|_2^2$$
where $t_n$ represents the multi-structure text feature of the $n$-th sample and $v_n$ represents the image feature of the $n$-th sample.
Step 2: for any image sample, establish an image similar sample group, select the most similar sample from the group (called the image's most similar sample), and use it to construct a loss function, as follows:
For any image sample, the corresponding multi-structure sample is its positive sample, defined as the anchor sample. All neighboring samples of the anchor are obtained: every sample connected to the anchor through the defined paths (main-ingredient paths and process paths) is a neighboring sample. All paths are traversed, the samples path-linked to the anchor are collected, and they form the similar sample group.
All samples in the similar sample group are traversed, and the multi-structure text with the largest connection number to the anchor sample is taken as the image's most similar sample. If several most similar samples share the same connection number, the mean of their features is used as the image's most similar sample.
Referring to fig. 8, fig. 8 is a positive sample schematic diagram of the image-text mutual retrieval method according to an embodiment of the present application.
As shown in fig. 8, the multi-structure sample 2 has 2 connection relationships with the anchor sample (there may be many kinds of processes, main material-main material), so the multi-structure sample 2 is the most similar sample of the anchor sample, and is marked as the anchor sample
Figure SMS_31
. The present embodiment needs to make the feature distance between the image sample and the most similar sample of the image as large as possible, so the loss function is constructed as follows: />
Figure SMS_32
Step 3: similarly, for any text sample, establish a similar sample group, select from it the most similar sample, called the text most similar sample, and construct a loss function using the text most similar sample, as follows:

For any text sample, the image sample corresponding to the text sample is its positive sample, defined as the anchor sample. All neighbor samples of the anchor sample are then obtained: every sample connected to the anchor through a defined path (paths built from the main-material information obtained by the image detection network, as established in the step above) is a neighbor of the anchor. As in fig. 8, all paths are traversed, and the samples linked to the anchor sample by a path are collected into the similar sample group.
Traverse all samples in the similar sample group and find the image node with the largest number of connections to the anchor sample; record it as the text most similar sample. If several most similar samples tie for the largest connection count, take the mean of their features as the text most similar sample, denoted $\hat{v}_n$. The present embodiment needs to make the feature distance between the text sample and the text most similar sample as large as possible, so the loss function may be constructed as:

$$L_3 = \sum_{n=1}^{N} \max\left(0,\; m - \left\| t_n - \hat{v}_n \right\|_2\right)$$
The final loss function of this embodiment is:

$$L = L_1 + L_2 + L_3$$
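Reusing the sketches above, the combined objective may be assembled as follows (a sketch under the stated assumptions, not the patent's reference implementation):

```python
def total_loss(t, v, t_hat, v_hat, margin: float = 0.3):
    # L = L1 + L2 + L3: pull matched pairs together, push each image feature
    # away from its image most similar sample t_hat, and each text feature
    # away from its text most similar sample v_hat.
    return (positive_pair_loss(t, v)
            + push_apart_loss(v, t_hat, margin)
            + push_apart_loss(t, v_hat, margin))
```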
Finally, how to train this network is explained. The loss function above is used for gradient back-propagation in the training process, updating the parameters of the graph neural network.

The training process may include:
Step 1: construct the graph-based neural network, including image feature extraction, text node feature construction, graph construction, neighbor construction and connection relation construction.

Step 2: establish the loss function $L$ defined above.

Step 3: train the network according to the loss function until convergence.
The network training process is therefore as follows. Training of the graph neural network is divided into two phases: a forward propagation phase, in which data propagates from the low layers to the high layers, and a back propagation phase, in which, when the result of the forward pass does not match the expectation, the error is propagated from the high layers back to the bottom layers. The training process is as follows (a code sketch is given after the list):
1. Initialize all network layer weights, generally with random initialization.
2. Propagate the input image and text data forward through the graph neural network, the fully connected layer and the other layers to obtain an output value.
3. Compute the loss value of the network from the output value according to the loss function formula.
4. Propagate the error back into the network, obtaining in turn the back-propagation error of each layer: the graph neural network layer, the fully connected layer and the other layers.
5. Adjust all weight coefficients in the network according to each layer's back-propagation error, i.e., update the weights.
6. Randomly select a new batch of image-text data and return to step 2, obtaining an output value by forward propagation of the network.
7. Iterate in this way; training ends when the error between the network output value and the target value (label) falls below a certain threshold, or when the number of iterations exceeds a certain threshold.
8. Save the trained network parameters of all layers.
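A minimal training loop following steps 1-8 may look as follows; it assumes the data loader also yields the precomputed most-similar-sample features (in practice these would be selected from the similar sample groups each batch), and the saved file name is illustrative:

```python
import torch

def train(network, loader, epochs: int = 50, lr: float = 1e-4, threshold: float = 1e-3):
    # Step 1: layer weights are randomly initialized when the network is built.
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts, t_hat, v_hat in loader:   # step 6: fresh random batch
            t, v = network(images, texts)            # step 2: forward propagation
            loss = total_loss(t, v, t_hat, v_hat)    # step 3: loss from the formula
            optimizer.zero_grad()
            loss.backward()                          # step 4: back-propagate errors layer by layer
            optimizer.step()                         # step 5: update all weights
        if loss.item() < threshold:                  # step 7: stop below the error threshold
            break                                    # (the epoch bound caps the iteration count)
    torch.save(network.state_dict(), "mutual_inspection_net.pt")  # step 8: save parameters
```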
It can be seen that, in this embodiment, the constructed image multi-connection feature encoder comprises an image classification network, an image detection network and a graph structure construction network for images; an initial image-text mutual inspection network capable of processing multi-structure text data is then constructed; and finally, training yields an image-text retrieval network that processes image data more effectively, improving both the retrieval of multi-modal data and the accuracy of inference.
The following introduces the training device for the image-text mutual inspection network provided by the embodiment of the present application; the training device described below and the training method for the image-text mutual inspection network described above may be referred to in correspondence with each other.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training device for an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, the apparatus may include:
an encoder building module 110, configured to build an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder includes: an image classification network, an image detection network and a graph structure construction network for images;
a network construction module 120, configured to perform network construction based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
a loss function constructing module 130, configured to construct a graph-text retrieval loss function;
and the network training module 140 is configured to train the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data to obtain the image-text mutual inspection network.
The following introduces the image-text mutual inspection device of the image-text mutual inspection network provided in the embodiment of the present application, and the image-text mutual inspection device of the image-text mutual inspection network described below and the image-text mutual inspection method of the image-text mutual inspection network described above can be referred to correspondingly.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image-text mutual inspection device of an image-text mutual inspection network according to an embodiment of the present application.
In this embodiment, the apparatus may include:
the image feature processing module 210 is configured to, when an image is input, perform feature extraction on the image based on the image multi-connection feature encoder of the image-text mutual inspection network to obtain image multi-connection features; wherein the image multi-connection feature encoder includes: an image classification network, an image detection network and a graph structure construction network for images;
the text feature processing module 220 is configured to, when text information is input, perform feature encoding on the text information based on the text feature encoder of the image-text mutual inspection network to obtain corresponding text encoding features;
and the inference module 230 is used for retrieving the image multi-connection features or the text coding features through an output layer of the image-text mutual inspection network to obtain a retrieval result.
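The patent does not spell out the ranking rule of the output layer; one plausible realization, consistent with the distance-based losses above, is nearest-neighbor search over the encoded features:

```python
import torch

def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    # Rank gallery features (text features for an image query, image features
    # for a text query) by Euclidean distance to the query; nearest first.
    dists = torch.norm(gallery_feats - query_feat.unsqueeze(0), dim=1)
    return torch.topk(dists, k=top_k, largest=False).indices
```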
Referring to fig. 11, fig. 11 is a schematic structural diagram of a server provided in the embodiment of the present application, where the server may include:
a memory for storing a computer program;
and the processor is used for realizing the steps of the training method of the image-text mutual inspection network when executing the computer program.
As shown in fig. 11, which is a schematic diagram of a configuration of a server, the server may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 all communicate with each other through a communication bus 13.
In the embodiment of the present application, the processor 10 may be a Central Processing Unit (CPU), an application specific integrated circuit, a digital signal processor, a field programmable gate array or other programmable logic device, etc.
The processor 10 may call a program stored in the memory 11; in particular, the processor 10 may perform the operations in the embodiments of the training method for the image-text mutual inspection network and the image-text mutual inspection method.
The memory 11 is used for storing one or more programs, which may include program code comprising computer operation instructions. In this embodiment, the memory 11 stores at least a program implementing the following functions:
constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network for images;
constructing a network based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
constructing an image-text retrieval loss function;
and training the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data to obtain the image-text mutual inspection network.
In one possible implementation, the memory 11 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created during use.
Further, the memory 11 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for connecting with other devices or systems.
Of course, it should be noted that the structure shown in fig. 11 does not constitute a limitation on the server in the embodiment of the present application; in practical applications, the server may include more or fewer components than those shown in fig. 11, or some components may be combined.
The present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above training method for the image-text mutual inspection network or the steps of the image-text mutual inspection method of the image-text mutual inspection network.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The training method of the image-text mutual inspection network, the image-text mutual inspection method of the image-text mutual inspection network, the training device of the image-text mutual inspection network, the image-text mutual inspection device of the image-text mutual inspection network, the server and the computer-readable storage medium provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method of the present application and its core idea. It should be noted that, for those skilled in the art, several improvements and modifications may be made to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (12)

1. A training method of an image-text mutual inspection network, characterized by comprising:
constructing an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network for images;
constructing a network based on the image multi-connection feature encoder and the text feature encoder to obtain an initial image-text mutual inspection network;
constructing an image-text retrieval loss function;
and training the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data to obtain the image-text mutual inspection network.
2. The training method according to claim 1, wherein the image classification network is configured to extract features of an input image as the overall features of the image;
the image detection network is used for extracting a main component detection frame and component information of the image, and taking the component information as component characteristics;
and the graph structure construction network of the images is used for constructing a graph structure based on the overall characteristics and the component characteristics of each image to obtain the graph structure characteristics from the image characteristics to the component information.
3. The training method of claim 1, wherein the text feature encoder comprises: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer.
4. The training method according to claim 3, wherein the node feature extraction layer is configured to perform feature coding on text information of a multi-structure text to obtain a feature code corresponding to each sample;
the connection relation construction layer is used for taking each sample as a node and constructing connection relations between the nodes based on the semantic information of each node;

the graph construction layer is used for constructing the graph neural network corresponding to each node based on the connection relations between the nodes;

and the neighbor relation construction layer is used for weighting the edges of the graph neural network of the corresponding nodes based on the number of connections between the nodes, to obtain the corresponding node features.
5. The training method according to claim 1, wherein constructing the image-text retrieval loss function comprises:
constructing a first loss function by taking the distance between the sample and the positive sample as a target;
constructing a second loss function by taking the distance between the sample and the negative sample as a target;
constructing a third loss function by taking the distance between the text sample and the corresponding text most similar sample as a target;
and combining the first loss function, the second loss function and the third loss function into the image-text retrieval loss function.
6. An image-text mutual inspection method of an image-text mutual inspection network, characterized by comprising:

when an image is input, performing feature extraction on the image based on an image multi-connection feature encoder of the image-text mutual inspection network to obtain image multi-connection features; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network for images;
when text information is input, performing feature coding on the text information based on a text feature coder of the image-text mutual inspection network to obtain corresponding text coding features;
and retrieving the image multi-connection features or the text coding features through an output layer of the image-text mutual inspection network to obtain a retrieval result.
7. The image-text mutual inspection method according to claim 6, wherein, when an image is input, performing feature extraction on the image based on the image multi-connection feature encoder of the image-text mutual inspection network to obtain image multi-connection features comprises:
when an image is input, extracting the characteristics of the input image and taking the characteristics as the overall characteristics of the image;
extracting a main component detection frame and component information of the image, and taking the component information as component characteristics;
and constructing a graph structure based on the overall features and the component features of each image to obtain graph structure features from the image features to the component information.
8. The image-text mutual inspection method according to claim 6, wherein the text feature encoder comprises: a node feature extraction layer, a connection relation construction layer, a graph construction layer and a neighbor relation construction layer.
9. A training device for an image-text mutual inspection network, characterized by comprising:

the encoder building module is used for building an image multi-connection feature encoder and a text feature encoder; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network for images;
the network construction module is used for constructing a network based on the image multi-connection characteristic encoder and the text characteristic encoder to obtain an initial image-text mutual inspection network;
the loss function construction module is used for constructing an image-text retrieval loss function;
and the network training module is used for training the initial image-text mutual inspection network based on the image-text retrieval loss function and the training data to obtain the image-text mutual inspection network.
10. An image-text mutual inspection device of an image-text mutual inspection network, characterized by comprising:

the image feature processing module is used for performing feature extraction on an image based on an image multi-connection feature encoder of the image-text mutual inspection network when the image is input, to obtain image multi-connection features; wherein the image multi-connection feature encoder comprises: an image classification network, an image detection network and a graph structure construction network for images;
the text feature processing module is used for performing feature encoding on text information based on a text feature encoder of the image-text mutual inspection network when the text information is input, to obtain corresponding text encoding features;
and the reasoning module is used for retrieving the image multi-connection characteristics or the text coding characteristics through an output layer of the image-text mutual inspection network to obtain a retrieval result.
11. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the training method of the image-text mutual inspection network according to any one of claims 1 to 5 and/or the steps of the image-text mutual inspection method of the image-text mutual inspection network according to any one of claims 6 to 8 when executing the computer program.

12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the training method of the image-text mutual inspection network according to any one of claims 1 to 5 and/or the steps of the image-text mutual inspection method of the image-text mutual inspection network according to any one of claims 6 to 8.