WO2024098533A1

WO2024098533A1 - Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium

Info

Publication number: WO2024098533A1
Application number: PCT/CN2022/142513
Authority: WO
Inventors: 李仁刚; 王立; 范宝余; 郭振华
Original assignee: 苏州元脑智能科技有限公司
Priority date: 2022-11-08
Filing date: 2022-12-27
Publication date: 2024-05-16
Also published as: CN115438215B; CN115438215A

Abstract

Provided are an image-text bidirectional search method, apparatus and device, and a non-volatile readable storage medium, which are applied to the technical field of information retrieval. The method comprises: pre-training an image-text bidirectional search model, which comprises a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network; calling the image recognition network to acquire target recognition features of an image to be searched for; on the basis of the text heterogeneous graph network, acquiring text features and target text features of text to be searched for, wherein the text heterogeneous graph network is constructed by taking the target text features and the target recognition features as nodes; acquiring image features of said image on the basis of the image heterogeneous graph network, wherein the image heterogeneous graph network is constructed by taking original image features and the target recognition features of said image as nodes; and inputting the image features and the text features into the image-text bidirectional search model, so as to obtain an image-text search result.

Description

Image and text bidirectional search method, device, equipment and non-volatile readable storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application filed with the China Patent Office on November 8, 2022, with application number 202211388778.5 and application name “Graphic-text bidirectional search and matching model training method, device, equipment and medium”, all contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of information retrieval technology, and in particular to a method, device, equipment and non-volatile readable storage medium for bidirectional search of images and texts.

Background technique

As computer technology and network technology are widely used in daily work and life, the amount and type of data are increasing day by day. Information expressing the same goal runs in different media, and the information exists in different data formats, such as image data, text data, audio data, video data, etc. For example, for the same server, the physical parameters and performance information of the server can be described in text data and published on a web page, or it can be directly described in video form and published on a video website. Accordingly, users may want to retrieve all relevant data in different formats based on target search terms such as server models, or they may want to retrieve other types of data of the same type based on data of a certain format, that is, two-way search between different types of data.

Related technologies usually implement image-text mutual retrieval based on the attention mechanism, which uses attention to weight the extracted image features into the text features, reconstruct the text features, and enhance the similarity between text and image. Although this method can use attention to reconstruct electronic text features, it simply uses the unidirectional attention of natural images to electronic text when reconstructing electronic text features. Since there is a corresponding relationship between natural images and electronic texts, the corresponding high-order features affect each other. Only reconstructing electronic text features while ignoring natural image features makes it impossible for natural image features to accurately correspond to electronic text features, affecting image-text mutual retrieval. In addition, it is unable to obtain joint features when different modal features interact. For data involving sequence or dependency, such as tasks based on step retrieval, the retrieval accuracy between images and texts is low.

In view of this, how to improve the accuracy of two-way search between image data and text data is a technical problem that technicians in the relevant field need to solve.

Summary of the invention

The present application provides a method, device, equipment and non-volatile readable storage medium for bidirectional search between image data and text data, which effectively improves the accuracy of bidirectional search between image data and text data.

In order to solve the above technical problems, the embodiments of the present application provide the following technical solutions:

A first aspect of an embodiment of the present application provides a method for bidirectional search of images and texts, including:

Pre-training a bidirectional image-text search model; the bidirectional image-text search model includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network;

Calling the image recognition network to obtain target recognition features of the target image blocks contained in each sub-image of the image to be searched;

Based on the text heterogeneous graph network, text features of the text to be searched that only contains one type of target text data are obtained; the target text features corresponding to the target text data include target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

Based on the image heterogeneous graph network, image features of the image to be searched including a group of sub-images are obtained; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;

The image features and text features are input into the image-text bidirectional search model to obtain the image-text search results.

Optionally, after pre-training the image-text bidirectional search model, it also includes:

In response to the text splitting instruction, the target recognition feature is split into a plurality of text phrases and/or text words, and the target text data is split into a plurality of text sentences;

Inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain a plurality of first-category node features;

Each text sentence is input into the text feature extraction model to obtain multiple second-category node features.

Optionally, before obtaining the text features of the text to be searched that only contains one type of target text data, the following further includes:

Build a language representation model; the language representation model includes a text information input layer, a feature extraction layer, and a text feature output layer; the feature extraction layer is a bidirectional encoder based on a converter;

The language representation model is trained using a natural language text sample dataset, and the trained language representation model is used as a text feature extraction model.

Optionally, each text sentence is input into a text feature extraction model, including:

The position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence is input into the text feature extraction model.

Optionally, before inputting each text phrase and/or text word into a pre-built text feature extraction model to obtain a plurality of first-category node features, and before inputting each text sentence into a text feature extraction model to obtain a plurality of second-category node features, the method further includes:

Obtaining the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model;

The data type includes a first identifier for identifying a target identification feature and a second identifier for identifying target text data.

Optionally, the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition feature and the target text feature, including:

For each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in turn;

If the target phrase contained in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase;

If the target word included in the current text sentence is the same as the current text word, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word.

Optionally, obtaining target recognition features of target image blocks contained in each sub-image of the image to be searched includes:

Preliminarily using a target training sample set in which corresponding target recognition features are annotated in an image sample containing a plurality of sub-images, an image recognition network is trained;

The image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.

Optionally, before training the image recognition network using the target training sample set in which corresponding target recognition features are annotated in the image sample including the plurality of sub-images, the method further includes:

Pre-build the target recognition network structure, which includes input layer, convolution structure, pooling layer and classifier;

The convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes multiple connected residual blocks, each residual block includes multiple convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;

The pooling layer is used to convert the output features of the convolution structure into target feature vectors and transmit them to the classifier;

The classifier is used to calculate the target feature vector and output the probability of the category label.

Optionally, the text heterogeneous graph network includes multiple layers of first graph attention networks, and each layer of the first graph attention network is further integrated with a first fully connected layer; obtaining text features of the text to be searched that only contains one type of target text data includes:

For each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes, the node feature of the current text heterogeneous node is updated;

Based on the node features of each text heterogeneous node in the updated text heterogeneous graph network, text features of the text to be searched are generated.

Optionally, according to whether the current text heterogeneous node has a connection relationship with other text heterogeneous nodes and the association relationship between the text heterogeneous nodes, the node feature of the current text heterogeneous node is updated, including:

Determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type;

Based on the correlation between the node features of the current text heterogeneous node and the node features of each target text heterogeneous node, the initial weight values of the current text heterogeneous node and each target text heterogeneous node are calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;

Based on the weight value and each target text heterogeneous node, the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.

Optionally, based on the correlation between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node, the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated, including:

The weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula is:

Where _zqp is the initial weight value of the qth text heterogeneous node and the pth text heterogeneous node, LeakyReLU() is the activation function, and _Wa , _Wb , and _Wc are known

dimensional matrix,

is the node feature of the qth text heterogeneous node,

is the node feature of the pth text heterogeneous node.

Optionally, based on the weight value and each target text heterogeneous node, the node feature of the current text heterogeneous node is updated, including:

Call the initial update relational expression to update the node features of the current text heterogeneous node; the initial update relational expression is:

In the formula,

is the updated node feature of the qth text heterogeneous node, σ is a hyperparameter, _aqp is the normalized weight of the qth node of the step node and the pth node of the component node, and _Wv is the known

dimensional matrix,

is the node feature of the pth text heterogeneous node, and _NP is the total number of target text heterogeneous nodes.

Optionally, the second-class node features corresponding to the target text data have a sequential execution order, and based on the text heterogeneous graph network, after obtaining the text features of the text to be searched that only contains one class of target text data, the method further includes:

Input each second-category node feature and sequence information into a pre-trained time series feature extraction model to obtain time series information features;

The time series information features are mapped to the text features through the fully connected layer.

Optionally, each second-category node feature and sequence information is input into a pre-trained time series feature extraction model to obtain time series information features, including:

Based on the sequence between the features of each second type of node, the features of each second type of node are input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second type of node feature;

The timing information feature is determined according to the characteristic timing coding feature of each second-category node.

Optionally, each second-category node feature is input into a bidirectional long short-term memory neural network in order and in reverse order to obtain a temporal coding feature of each second-category node feature, including:

For each second-class node feature, the positive sequence encoding relation is called to perform positive sequence encoding on the current second-class node feature to obtain the positive sequence encoding feature; the positive sequence encoding relation is:

The reverse coding relational formula is called to encode the current second-category node feature in forward order to obtain the reverse coding feature; the reverse coding relational formula is:

The forward coding feature and the reverse coding feature are used as the temporal coding features of the current second type of node features;

Where q∈[1,Q],

is the output of the qth unit in the forward encoding direction of the bidirectional long short-term memory neural network,

is the qth second-category node feature of the Tth layer graph attention network in the text heterogeneous graph network,

is the output of the q-1th unit in the forward encoding direction of the bidirectional long short-term memory neural network, Q is the total number of node features of the second category,

is the output of the qth unit in the reverse encoding direction of the bidirectional long short-term memory neural network,

is the output of the q+1th unit in the reverse encoding direction of the bidirectional long short-term memory neural network,

is the backward encoding function of the bidirectional long short-term memory neural network,

is the forward encoding function of the bidirectional long short-term memory neural network.

Optionally, the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; obtaining image features of an image to be searched including a group of sub-images, including:

Input the image to be searched into a pre-trained image feature extraction model to obtain the original image features of the image to be searched;

For each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, update the node feature of the current image heterogeneous node;

Generate image encoding features of the text to be searched based on the node features of each image heterogeneous node of the updated image heterogeneous graph network;

The image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.

A second aspect of the embodiment of the present application provides a device for bidirectional search of images and texts, including:

An image recognition module is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain target recognition features of the target image block contained in each sub-image of the image to be searched;

A text feature extraction module is used for obtaining text features of a text to be searched that contains only one type of target text data based on a text heterogeneous graph network of a bidirectional search model for text and images; target text features corresponding to the target text data include target recognition features; target recognition features and target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

An image feature extraction module is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;

The bidirectional search module is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results; the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.

A third aspect of the present application embodiment provides a method for training an image-text matching model, comprising:

Pre-build a bidirectional image and text search model;

For each group of training samples in the training sample set, original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples are obtained respectively; the target text features include target recognition features; the image samples include a group of sub-images;

Based on taking the target recognition features and target text features as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the graph-text bidirectional search model is constructed;

Based on taking original image features and target recognition features as image heterogeneous node features respectively, and determining the connection edges according to the correlation between each target recognition feature and the original image feature, an image heterogeneous graph network of the image-text bidirectional search model is constructed;

The image features of each group of training samples are input into the image heterogeneous graph network, and the text features are input into the text heterogeneous graph network to train the image-text bidirectional search model.

A fourth aspect of the embodiments of the present application provides a training device for an image-text matching model, comprising:

The feature extraction module is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set; the target text features include the target recognition features; the image samples include a group of sub-images;

A model building module is used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining the connection edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining the connection edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;

The model training module is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.

A fifth aspect of the embodiment of the present application further provides a bidirectional image-text search device, including a processor, a memory, a human-computer interaction component, and a communication component;

The human-computer interaction component is used to receive training sample set selection requests, model training requests, and search requests input by users through the information input/information output interface, and to display graphic and text search results to users;

The communication component is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task;

The processor is used to execute the computer program stored in the memory to implement the steps of any of the above-mentioned image-text bidirectional search methods and/or the above-mentioned image-text matching model training method.

The sixth aspect of the embodiment of the present application also provides a non-volatile readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of any of the previous image-text bidirectional search methods and/or the previous image-text matching model training method are implemented.

The advantage of the technical solution provided by the present application is that a graph neural network for extracting corresponding features is constructed based on the data contained in a text containing only one type of text data and an image containing a group of sub-images and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation relationships in the real world, and image features that reflect the images and their internal correlation relationships in the real world. Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation relationship between the fine-grained features of images and texts, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.

In addition, the embodiments of the present application also provide a training method for an image-text matching model and a corresponding implementation device, an image-text bidirectional search device and a non-volatile readable storage medium for the image-text bidirectional search method, thereby further making the image-text bidirectional search method more practical, and the image-text bidirectional search method, device, equipment and non-volatile readable storage medium have corresponding advantages.

It is to be understood that the foregoing general description and the following detailed description are exemplary only and are not restrictive of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present application or the related technologies, the drawings required for use in the embodiments or the related technical descriptions are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

FIG1 is a schematic diagram of a flow chart of a method for bidirectional image and text search provided by an embodiment of the present application;

FIG2 is a schematic diagram of a text heterogeneous graph network structure provided in an embodiment of the present application;

FIG3 is a schematic diagram of an image heterogeneous graph network structure provided in an embodiment of the present application;

FIG4 is a flow chart of a method for training an image-text matching model provided in an embodiment of the present application;

FIG5 is a structural diagram of an implementation of a cross-media retrieval device provided in an embodiment of the present application;

FIG6 is a structural diagram of an implementation of a training device for an image-text matching model provided in an embodiment of the present application;

FIG7 is a structural diagram of an implementation of a bidirectional image-text search device provided in an embodiment of the present application;

FIG8 is a structural diagram of another implementation of a bidirectional image-text search device provided in an embodiment of the present application;

FIG9 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application.

Detailed ways

In order to enable those skilled in the art to better understand the present application, the present application is further described in detail below in conjunction with the accompanying drawings and implementation methods. Obviously, the described embodiments are only part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in the field without making creative work are within the scope of protection of the present application.

The terms "first", "second", "third", "fourth", etc. in the specification and claims of this application and the above drawings are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.

After introducing the technical solutions of the embodiments of the present application, various non-limiting implementation methods of the present application are described in detail below.

First, refer to FIG. 1, which is a flow chart of a method for bidirectional image-text search provided by an embodiment of the present application. The embodiment of the present application may include the following contents:

S101: Pre-train the image-text bidirectional search model.

The bidirectional image-text search model of this embodiment is used to perform bidirectional image-text search tasks between text data and image data, that is, image data matching the text data to be searched can be determined from a known image database based on the text data to be searched, and text data matching the text data to be searched can also be determined from a known text database based on the image data to be searched. The bidirectional image-text search model of this embodiment includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network; the text heterogeneous graph network is used to process input text data such as text samples or text to be searched and finally output text features corresponding to the text data, and the image heterogeneous graph network is used to process input image data such as image samples or images to be searched, and output the final image features of the image data. The image recognition network used for the text heterogeneous graph network and the image heterogeneous graph network can be built based on any graph structure in any technology, which does not affect the implementation of this application. The image recognition network is used to identify the category information of a certain type of image block in an image such as an image to be searched and an image sample used in the training model process, that is, the final output is the recognition label information corresponding to the specified recognition target included in the input image, which is called the target recognition feature for the convenience of description.

S102: calling an image recognition network to obtain target recognition features of a target image block contained in each sub-image of the image to be searched.

The image to be searched and the subsequent image samples of this embodiment include a group of sub-images, that is, a group of sub-images together constitute the image to be searched, the image to be searched is a recipe step image, each step corresponds to a sub-image, and the recipe step image includes the sub-images corresponding to each step. The image blocks containing a certain type of designated information of the corresponding text data in the image to be searched are called target image blocks, and the identification information of these target image blocks is the target identification feature, that is, the target identification feature is the label information of the target image block in the image to be searched or the image sample, and the label information belongs to this type of designated information. Taking the recipe cooking step text and the recipe step diagram as an example, the designated information can be the recipe ingredients, the target image block is the image block that identifies the recipe ingredients, and the target identification feature is the recipe ingredient information to which each target image block belongs; taking the electronic device description document and the electronic device manual image as an example, the designated information is the product structure of the electronic device, the target image block is the image block that identifies the product structure, and the target identification feature is the identification information that the target image block belongs to a certain type of product structure, such as a switch or indicator light.

S103: Based on the text heterogeneous graph network, obtain text features of the text to be searched that only contains one type of target text data.

The text of the present application includes the text to be searched and the text samples in the training sample set used in the subsequent model training process, which only contain one type of text data. The so-called one type of text data refers to the data in the text being of the same type. Taking the recipe text as an example, the recipe text may include three types of text data: dish name, recipe ingredients, and cooking steps. The text to be searched and the text samples of the present application can only contain one type of text data. Taking the server working principle description document as an example, this type of text may include two types of text data, namely, server structure composition and working principle. The text to be searched and the text samples of the present application can only contain one type of text data, that is, the text to be searched and the text samples only include the working principle of the server. After obtaining the trained model in the previous step, the corresponding text features are obtained by calculating the text heterogeneous graph network based on the text to be searched. The text features of this embodiment refer to the features obtained after performing graph structure operations on the text heterogeneous graph network, and the target text features are the data obtained by directly extracting the text to be searched using the text feature extraction method. There is an inclusion relationship between the target text feature of this step and the target recognition feature obtained in the previous step. For the convenience of description, it can be defined that the target text feature corresponding to the target text data includes the target recognition feature. The so-called inclusion relationship means that the target recognition feature exists in the target text feature corresponding to the target text data. Taking the recipe text as an example, the target recognition feature represents the recipe ingredients, and the target text feature represents the cooking steps; taking the electronic device manual as an example, the target recognition feature can be the product structure of the electronic device, and the target text feature can be the instruction manual. There is an inclusion relationship between the target text feature and the target recognition feature of this embodiment. The target recognition feature is composed of the recognition features corresponding to multiple target image blocks of each sub-image. For the convenience of description and without ambiguity, in the process of constructing a text heterogeneous graph network, the recognition feature of each target image block of each sub-image can be called a first-class node feature, and the target text feature is composed of multiple text features, each of which is called a second-class node feature. For a specified first-class node feature, if it is included in a second-class node feature, then the first-class node feature has an association relationship with the second node feature. After obtaining the target text features of the text to be searched and the target recognition features of the image to be searched, by analyzing each second-class node feature of the target text features, it is determined whether it contains a first-class node feature or several first-class node features of the target recognition features, and the correlation between the target recognition features and the target text features can be determined. After obtaining the target text features and the target recognition features, these two different types of features are used as heterogeneous node features of the graph structure network, and the connection edges of the graph structure network can be determined according to whether there is an inclusion relationship between different node features, that is, the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features. After substituting the text feature information of the text to be searched and the image recognition information of the image to be searched into the text heterogeneous graph network, the features corresponding to the graph structure can be extracted by performing graph structure operations, and this type of features is used as the text features in this step.

S104: Based on the image heterogeneous graph network, image features of the image to be searched including a group of sub-images are obtained.

The image heterogeneous graph network of this step also includes nodes and connecting edges. The nodes of the image heterogeneous graph network of this embodiment are heterogeneous nodes, that is, there are at least two features with different properties and structures. For images, the extracted image features can only be used as a node feature. Since the image features and text features have an associated corresponding relationship, the target recognition features extracted in S102 can be used as the node features of the image heterogeneous graph network. Considering that each first-class node feature of the target recognition feature is included in each second-class node feature of the target text feature, the first-class node feature can be used as the heterogeneous node feature of the image heterogeneous graph network, that is, the original image features of the image to be searched and the target recognition features are used as the node features of the image heterogeneous graph network. The connecting edges of the image heterogeneous graph network are determined by the association between the target recognition features and the original image features. The original image features refer to the image features extracted directly using image feature methods such as artificial convolutional neural networks, VGG16 (Visual Geometry Group Network, visual image generator), Resnet (Deep residual network, deep residual network), etc. The image features in this step are obtained by substituting the image features of each sub-image of the image to be searched into the image heterogeneous graph network and performing graph structure operations on the image heterogeneous graph network.

S105: Input the image features and text features into the image-text bidirectional search model to obtain image-text search results.

The image-text search result of this embodiment refers to the matching degree of the text features extracted in step S103 and the image features extracted in step S104, that is, after the text features and the image features are input into the image-text bidirectional search model, the image-text bidirectional search model can determine whether the features are close by calculating the vector distance such as the Euclidean distance. If they are close, the image to be searched and the text to be searched are matched, that is, the image to be searched and the text to be searched are a set of data corresponding to each other. If they are not close, the image to be searched and the text to be searched are not matched.

In the technical solution provided in the embodiment of the present application, graph neural networks for extracting corresponding features are constructed based on the data contained in the text and image and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation in the real world, and image features that reflect the image and its internal correlation in the real world. Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation between the fine-grained features of the image and the text, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.

The above embodiment does not limit how to extract the target identification features. Based on the above embodiment, the present application also provides an optional extraction implementation method of the target identification features, which may include:

An image recognition network is trained in advance using a target training sample set in which corresponding target recognition features are annotated in an image sample containing multiple sub-images; the image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.

In this embodiment, the image recognition network is used to identify the category information of the target image block in the image to be searched, and the target training sample set contains multiple images marked with target features, that is, each image sample contained in the target training sample set carries a category label. Each image can be an image directly obtained from the original database, or it can be an image obtained by flipping, resizing, stretching, etc. the original image, which does not affect the implementation of the present application. The image recognition network can be built based on any existing model structure that can recognize image categories, such as convolutional neural networks, artificial neural networks, etc., and the present application does not impose any restrictions on this. As an optional implementation, the target recognition network structure may include an input layer, a convolution structure, a pooling layer and a classifier; the convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation component; the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier; the classifier is used to calculate the target feature vector and output the probability of the category label.

In order to make the technical solution of the present application more clearly understood by those skilled in the art, the present application takes recipe text and recipe image as examples to illustrate the implementation process of the present embodiment, that is, the process of classifying the main components of each recipe image through an image classification network and constructing component nodes with the classified category information may include:

First, a step diagram dataset is generated through multiple recipe step diagrams, and the main components of some recipe step diagrams are annotated, such as flour, sugar, papaya, etc. The annotated recipe step diagrams are used to train the ResNet50 network to classify the main components of the image. The ResNet50 network structure can include seven parts. The first part does not contain residual blocks, and mainly performs convolution, regularization, activation function, and maximum pooling calculations on the input. The second, third, fourth, and fifth parts of the structure all contain residual blocks. Each residual block contains three layers of convolution. After the convolution calculation of the first five parts, the pooling layer converts it into a feature vector. Finally, the classifier calculates this feature vector and outputs the category probability. The trained ResNet50 network can obtain the main component information of the input image very well.

It is understandable that the second type of text features from the text to be searched to the target text features need to undergo a text feature extraction operation. In the above embodiment, there is no limitation on how to extract text features from the text to be searched. Based on the above embodiment, the present application also provides an optional implementation of text features, which may include the following contents:

In response to the text splitting instruction, the target recognition feature is split into multiple text phrases and/or text words, and the target text data is split into multiple text sentences; each text phrase and/or text word is input into a pre-trained text feature extraction model to obtain multiple first-category node features; each text sentence is input into the text feature extraction model to obtain multiple second-category node features.

The text splitting instruction is used to split the text to be searched into multiple text sentences, and the target recognition feature is split into multiple text phrases or text words, and any text data splitting algorithm can be used. For this implementation, correspondingly, the method for determining each connection edge in the text heterogeneous graph network can be: for each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in turn; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase; if the target word contained in the current text sentence is the same as the current text word, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word. The text feature extraction model of this embodiment is used to extract text features from input text data or target recognition features. As an optional implementation, the training process of the text feature extraction model is: building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a bidirectional encoder based on a converter; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as a text feature extraction model. The language representation model may be, for example, Bert (Bidirectional Encoder Representation from Transformers, a pre-trained language representation model) or word2vec (word to vector, a word vector model), which does not affect the implementation of the present application. After obtaining the trained text feature extraction model, in order to further improve the accuracy of text feature extraction, the data type may also be set for the text data at the same time, and the data type includes a first identifier for identifying the target recognition feature and a second identifier for identifying the target text data or the target text feature. While inputting the text to be searched into the text feature extraction model, the data type of the data to be input into the text feature extraction model at the next moment is obtained, and the position information of each text sentence and each phrase and word contained in each text sentence in the current text sentence may also be input into the text feature extraction model. The data type is input into the text feature extraction model together with the corresponding data.

It is understandable that multiple second-category text features can be obtained by extracting target text data from the text to be searched. For each second-category text feature having a sequential execution order, or for a scenario where the second-category text feature has a sequential dependency relationship, in order to further extract text features that fit the actual text, the present application further performs temporal feature extraction and provides a method for extracting temporal features, which may include the following contents:

If there is a sequential execution order between each second-category node feature, each second-category node feature and sequence information are input into a pre-trained temporal feature extraction model to obtain temporal information features. Optionally, the temporal feature extraction model can be a bidirectional long short-term memory neural network. Accordingly, based on the sequence between each second-category node feature, each second-category node feature can be input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second-category node feature; the temporal information features are determined according to the temporal coding features of each second-category node feature. Optionally, for each second-category node feature, the temporal coding features can include forward coding features and reverse coding features. In order to integrate the temporal features into the final generated text features, the extracted temporal information features can be mapped to the text features through a fully connected layer. The acquisition of forward coding features and reverse coding features can be carried out by the following method: the forward coding relationship can be called to perform forward coding on the current second-category node feature to obtain the forward coding feature; the forward coding relationship can be expressed as:

Then, the reverse coding relation is called to encode the current second-category node feature in the forward order to obtain the reverse coding feature; the reverse coding relation can be expressed as:

Where q∈[1,Q],

Of course, for the extraction of time series features, this embodiment can also be implemented based on a long short-term memory neural network. After obtaining the second type of text features, the relationship can be called

q∈[1,Q] to obtain the time series feature information, where

Represents the output of the qth unit in the LSTM.

It represents the output of the q-1th unit in the LSTM, that is, the output of the previous state.

The above embodiment does not limit how to generate text features based on the text heterogeneous graph network. The extraction of text features is obtained through heterogeneous graph operations, and heterogeneous graph operations are also the process of updating the nodes of the text heterogeneous graph network. This embodiment provides an optional implementation method, which may include the following contents:

In order to improve the model accuracy of the text heterogeneous graph network, the embodiment may stack multiple layers of the same structure. For the convenience of description, each layer may be called a first graph attention network, and a first fully connected layer is also integrated after each layer of the first graph attention network; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node feature of the current text heterogeneous node is updated according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.

The process of updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node has a connection relationship with the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes may include:

The process of calculating the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association relationship between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node may include:

The weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula can be:

dimensional matrix,

represents a d×d dimensional real vector,

represents a real vector,

is the node feature of the qth text heterogeneous node,

is the node feature of the pth text heterogeneous node.

Among them, based on the weight value and each target text heterogeneous node, the node feature of the current text heterogeneous node is updated, including:

The initial update relation is called to update the node features of the current text heterogeneous nodes; the initial update relation can be expressed as:

In the formula,

dimensional matrix,

In order to make the technical solution of the present application more clearly understood by those skilled in the art, the present application assumes that the text to be searched is a recipe text, and the recipe text includes cooking step data, which can be referred to as steps, and the cooking steps have a sequence. The generation process of the entire text feature is described below:

In this embodiment, text features are constructed into a graph structure, which includes nodes, node features and connection relationships. As shown in FIG2 , the text features extracted from the first type of text data are

i＝1,2,3,4; the text features extracted from the second type of text data are

i＝1, 2, 3, 4. Each text feature extracted from the first type of text data and each text feature extracted from the second type of text data are used as nodes of the graph structure, and each text feature, that is, the connection relationship between each node, e ₁₁ , e ₃₂ , e ₃₃ , is the connection relationship of the graph structure. Since the text to be searched only contains one type of text data, that is, one type of text feature is obtained, in order to construct a heterogeneous graph network, the present application can extract features from the image to be searched as another type of node feature. The image to be searched in this embodiment is a recipe step diagram. First, a step diagram data set is generated through multiple recipe step sample diagrams, and the main components of some recipe step sample diagrams are annotated, such as flour, sugar, papaya, etc. The ResNet50 network is trained using the annotated recipe step sample diagram to classify the main components of the image. The image to be searched, that is, the recipe step diagram to be searched, is input into the trained ResNet50 network to obtain the main component information of the recipe step diagram to be searched, that is, the corresponding target recognition feature. The components and steps are different from structure to nature, so they are called heterogeneous nodes. In this embodiment, each step is called a node, and similarly, each component is called a node. A node is composed of a sentence or a phrase. In this embodiment, the Bert model can be used to extract the features of each sentence or each word, and the implementation method is as follows:

The principal component information extracted from all recipe text connections is input from the text information at the bottom, and the position information and data type accompanying the recipe text information and the principal component information are also input. Position information means that if there are 5 words "peel and slice the mango" in a sentence, their position information is "1, 2, 3, 4, 5" respectively. The data type means: if the input is step data, its data type is 1; if the input is component data, its data type is 2. Through the BERT model, the encoding features of each sentence and each word can be obtained. This feature is used to represent the node features, namely the component node features and the step node features. The component node features and the step node features are both high-dimensional vectors with dimensions of

Dimension (d-dimensional real vector). After determining the node features, if the principal component exists in the operation step, the component node and the step node need to be connected by an edge, that is, there is a connection relationship between the two nodes. Optionally, the step information can be traversed through the text comparison method, each step text can be extracted, and then the principal component can be searched in turn. If the word in the principal component appears in the step, an edge is connected between the step and the principal component, that is, there is a connection relationship. By traversing all step texts, the connection relationship between the step node and the component node can be constructed, that is, the connection relationship of the heterogeneous graph. After the heterogeneous graph is established, the heterogeneous graph information update can use the graph attention network to realize feature aggregation and update. The update method is to traverse each heterogeneous node in turn for update. The aggregation and extraction of text features are realized by heterogeneous graph operations. The calculation method can be as follows:

First, update the step node.

is the node feature of the qth node of the step node,

Represents the feature of the pth node of the component node. If the qth node of the step node is connected (edge) to the pth node of the component node, the feature of the pth node of the component node is used to update the qth node feature of the step node. During the update process, the correlation between the nodes needs to be considered. In this embodiment, the correlation between the nodes can be represented by assigning weights. Optionally, the following relationship (1) can be used to calculate the correlation weight z _qp between the qth node of the step node and the pth node feature of the component node. For each step node, for example

Traverse all component nodes that have edges connected to it, assuming there are N _P nodes, and get the corresponding relevant weight z _qp .

Among them, _Wa , _Wb , and _Wc are known

dimensional matrix,

Represents matrix multiplication, aka vector mapping.

After updating each step node, the relevant weights of all component nodes of the edges connected to the step node can be normalized, that is, the following relationship (2) can be called to obtain the normalized relevant weight a _qp :

Where _aqp represents the normalized weight of the qth node of the step node and the pth node of the component node, 1 represents the first component node, exp represents the exponential function, and exp( _zqp ) represents the exponential function of _zqp .

It represents the sum of the relevant weights of all the component nodes of the edges connected to the step node. Finally, the node features of the step node are updated by the normalized relevant weights, that is, the following relationship (3) is called for calculation:

Where σ represents a hyperparameter in the interval [0, 1]. _{W v} is

dimensional matrix,

It is the new feature vector updated by the component nodes connected to it.

Furthermore, based on the idea of residual network, the updated

Compared with the initial features before

Addition:

Similarly, we can use equation (5) to perform the same calculation and update on the component nodes.

for

Updated features:

in,

is the normalized weight of the k-th layer network of the q-th node of the step node and the p-th node feature of the component node,

is the trainable weight matrix of the k-th layer network, and N _Q is the set of N neighbor nodes of the q-th node of the step node.

After traversing all component nodes and step nodes, the network update of one layer of the graph attention network is completed. Usually, T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network. The update method of the node features of each layer is as above. Usually, an integrated fully connected layer is added after each layer of the graph attention network to realize the re-encoding of node features (including component nodes and step nodes), as shown in the following relationship (6):

FFN stands for fully connected layer.

Represents the initial node features of the graph attention network at layer t+1.

As mentioned above, the update of the node features is completed. In order to realize the retrieval of recipe images, it is also necessary to summarize and synthesize the features of all text nodes, such as operation steps and ingredient information. Since the step node integrates the ingredient node information, the ingredient node is updated through the graph neural network, and the relevant step node features are emphasized in the form of keywords. After obtaining the features of each text, the BiLSTM (Bi-directional Long Short-Term Memory) method can be used to further mine the temporal information of the step node, realize the induction and synthesis of the text node features, and package them into a vector.

In this embodiment, the following equations (7) and (8) can be used to extract the timing information features of all step nodes:

Among them, the left and right arrows represent the direction of LSTM encoding, that is, the forward and reverse encoding of step node features.

represents the output of the qth unit in BiLSTM, and the different directions of the arrows represent the BiLSTM encoding output obtained according to the different order of step node input. Similarly,

represents the output of the q-1th unit in the BiLSTM, that is, the output of the previous state. Assume that there are Q steps in the recipe.

is 0,

Represents the features of the qth step node of the Tth layer of the graph neural network. According to the order and reverse order of the steps, they are input into the corresponding BiLSTM network in sequence, and finally the BiLSTM encoding of all step nodes is obtained, as shown in the following relation (9):

After obtaining the output of all BiLSTM units, the output of the entire text feature can be obtained by summing and averaging. Among them, e _rec represents the output of the text feature, which is used for the next step of retrieval. The e _rec feature is fused with the dish name title feature e _rec = [e _rec , e _ttl ], where [] represents feature concatenation, that is, the features are connected end to end. The e _rec feature will finally pass through a fully connected layer for feature mapping, that is, e _rec = fc(e _rec ), to obtain a vector of a new dimension, that is, the text feature information of the recipe text, which is used as the encoding feature for matching with the recipe image.

The above embodiment does not limit how to perform step S103. Based on the above embodiment, the present application also provides an optional implementation method, including the following contents:

Similarly, in order to improve the model performance, the image heterogeneous graph network may include multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node features of the current image heterogeneous node are updated according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image encoding features of the text to be searched are generated; the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.

Among them, the image feature extraction model is used to extract the original image features of the image to be searched and the image sample, which can be extracted based on any existing image feature extraction model, which does not affect the implementation of this application. As for the graph operation of the image heterogeneous graph network, it can be implemented based on the graph operation method of the text heterogeneous graph network provided in the above embodiment, and it will not be repeated here. The image targeted by this embodiment is an image containing a group of images, and the image feature generation model is used to integrate all image features of the image to be searched.

Similarly, in order to make the technical personnel in the relevant field more clearly understand the technical solution of the present application, this embodiment takes the image to be searched as a recipe step diagram as an example to illustrate the generation process of the entire image feature:

First, we can use the ResNet backbone network to extract the original image features of each recipe step diagram, obtain the features of the ResNet network before the classification layer as the features of each image, and use the features to construct the image nodes of the image heterogeneous graph network, denoted as

Ingredients are the ingredients of a dish, and are uniformly represented by ingredients below. The main ingredients of the dish in this embodiment are obtained by classifying the recipe step diagram to obtain category labels. The dish has as many ingredients as the number of category labels obtained through image classification. For example, scrambled eggs with tomatoes includes labels such as tomatoes, eggs, and oil. As shown in Figure 3, the image heterogeneous graph network contains nodes and relationships. The following row

Represents the component node, which is the classification label for the image from the image classification network. For each category label, such as mango, we input it into the BERT network model to obtain the encoding features of each category word phrase, which represents the node feature. The establishment of the relationship is still established through the classification network. If the image classification result has this category, the step image feature will establish an edge with the component. As shown in Figure 3, mango appears in all step images, so all step images will establish edges with it. Above, the nodes and edges are established. The following is how to use the image heterogeneous graph network for calculation to obtain the corresponding image features:

First, update the step node.

is the node feature of the mth node of the step graph node,

Represents the feature of the nth node of the component node. If the mth node of the step graph node is connected (edge) to the nth node of the component node, the feature of the nth node of the component node is used to update the feature of the mth node of the step graph node. During the update process, the correlation between the nodes needs to be considered. In this embodiment, the correlation between the nodes can be represented by assigning weights. Optionally, the following relationship (10) can be called to calculate the correlation weight z _mn between the mth node of the step graph node and the nth node feature of the component node. For each step graph node, for example

Traverse all component nodes that have edges connected to it, assuming there are N _N nodes, and get the corresponding relevant weight z _mn .

Among them, _Wd , _We , _Wf are

dimensional matrix,

represents matrix multiplication, that is, vector mapping,

Represents matrix multiplication and also represents vector mapping.

After updating each step graph node, the relevant weights of all component nodes of the edges connected to the step graph nodes can be normalized, that is, the following relationship (11) can be called to obtain the normalized relevant weights a _mn :

In the formula, exp represents the exponential function.

It represents the sum of the relevant weights of all the component nodes of the edges connected to the step graph node. Finally, the node features of the step graph node are updated by the normalized relevant weights, that is, the following relation (12) is called for calculation:

in,

represents the node features of the updated step graph nodes, σ represents the hyperparameter, in the interval [0, 1]. W _v is

dimensional matrix,

It is the new feature vector updated by the component nodes connected to it.

Furthermore, based on the idea of residual network, the updated

Compared with the initial features before

Addition:

Similarly, N _M represents the common M step graph nodes connected to the component node, and the relationship (14) can be called to perform the same calculation and update on the component node:

Where a _mn represents the normalized weight of the mth node feature of the step node and the nth node feature of the component node, a _qp represents the normalized weight of the qth node feature of the step node and the pth node feature of the component node,

represents the initial features before the update,

represents the updated features,

Represents matrix multiplication, which is

Mapped to W _v ,

represents the trainable weight matrix of the k-th layer network,

Represents matrix multiplication, which is

Map to

After traversing all component nodes and step nodes, the network update of one layer of the graph attention network is completed. Usually, T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network. The update method of the node features of each layer is as above. Usually, an integrated fully connected layer is added after each layer of the graph attention network to realize the re-encoding of node features (including component nodes and step graph nodes), as shown in the following relationship (15):

FFN stands for fully connected layer.

After providing the image heterogeneous graph network to obtain the image features of the recipe step graph, the image features can be input into the long short-term memory neural network LSTM to obtain the overall features of the recipe step image, that is, the relationship

Obtained. Among them, LSTM represents each unit of the LSTM network.

Represents the output of the mth LSTM unit.

represents the recipe step graph feature, which comes from the heterogeneous graph node feature of the last layer, and m represents the mth image. Accordingly, the feature encoding output of the last LSTM unit is used as the feature output of the recipe step graph e _csi , that is,

Based on the above embodiment, this embodiment further provides a training method for a bidirectional search model of image data and text data, see FIG4 , which may include the following contents:

S401: Pre-build a bidirectional image-text search model;

S402: For each group of training samples in the training sample set, respectively obtain original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples.

The training sample set of this step includes multiple groups of training samples, each group of training samples includes a corresponding text sample and an image sample, that is, the text sample and the image sample are a set of sample data that match each other. The number of training sample groups contained in the training sample set can be determined according to the actual training needs and the actual application scenarios, and this application does not impose any restrictions on this. The text samples in the training sample set can be obtained from any existing database, and the image samples corresponding to the text samples can be obtained from the corresponding database. Of course, in order to expand the number of training sample sets. The text sample or image text can also be the data after the original text sample or image text sample is processed by cropping, splicing, stretching, etc.

S403: Based on using the target recognition feature and the target text feature as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition feature and the target text feature, a text heterogeneous graph network of the graph-text bidirectional search model is constructed;

S404: Based on using the original image features and the target recognition features as image heterogeneous node features respectively, and determining the connection edge according to the correlation relationship between the target recognition features and the original image features, an image heterogeneous graph network of the image-text bidirectional search model is constructed;

S405: Input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.

In this embodiment, the text feature information of a text sample corresponds to the image feature of an image sample. During the model training process, a loss function is used to guide the training of the model, and then the network parameters of the image-text bidirectional search model are updated by methods such as gradient backpropagation until the model training conditions are met, such as reaching the number of iterations or the convergence effect is good. For example, the training process of the image-text bidirectional search model may include a forward propagation stage and a backpropagation stage. The forward propagation stage is the stage in which data is propagated from a low level to a high level, and the backpropagation stage is the stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not match the expectation. First, initialize all network layer weights, such as random initialization; then input image features and text feature information through the forward propagation of each layer such as the graph neural network, convolution layer, downsampling layer, and fully connected layer to obtain the output value; calculate the model output value of the image-text bidirectional search model, and calculate the loss value of the output value based on the loss function. The error is backpropagated back to the image-text bidirectional search model, and the backpropagation errors of each part of the image-text bidirectional search model such as the graph neural network layer, the fully connected layer, and the convolution layer are obtained in turn. Each layer of the image-text bidirectional search model adjusts all weight coefficients of the image-text bidirectional search model according to the back propagation error of each layer to update the weight. Randomly select a new batch of image features and text feature information, and then repeat the above process to obtain the output value of the network forward propagation. Infinite reciprocating iterations, when the error between the calculated model output value and the target value (that is, the label) is less than the preset threshold, or the number of iterations exceeds the preset number of iterations, the model training ends. All layer parameters of the model corresponding to the end of model training are used as the network parameters of the trained image-text bidirectional search model.

In order to improve the model training accuracy, this embodiment also provides an optional implementation method of a loss function, that is, based on the text features and corresponding image features of each group of training samples, a loss function is called to guide the training process of the image-text bidirectional search model; the loss function can be expressed as:

In the formula,

is the loss function, min d() is used to represent the function of calculating the minimum distance, y _n is

and

The category label _of

and

The category label, N is the number of training sample groups, the model training is traversed N times, N represents the total number of paired samples in this batch. First, the image group features

Traverse (a total of N), and the selected image samples are called

a represents anchor (anchor sample). The text feature encoding paired with the anchor sample is recorded as

p represents positive. Similarly, in this batch

The unpaired text features are recorded as

is a hyperparameter, which is fixed during training, for example, set to 0.3. Similarly, the same traversal operation is performed for text features.

Represents the sample selected in the traversal, and its corresponding positive image group feature sample is recorded as

The non-corresponding

is a hyperparameter.

The same steps and similar steps as those in the above-mentioned embodiments can be found in the implementation methods described in the above-mentioned embodiments, and will not be described in detail here.

It should be noted that there is no strict order of execution between the steps in the present application. As long as they conform to the logical order, these steps can be executed simultaneously or in a preset order. Figures 1 and 4 are only a schematic diagram and do not mean that this is the only execution order.

The embodiment of the present application also provides a corresponding device for the image-text bidirectional search method and the image-text matching model training method, which further makes the method more practical. Among them, the device can be described from the perspective of functional modules and hardware. The image-text bidirectional search device and the image-text matching model training device provided in the embodiment of the present application are introduced below. The image-text bidirectional search device and the image-text matching model training device described below can be referenced to each other with the image-text bidirectional search method and the image-text matching model training method described above.

From the perspective of functional modules, please refer to FIG. 5 , which is a structural diagram of a bidirectional image-text search device provided in an embodiment of the present application in one implementation manner. The device may include:

The image recognition module 501 is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain the target recognition features of the target image block contained in each sub-image of the image to be searched;

A text feature extraction module 502 is used to obtain text features of a text to be searched that only contains one type of target text data based on a text heterogeneous graph network of a text-text bidirectional search model; the target text features corresponding to the target text data include target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

An image feature extraction module 503 is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;

The bidirectional search module 504 is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results; the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.

Optionally, in some implementations of the present embodiment, the above-mentioned text feature extraction module 502 can also be used to: obtain text features of the text to be searched that only contains one type of target text data, including: responding to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences; inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-category node features; inputting each text sentence into the text feature extraction model to obtain multiple second-category node features.

As an optional implementation of the above embodiment, the above text feature extraction module 502 may also include a feature extraction unit for building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a bidirectional encoder based on a converter; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as a text feature extraction model.

As another optional implementation of the above embodiment, the above text feature extraction module 502 may also include a position input unit for inputting the position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence into the text feature extraction model.

As another optional implementation of the above embodiment, the above text feature extraction module 502 may also include an identification processing unit for obtaining the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model; the data type includes a first identification for identifying the target recognition feature, and a second identification for identifying the target text data.

As another optional implementation of the above embodiment, the above text feature extraction module 502 may further include an edge connection determination unit, which is used to traverse each text sentence of the target text data in turn for each text phrase or text word in the target recognition feature; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text phrase have a connection relationship; if the target word contained in the current text sentence is the same as the current text word, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text word have a connection relationship.

Optionally, as an optional implementation of the above embodiment, the above image recognition module 501 can also be used to pre-train an image recognition network using a target training sample set that annotates corresponding target recognition features in an image sample containing multiple sub-images; input the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.

As an optional implementation of the above embodiment, the target recognition network structure includes an input layer, a convolution structure, a pooling layer and a classifier; the convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation components; the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier; the classifier is used to calculate the target feature vector and output the probability of the category label.

Optionally, in some other implementations of the present embodiment, the above-mentioned text feature extraction module 502 may also include a graph operation unit, which is used for a text heterogeneous graph network including multiple layers of first graph attention networks, and each layer of the first graph attention network is also integrated with a first fully connected layer; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes, the node feature of the current text heterogeneous node is updated; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.

As an optional implementation of the above embodiment, the above graph operation unit can also be used to: determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type; calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association between the node features of the current text heterogeneous node and the node features of each target text heterogeneous node, and determine the weight value of the current text heterogeneous node according to each initial weight value; update the node feature of the current text heterogeneous node based on the weight value and each target text heterogeneous node, and use the sum of the node feature of the current text heterogeneous node after the update and the node feature before the update as the node feature of the current text heterogeneous node.

As another optional implementation of the above embodiment, the above graph operation unit can be further used to: call the weight calculation relationship to calculate the initial weight value of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation relationship is:

dimensional matrix,

is the node feature of the qth text heterogeneous node,

is the node feature of the pth text heterogeneous node.

As another optional implementation of the above embodiment, the above graph operation unit can be further used to: call the initial update relational expression to update the node features of the current text heterogeneous node; the initial update relational expression is:

In the formula,

is the updated node feature of the qth text heterogeneous node, σ is a hyperparameter, _aqp is the normalized weight of the qth node of the step node and the pth node feature of the component node, _Wv is known

dimensional matrix,

Optionally, in some further implementations of the present embodiment, the above-mentioned text feature extraction module 502 may further include a timing feature extraction unit, which is used to have a sequential execution order between each second-category node feature, and input each second-category node feature and sequence information into a pre-trained timing feature extraction model to obtain timing information features; the timing information features are mapped to the text features through a fully connected layer.

As an optional implementation of the above embodiment, the above-mentioned timing feature extraction unit can be further used to: based on the sequence between each second-category node feature, input each second-category node feature into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the timing coding feature of each second-category node feature; determine the timing information feature according to the timing coding feature of each second-category node feature.

As another optional implementation of the above embodiment, the above time series feature extraction unit can be further used to: for each second-category node feature, call the positive sequence coding relationship formula, perform positive sequence coding on the current second-category node feature, and obtain a positive sequence coding feature; the positive sequence coding relationship formula is:

Where q∈[1,Q],

Optionally, in some implementations of the present embodiment, the above-mentioned image feature extraction module 503 can also be used for: the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is also integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, the node features of the current image heterogeneous node are updated; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image coding features of the text to be searched are generated; the image coding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.

Next, please refer to FIG. 6 , which is a structural diagram of a training device for an image-text matching model provided in an embodiment of the present application in one implementation manner, and the device may include:

The feature extraction module 601 is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set; the target text features include the target recognition features; the image samples include a group of sub-images;

Model building module 602, used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining connecting edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining connecting edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;

The model training module 603 is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.

The functions of the various functional modules of the image-text bidirectional search device and the image-text matching model training device in the embodiment of the present application can be implemented according to the method in the above-mentioned method embodiment. The implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.

It can be seen from the above that the embodiments of the present application can effectively improve the accuracy of two-way search between image data and text data.

The image-text bidirectional search device and the image-text matching model training device mentioned above are described from the perspective of functional modules. Furthermore, the present application also provides an image-text bidirectional search device, which is described from the perspective of hardware. Figure 7 is a structural schematic diagram of the image-text bidirectional search device provided in an embodiment of the present application under one implementation. As shown in Figure 7, the image-text bidirectional search device may include a memory 70 for storing computer programs; a processor 71 for implementing the steps of the image-text bidirectional search method and the image-text matching model training method mentioned in any of the above embodiments when executing the computer program. The human-computer interaction component 72 is used to receive the training sample set selection request, model training request, search request input by the user through the information input/information output interface, and to display the image-text search results to the user; the communication component 73 is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task.

The processor 71 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 71 may also be a controller, a microcontroller, a microprocessor or other data processing chip. The processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 71 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also known as CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 71 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.

The memory 70 may include one or more computer-readable storage media, and the computer non-volatile readable storage media may be non-transitory. The memory 70 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices. In some embodiments, the memory 70 may be an internal storage unit of the image-text bidirectional search device, such as a hard disk of a server. In other embodiments, the memory 70 may also be an external storage device of the image-text bidirectional search device, such as a plug-in hard disk equipped on a server, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. Further, the memory 70 may also include both an internal storage unit and an external storage device of the image-text bidirectional search device. The memory 70 may not only be used to store application software and various types of data installed in the image-text bidirectional search device, such as: the code of the program used and generated in the process of executing the image-text bidirectional search and the training process of the image-text matching model, but also be used to temporarily store data that has been output or is to be output. In this embodiment, the memory 70 is at least used to store the following computer program 701, wherein, after the computer program is loaded and executed by the processor 71, it can implement the relevant steps of the image-text bidirectional search method and the image-text matching model training method disclosed in any of the aforementioned embodiments. In addition, the resources stored in the memory 70 may also include an operating system 702 and data 703, etc., and the storage method may be temporary storage or permanent storage. Among them, the operating system 702 may include Windows, Unix, Linux, etc. The data 703 may include but is not limited to the data generated during the image-text bidirectional search process and the image-text matching model training process, as well as the data corresponding to the bidirectional search results, etc.

The human-computer interaction component 72 may include a display screen, an information input/output interface such as a keyboard or a mouse. The display screen and the information input/output interface belong to the user interface. The optional user interface may also include a standard wired interface, a wireless interface, etc. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an OLED (Organic Light-Emitting Diode) touch device, etc. The display may also be appropriately referred to as a display screen or a display unit, which is used to display information processed in the mutual retrieval device and to display a visual user interface. The communication component 73 may include a communication interface or a network interface, a communication bus, etc. The communication interface may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the image and text two-way search device and other devices. The communication bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, FIG7 is represented by only one thick line, but it does not mean that there is only one bus or one type of bus. In some embodiments, the mutual search device may also include a power supply 74 and a sensor 75 for implementing various functions. Those skilled in the art will appreciate that the structure shown in FIG7 does not constitute a limitation on the image-text bidirectional search device, and may include more or fewer components than shown in the figure.

Furthermore, the present embodiment does not limit the number of image-text bidirectional search devices, and it may be a method for training an image-text bidirectional search model and/or a method for training an image-text matching model that is jointly completed by multiple image-text bidirectional search devices. In a possible implementation, please refer to Figure 8, which is a schematic diagram of a hardware composition framework applicable to another method for training an image-text bidirectional search model and/or a method for training an image-text matching model provided in an embodiment of the present application. As can be seen from Figure 8, the hardware composition framework may include: a first image-text bidirectional search device 81 and a second image-text bidirectional search device 82, which are connected via a network.

In the embodiment of the present application, the hardware structure of the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can refer to the electronic device in FIG7. That is, it can be understood that there are two electronic devices in this embodiment, and the two exchange data. The trained image-text bidirectional search model shown in FIG9 can be pre-deployed in any device. Further, the embodiment of the present application does not limit the form of the network, that is, the network can be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.

Among them, the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can be the same electronic device, such as the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 are both servers; they can also be different types of electronic devices, for example, the first image-text bidirectional search device 81 can be a smart phone or other smart terminal, and the second image-text bidirectional search device 82 can be a server. In this embodiment, in order to improve the overall performance, the model training process and the trained image-text bidirectional search model can be pre-deployed on the end with high computing performance. That is, a server with strong computing power can be used as the second image-text bidirectional search device 82 to improve data processing efficiency and reliability, thereby improving the processing efficiency of model training and/or image-text bidirectional retrieval. At the same time, a low-cost and widely used smart phone is used as the first image-text bidirectional search device 81 to realize the interaction between the second image-text bidirectional search device 82 and the user. It can be understood that the interaction process can be, for example, that the smart phone obtains a training sample set from the server, obtains the labels of the training sample set, sends these labels to the server, and the server uses the obtained labels to perform subsequent model training steps. After generating the image-text bidirectional search model, the server obtains the search request sent by the smart phone. The search request is issued by the user and carries the data to be searched. After obtaining the search request, the server determines the data to be searched by parsing the search request, and calls the image-text bidirectional search model to perform corresponding processing on the data to be searched, obtains the corresponding search results, and feeds back the search results to the first image-text bidirectional search device 81.

The functions of the various functional modules of the image-text bidirectional search device in the embodiment of the present application can be implemented according to the method in the above method embodiment, and the implementation process can refer to the relevant description of the above method embodiment, which will not be repeated here.

It is understandable that if the image-text bidirectional search method in the above embodiment is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a non-volatile storage medium to execute all or part of the steps of the various embodiments of the present application. The aforementioned non-volatile storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, register, hard disk, multimedia card, card-type memory (such as SD or DX memory, etc.), magnetic memory, removable disk, CD-ROM, disk or optical disk, etc. Various media that can store program codes.

Based on this, an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method described in any of the above embodiments are performed.

In this specification, each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the embodiments can be referred to each other. As for the hardware disclosed in the embodiments, including devices and equipment, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part description.

Professionals may further appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in the above description according to function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals and technicians may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.

The above is a detailed introduction to a method, device, equipment and non-volatile readable storage medium for bidirectional search of images and texts provided by the present application. Specific examples are used herein to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method and core idea of the present application. It should be pointed out that for ordinary technicians in this technical field, without departing from the principles of the present application, several improvements and modifications can be made to the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.

Claims

A method for bidirectional search of images and texts, characterized by comprising:

Pre-training a bidirectional image-text search model; the bidirectional image-text search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network;

Calling the image recognition network to obtain target recognition features of the target image blocks contained in each sub-image of the image to be searched;

Based on the text heterogeneous graph network, text features of the text to be searched that only contain one type of target text data are obtained; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

Based on the image heterogeneous graph network, image features of an image to be searched including a group of sub-images are obtained; the original image features of the image to be searched and the target recognition features are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;

The image features and the text features are input into the image-text bidirectional search model to obtain image-text search results.
The image-text bidirectional search method according to claim 1 is characterized in that after the image-text bidirectional search model is pre-trained, it also includes:

In response to the text splitting instruction, the target recognition feature is split into a plurality of text phrases and/or text words, and the target text data is split into a plurality of text sentences;

Inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain a plurality of first-category node features;

Each text sentence is input into the text feature extraction model to obtain a plurality of second-category node features.
The image-text bidirectional search method according to claim 2 is characterized in that, before obtaining the text features of the text to be searched that only contains one type of target text data, it also includes:

Build a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a bidirectional encoder based on a converter;

The language representation model is trained using a natural language text sample data set, and the trained language representation model is used as a text feature extraction model.
The image-text bidirectional search method according to claim 2, characterized in that the inputting of each text sentence into the text feature extraction model comprises:

The position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence is input into the text feature extraction model.
The image-text bidirectional search method according to claim 2 is characterized in that before inputting each text phrase and/or text word into a pre-built text feature extraction model to obtain a plurality of first-class node features, and before inputting each text sentence into the text feature extraction model to obtain a plurality of second-class node features, it further comprises:

Acquire the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model;

The data type includes a first identifier for identifying the target identification feature and a second identifier for identifying the target text data.
The image-text bidirectional search method according to claim 2 is characterized in that the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition feature and the target text feature, including:

For each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in sequence;

If the target phrase included in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase;

If the target word included in the current text sentence is the same as the current text word, the second type of node feature corresponding to the current text sentence and the first type of node feature corresponding to the current text word have a connection relationship.
The image-text bidirectional search method according to claim 1 is characterized in that the calling of the image recognition network to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched comprises:

Preliminarily using a target training sample set in which corresponding target recognition features are annotated in an image sample containing a plurality of sub-images, an image recognition network is trained;

The image to be searched is input into the image recognition network to obtain target recognition features contained in each sub-image of the image to be searched.
The image-text bidirectional search method according to claim 7 is characterized in that, before training the image recognition network using the target training sample set in which the corresponding target recognition features are annotated in the image sample containing the plurality of sub-images, the method further comprises:

Pre-constructing a target recognition network structure, wherein the target recognition network structure includes an input layer, a convolution structure, a pooling layer, and a classifier;

The convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple convolution layers, which are used to perform convolution calculation on the output features of the basic operation component;

The pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;

The classifier is used to calculate the target feature vector and output the probability of the category label.
The image-text bidirectional search method according to claim 1 is characterized in that the text heterogeneous graph network includes multiple layers of first graph attention networks, and each layer of the first graph attention network is further integrated with a first fully connected layer; the step of obtaining text features of the text to be searched that contains only one type of target text data includes:

For each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes, updating the node feature of the current text heterogeneous node;

Based on the node features of each text heterogeneous node in the updated text heterogeneous graph network, the text features of the text to be searched are generated.
The image-text bidirectional search method according to claim 9 is characterized in that the updating of the node feature of the current text heterogeneous node according to whether the current text heterogeneous node has a connection relationship with the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes comprises:

Determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type;

Based on the correlation between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node, the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;

Based on the weight value and each target text heterogeneous node, the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
The image-text bidirectional search method according to claim 10 is characterized in that the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated based on the association relationship between the node feature of the current text heterogeneous node and the node feature of each target text heterogeneous node, including:

The weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula is:

Where zqp is the initial weight value of the qth text heterogeneous node and the pth text heterogeneous node, LeakyReLU() is the activation function, and Wa , Wb , and Wc are known
dimensional matrix,
is the node feature of the qth text heterogeneous node,
is the node feature of the pth text heterogeneous node.
The image-text bidirectional search method according to claim 10 is characterized in that the updating of node features of the current text heterogeneous nodes based on the weight value and each target text heterogeneous node comprises:

The initial update relational expression is called to update the node features of the heterogeneous nodes of the current text; the initial update relational expression is:

In the formula,
is the updated node feature of the qth text heterogeneous node, σ is a hyperparameter, aqp is the normalized weight of the qth node of the step node and the pth node of the component node, and Wv is the known
dimensional matrix,
is the node feature of the pth text heterogeneous node, and NP is the total number of target text heterogeneous nodes.
The image-text bidirectional search method according to any one of claims 1 to 12 is characterized in that the second-class node features corresponding to the target text data have a sequential execution order, and after obtaining the text features of the text to be searched containing only one class of target text data based on the text heterogeneous graph network, the method further comprises:

Input each second-category node feature and sequence information into a pre-trained time series feature extraction model to obtain time series information features;

The temporal information features are mapped to the text features through a fully connected layer.
The image-text bidirectional search method according to claim 13 is characterized in that the step of inputting each second-type node feature and sequence information into a pre-trained temporal feature extraction model to obtain temporal information features comprises:

Based on the sequence between the features of each second-category node, the features of each second-category node are input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second-category node feature;

The timing information feature is determined according to the characteristic timing coding feature of each second-category node.
The image-text bidirectional search method according to claim 13 is characterized in that, based on the sequence between the second-type node features, the second-type node features are sequentially and reversely input into the bidirectional long short-term memory neural network to obtain the temporal coding features of the second-type node features, including:

For each second-category node feature, the positive sequence coding relational expression is called to perform positive sequence coding on the current second-category node feature to obtain a positive sequence coding feature; the positive sequence coding relational expression is:

The reverse order coding relational formula is called to perform forward coding on the current second type node feature to obtain a reverse order coding feature; the reverse order coding relational formula is:

Using the positive sequence coding feature and the reverse sequence coding feature as the time sequence coding feature of the current second type node feature;

Where q∈[1,Q],
is the output of the qth unit in the forward encoding direction of the bidirectional long short-term memory neural network,
is the qth second-category node feature of the Tth layer graph attention network in the text heterogeneous graph network,
is the output of the q-1th unit in the forward encoding direction of the bidirectional long short-term memory neural network, Q is the total number of second-category node features,
is the output of the qth unit in the backward encoding direction of the bidirectional long short-term memory neural network,
is the output of the q+1th unit in the backward encoding direction of the bidirectional long short-term memory neural network,
is the backward encoding function of the bidirectional long short-term memory neural network,
is the forward encoding function of the bidirectional long short-term memory neural network.
The image-text bidirectional search method according to claim 1 is characterized in that the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is followed by a second fully connected layer; the step of obtaining image features of an image to be searched including a group of sub-images comprises:

Inputting the image to be searched into a pre-trained image feature extraction model to obtain original image features of the image to be searched;

For each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, updating the node feature of the current image heterogeneous node;

Generate image encoding features of the text to be searched based on the node features of each image heterogeneous node of the updated image heterogeneous graph network;

The image coding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
A training method for an image-text matching model, comprising:

Pre-build a bidirectional image and text search model;

For each group of training samples in the training sample set, original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples are obtained respectively; the target text features include the target recognition features; the image samples include a group of sub-images;

Based on taking the target recognition feature and the target text feature as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition feature and the target text feature, a text heterogeneous graph network of the image-text bidirectional search model is constructed;

Based on taking the original image features and the target recognition features as image heterogeneous node features respectively, and determining the connection edge according to the correlation relationship between the target recognition features and the original image features, an image heterogeneous graph network of the image-text bidirectional search model is constructed;

The image features of each group of training samples are input into the image heterogeneous graph network, and the text features are input into the text heterogeneous graph network to train the image-text bidirectional search model.
A bidirectional image and text search device, characterized by comprising:

An image recognition module is used to call a pre-trained image recognition network of a bidirectional image-text search model to obtain target recognition features of a target image block contained in each sub-image of the image to be searched;

A text feature extraction module is used to obtain text features of a text to be searched that only contains one type of target text data based on the text heterogeneous graph network of the text-image bidirectional search model; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

An image feature extraction module is used to obtain image features of an image to be searched including a group of sub-images based on the image heterogeneous graph network of the image-text bidirectional search model; the original image features of the image to be searched and the target recognition features are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;

The bidirectional search module is used to input the image features and the text features into a pre-trained image-text bidirectional search model to obtain image-text search results; the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
A training device for an image-text matching model, comprising:

A feature extraction module is used to obtain, for each group of training samples in the training sample set, original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples; the target text features include the target recognition features; the image samples include a group of sub-images;

A model building module is used to pre-build a bidirectional image-text search model; based on taking the target recognition feature and the target text feature as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition feature and the target text feature, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on taking the original image feature and the target recognition feature as image heterogeneous node features respectively, and determining the connection edge according to the association relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;

The model training module is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
A bidirectional image and text search device, characterized by comprising a processor, a memory, a human-computer interaction component and a communication component;

The human-computer interaction component is used to receive a training sample set selection request, a model training request, a search request input by a user through an information input/information output interface, and to display graphic and text search results to the user;

The communication component is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task;

The processor is used to implement the steps of the image-text bidirectional search method as described in any one of claims 1 or 16 and/or the image-text matching model training method as described in claim 17 when executing the computer program stored in the memory.
A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method as described in any one of claims 1 or 16 and/or the image-text matching model training method as described in claim 17 are implemented.