WO2024098533A1 - Method, apparatus and device for bidirectional image-text search, and non-volatile readable storage medium - Google Patents

Method, apparatus and device for bidirectional image-text search, and non-volatile readable storage medium

Info

Publication number
WO2024098533A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
features
node
heterogeneous
Prior art date
Application number
PCT/CN2022/142513
Other languages
English (en)
Chinese (zh)
Inventor
李仁刚
王立
范宝余
郭振华
Original Assignee
苏州元脑智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024098533A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/19007Matching; Proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of information retrieval technology, and in particular to a method, device, equipment and non-volatile readable storage medium for bidirectional search of images and texts.
  • the present application provides a method, device, equipment and non-volatile readable storage medium for bidirectional search between image data and text data, which effectively improves the accuracy of bidirectional search between image data and text data.
  • a first aspect of an embodiment of the present application provides a method for bidirectional search of images and texts, including:
  • the bidirectional image-text search model includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network;
  • text features of the text to be searched that only contains one type of target text data are obtained;
  • the target text features corresponding to the target text data include target recognition features;
  • the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • image features of the image to be searched including a group of sub-images are obtained; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;
  • the image features and text features are input into the image-text bidirectional search model to obtain the image-text search results.
  • after pre-training the image-text bidirectional search model, the method further includes:
  • the target recognition feature is split into a plurality of text phrases and/or text words, and the target text data is split into a plurality of text sentences;
  • Each text sentence is input into the text feature extraction model to obtain multiple second-category node features.
  • the following further includes:
  • the language representation model includes a text information input layer, a feature extraction layer, and a text feature output layer;
  • the feature extraction layer is a bidirectional encoder based on Transformers;
  • the language representation model is trained using a natural language text sample dataset, and the trained language representation model is used as a text feature extraction model.
  • each text sentence is input into a text feature extraction model, including:
  • the position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence is input into the text feature extraction model.
  • before inputting each text phrase and/or text word into a pre-built text feature extraction model to obtain a plurality of first-category node features, and before inputting each text sentence into the text feature extraction model to obtain a plurality of second-category node features, the method further includes:
  • the data type includes a first identifier for identifying a target identification feature and a second identifier for identifying target text data.
  • connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition feature and the target text feature, including:
  • the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase
  • the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word.
  • obtaining target recognition features of target image blocks contained in each sub-image of the image to be searched includes:
  • an image recognition network is trained in advance using a target training sample set in which corresponding target recognition features are annotated in image samples containing a plurality of sub-images;
  • the image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • before training the image recognition network using the target training sample set in which corresponding target recognition features are annotated in the image samples including the plurality of sub-images, the method further includes:
  • Pre-build the target recognition network structure, which includes an input layer, a convolution structure, a pooling layer and a classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes multiple connected residual blocks, each residual block includes multiple convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;
  • the pooling layer is used to convert the output features of the convolution structure into target feature vectors and transmit them to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
  • the text heterogeneous graph network includes multiple layers of first graph attention networks, and each layer of the first graph attention network is further integrated with a first fully connected layer; obtaining text features of the text to be searched that only contains one type of target text data includes:
  • the node feature of the current text heterogeneous node is updated;
  • the node feature of the current text heterogeneous node is updated, including:
  • the initial weight values of the current text heterogeneous node and each target text heterogeneous node are calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;
  • the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
  • the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated, including:
  • the weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula is:
  • z_qp is the initial weight value between the q-th text heterogeneous node and the p-th text heterogeneous node; LeakyReLU() is the activation function; W_a, W_b, and W_c are known dimensional matrices.
  • the node feature of the current text heterogeneous node is updated, including:
  • the formula involves a hyperparameter; a_qp is the normalized weight of the q-th step node and the p-th component node; W_v is a known dimensional matrix; N_P is the total number of target text heterogeneous nodes.
  • the second-class node features corresponding to the target text data have a sequential execution order, and based on the text heterogeneous graph network, after obtaining the text features of the text to be searched that only contains one class of target text data, the method further includes:
  • the time series information features are mapped to the text features through the fully connected layer.
  • each second-category node feature and sequence information is input into a pre-trained time series feature extraction model to obtain time series information features, including:
  • the features of each second type of node are input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second type of node feature;
  • the temporal information feature is determined according to the temporal coding feature of each second-category node feature.
  • each second-category node feature is input into a bidirectional long short-term memory neural network in order and in reverse order to obtain a temporal coding feature of each second-category node feature, including:
  • the positive sequence encoding relation is called to perform positive sequence encoding on the current second-class node feature to obtain the positive sequence encoding feature;
  • the positive sequence encoding relation is:
  • the reverse coding relational formula is called to encode the current second-category node feature in reverse order to obtain the reverse coding feature; the reverse coding relational formula is:
  • the forward coding feature and the reverse coding feature are used as the temporal coding features of the current second type of node features
  • q ∈ [1, Q] indexes the output of the q-th unit in the forward encoding direction of the bidirectional long short-term memory neural network; Q is the total number of second-category node features; the forward and backward encoding functions are those of the bidirectional long short-term memory neural network.
  • the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; obtaining image features of an image to be searched including a group of sub-images, including:
  • for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node feature of the current image heterogeneous node is updated according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes;
  • the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • a second aspect of the embodiment of the present application provides a device for bidirectional search of images and texts, including:
  • An image recognition module is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain target recognition features of the target image block contained in each sub-image of the image to be searched;
  • a text feature extraction module is used for obtaining text features of a text to be searched that contains only one type of target text data based on a text heterogeneous graph network of a bidirectional search model for text and images; target text features corresponding to the target text data include target recognition features; target recognition features and target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • An image feature extraction module is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;
  • the bidirectional search module is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results;
  • the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
  • a third aspect of the present application embodiment provides a method for training an image-text matching model, comprising:
  • a text heterogeneous graph network of the image-text bidirectional search model is constructed;
  • an image heterogeneous graph network of the image-text bidirectional search model is constructed;
  • the image features of each group of training samples are input into the image heterogeneous graph network, and the text features are input into the text heterogeneous graph network to train the image-text bidirectional search model.
  • a fourth aspect of the embodiments of the present application provides a training device for an image-text matching model, comprising:
  • the feature extraction module is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set;
  • the target text features include the target recognition features;
  • the image samples include a group of sub-images;
  • a model building module is used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining the connection edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining the connection edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
  • the model training module is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • a fifth aspect of the embodiment of the present application further provides a bidirectional image-text search device, including a processor, a memory, a human-computer interaction component, and a communication component;
  • the human-computer interaction component is used to receive training sample set selection requests, model training requests, and search requests input by users through the information input/information output interface, and to display the image-text search results to users;
  • the communication component is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task;
  • the processor is used to execute the computer program stored in the memory to implement the steps of any of the above-mentioned image-text bidirectional search methods and/or the above-mentioned image-text matching model training method.
  • the sixth aspect of the embodiment of the present application also provides a non-volatile readable storage medium on which a computer program is stored; when the computer program is executed, the steps of any of the previous image-text bidirectional search methods and/or the previous image-text matching model training method are implemented.
  • a graph neural network for extracting corresponding features is constructed based on the data contained in a text containing only one type of text data and an image containing a group of sub-images and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation relationships in the real world, and image features that reflect the images and their internal correlation relationships in the real world.
  • Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation relationship between the fine-grained features of images and texts, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.
  • the embodiments of the present application also provide a training method for an image-text matching model and a corresponding implementation device, an image-text bidirectional search device and a non-volatile readable storage medium for the image-text bidirectional search method, thereby further making the image-text bidirectional search method more practical, and the image-text bidirectional search method, device, equipment and non-volatile readable storage medium have corresponding advantages.
  • FIG1 is a schematic diagram of a flow chart of a method for bidirectional image and text search provided by an embodiment of the present application
  • FIG2 is a schematic diagram of a text heterogeneous graph network structure provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of an image heterogeneous graph network structure provided in an embodiment of the present application.
  • FIG4 is a flow chart of a method for training an image-text matching model provided in an embodiment of the present application.
  • FIG5 is a structural diagram of an implementation of a cross-media retrieval device provided in an embodiment of the present application.
  • FIG6 is a structural diagram of an implementation of a training device for an image-text matching model provided in an embodiment of the present application.
  • FIG7 is a structural diagram of an implementation of a bidirectional image-text search device provided in an embodiment of the present application.
  • FIG8 is a structural diagram of another implementation of a bidirectional image-text search device provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application.
  • FIG. 1 is a flow chart of a method for bidirectional image-text search provided by an embodiment of the present application.
  • the embodiment of the present application may include the following contents:
  • the bidirectional image-text search model of this embodiment is used to perform bidirectional image-text search tasks between text data and image data, that is, image data matching the text data to be searched can be determined from a known image database based on the text data to be searched, and text data matching the image data to be searched can also be determined from a known text database based on the image data to be searched.
  • the bidirectional image-text search model of this embodiment includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network; the text heterogeneous graph network is used to process input text data such as text samples or text to be searched and finally output text features corresponding to the text data, and the image heterogeneous graph network is used to process input image data such as image samples or images to be searched, and output the final image features of the image data.
  • the text heterogeneous graph network and the image heterogeneous graph network can be built based on any graph structure in the related art, which does not affect the implementation of this application.
  • the image recognition network is used to identify the category information of a certain type of image block in an image such as an image to be searched and an image sample used in the training model process, that is, the final output is the recognition label information corresponding to the specified recognition target included in the input image, which is called the target recognition feature for the convenience of description.
  • S102 calling an image recognition network to obtain target recognition features of a target image block contained in each sub-image of the image to be searched.
  • the image to be searched and the subsequent image samples of this embodiment each include a group of sub-images, that is, a group of sub-images together constitute the image to be searched. For example, the image to be searched may be a recipe step image, where each step corresponds to a sub-image and the recipe step image includes the sub-images corresponding to each step.
  • the image blocks containing a certain type of designated information of the corresponding text data in the image to be searched are called target image blocks, and the identification information of these target image blocks is the target identification feature, that is, the target identification feature is the label information of the target image block in the image to be searched or the image sample, and the label information belongs to this type of designated information.
  • the designated information can be the recipe ingredients
  • the target image block is the image block that identifies the recipe ingredients
  • the target identification feature is the recipe ingredient information to which each target image block belongs
  • the designated information is the product structure of the electronic device
  • the target image block is the image block that identifies the product structure
  • the target identification feature is the identification information that the target image block belongs to a certain type of product structure, such as a switch or indicator light.
  • S103 Based on the text heterogeneous graph network, obtain text features of the text to be searched that only contains one type of target text data.
  • the text of the present application includes the text to be searched and the text samples in the training sample set used in the subsequent model training process, which only contain one type of text data.
  • the so-called one type of text data refers to the data in the text being of the same type.
  • the recipe text may include three types of text data: dish name, recipe ingredients, and cooking steps.
  • the text to be searched and the text samples of the present application can only contain one type of text data.
  • this type of text may include two types of text data, namely, server structure composition and working principle.
  • the text to be searched and the text samples of the present application can only contain one type of text data, that is, the text to be searched and the text samples only include the working principle of the server.
  • the corresponding text features are obtained by calculating the text heterogeneous graph network based on the text to be searched.
  • the text features of this embodiment refer to the features obtained after performing graph structure operations on the text heterogeneous graph network, and the target text features are the data obtained by directly extracting the text to be searched using the text feature extraction method.
  • the target text feature corresponding to the target text data includes the target recognition feature.
  • the so-called inclusion relationship means that the target recognition feature exists in the target text feature corresponding to the target text data.
  • the target recognition feature represents the recipe ingredients, and the target text feature represents the cooking steps; taking the electronic device manual as an example, the target recognition feature can be the product structure of the electronic device, and the target text feature can be the instruction manual. There is an inclusion relationship between the target text feature and the target recognition feature of this embodiment.
  • the target recognition feature is composed of the recognition features corresponding to multiple target image blocks of each sub-image.
  • the recognition feature of each target image block of each sub-image can be called a first-class node feature
  • the target text feature is composed of multiple text features, each of which is called a second-class node feature.
  • for a specified first-class node feature, if it is included in a second-class node feature, then the first-class node feature has an association relationship with that second-class node feature. After obtaining the target text features of the text to be searched and the target recognition features of the image to be searched, each second-class node feature of the target text features is analyzed to determine whether it contains one or several first-class node features of the target recognition features, so the correlation between the target recognition features and the target text features can be determined.
  • these two different types of features are used as heterogeneous node features of the graph structure network, and the connection edges of the graph structure network can be determined according to whether there is an inclusion relationship between different node features, that is, the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features.
  • the features corresponding to the graph structure can be extracted by performing graph structure operations, and this type of features is used as the text features in this step.
  • the image heterogeneous graph network of this step also includes nodes and connecting edges.
  • the nodes of the image heterogeneous graph network of this embodiment are heterogeneous nodes, that is, there are at least two features with different properties and structures.
  • the extracted original image features alone can only serve as one type of node feature. Since the image features and text features have an associated correspondence, the target recognition features extracted in S102 can also be used as node features of the image heterogeneous graph network.
  • the first-class node feature can be used as the heterogeneous node feature of the image heterogeneous graph network, that is, the original image features of the image to be searched and the target recognition features are used as the node features of the image heterogeneous graph network.
  • the connecting edges of the image heterogeneous graph network are determined by the association between the target recognition features and the original image features.
  • the original image features refer to the image features extracted directly using image feature extraction methods such as convolutional neural networks, VGG16 (Visual Geometry Group network), ResNet (deep residual network), etc.
  • the image features in this step are obtained by substituting the image features of each sub-image of the image to be searched into the image heterogeneous graph network and performing graph structure operations on the image heterogeneous graph network.
  • S105 Input the image features and text features into the image-text bidirectional search model to obtain image-text search results.
  • the image-text search result of this embodiment refers to the matching degree of the text features extracted in step S103 and the image features extracted in step S104, that is, after the text features and the image features are input into the image-text bidirectional search model, the image-text bidirectional search model can determine whether the features are close by calculating the vector distance such as the Euclidean distance. If they are close, the image to be searched and the text to be searched are matched, that is, the image to be searched and the text to be searched are a set of data corresponding to each other. If they are not close, the image to be searched and the text to be searched are not matched.
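As an illustration of this matching step, the sketch below (assuming PyTorch and a hypothetical distance threshold, neither of which is specified in the application) compares a text feature vector and an image feature vector by Euclidean distance:

```python
import torch

def match_score(image_feature: torch.Tensor, text_feature: torch.Tensor,
                threshold: float = 1.0) -> bool:
    """Hypothetical matching rule: the features 'match' when their
    Euclidean distance falls below a chosen threshold."""
    distance = torch.norm(image_feature - text_feature, p=2)
    return bool(distance < threshold)

# Example with d-dimensional features (d is illustrative).
d = 512
e_img = torch.randn(d)
e_rec = torch.randn(d)
print(match_score(e_img, e_rec))
```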
  • graph neural networks for extracting corresponding features are constructed based on the data contained in the text and image and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation in the real world, and image features that reflect the image and its internal correlation in the real world.
  • Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation between the fine-grained features of the image and the text, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.
  • the present application also provides an optional extraction implementation method of the target identification features, which may include:
  • An image recognition network is trained in advance using a target training sample set in which corresponding target recognition features are annotated in an image sample containing multiple sub-images; the image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • the image recognition network is used to identify the category information of the target image block in the image to be searched, and the target training sample set contains multiple images marked with target features, that is, each image sample contained in the target training sample set carries a category label.
  • Each image can be an image directly obtained from the original database, or it can be an image obtained by flipping, resizing, stretching, etc. the original image, which does not affect the implementation of the present application.
  • the image recognition network can be built based on any existing model structure that can recognize image categories, such as convolutional neural networks, artificial neural networks, etc., and the present application does not impose any restrictions on this.
  • the target recognition network structure may include an input layer, a convolution structure, a pooling layer and a classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;
  • the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
  • the present application takes recipe text and recipe image as examples to illustrate the implementation process of the present embodiment, that is, the process of classifying the main components of each recipe image through an image classification network and constructing component nodes with the classified category information may include:
  • a step diagram dataset is generated through multiple recipe step diagrams, and the main components of some recipe step diagrams are annotated, such as flour, sugar, papaya, etc.
  • the annotated recipe step diagrams are used to train the ResNet50 network to classify the main components of the image.
  • the ResNet50 network structure can include seven parts. The first part does not contain residual blocks, and mainly performs convolution, regularization, activation function, and maximum pooling calculations on the input. The second, third, fourth, and fifth parts of the structure all contain residual blocks. Each residual block contains three layers of convolution. After the convolution calculation of the first five parts, the pooling layer converts it into a feature vector. Finally, the classifier calculates this feature vector and outputs the category probability.
  • the trained ResNet50 network can obtain the main component information of the input image very well.
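A minimal sketch of such a component classifier, assuming PyTorch/torchvision and a hypothetical number of ingredient classes; it fine-tunes a standard ResNet50 head rather than reproducing the exact network described in the application:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_INGREDIENTS = 50  # hypothetical number of annotated component classes

# ResNet50: stem (conv, normalization, activation, max pooling), four residual
# stages, global pooling, classifier - matching the seven-part description above.
model = models.resnet50()
model.fc = nn.Linear(model.fc.in_features, NUM_INGREDIENTS)

criterion = nn.CrossEntropyLoss()  # single-label for simplicity; a multi-label loss would also fit
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised step on annotated recipe step images (N, 3, H, W)."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```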
  • obtaining the target text features (the second type of features) from the text to be searched requires a text feature extraction operation.
  • the present application also provides an optional implementation of text features, which may include the following contents:
  • the target recognition feature is split into multiple text phrases and/or text words, and the target text data is split into multiple text sentences; each text phrase and/or text word is input into a pre-trained text feature extraction model to obtain multiple first-category node features; each text sentence is input into the text feature extraction model to obtain multiple second-category node features.
  • the text splitting instruction is used to split the text to be searched into multiple text sentences, and the target recognition feature is split into multiple text phrases or text words, and any text data splitting algorithm can be used.
  • the method for determining each connection edge in the text heterogeneous graph network can be: for each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in turn; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase; if the target word contained in the current text sentence is the same as the current text word, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word.
  • the text feature extraction model of this embodiment is used to extract text features from input text data or target recognition features.
  • the training process of the text feature extraction model is: building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a bidirectional encoder based on Transformers; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as the text feature extraction model.
  • the language representation model may be, for example, Bert (Bidirectional Encoder Representation from Transformers, a pre-trained language representation model) or word2vec (word to vector, a word vector model), which does not affect the implementation of the present application.
  • the data type may also be set for the text data at the same time, and the data type includes a first identifier for identifying the target recognition feature and a second identifier for identifying the target text data or the target text feature.
  • the data type of the data to be input into the text feature extraction model at the next moment is obtained, and the position information of each text sentence and each phrase and word contained in each text sentence in the current text sentence may also be input into the text feature extraction model.
  • the data type is input into the text feature extraction model together with the corresponding data.
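The sketch below illustrates one way such a text feature extraction model could be used, assuming the Hugging Face transformers BERT implementation; mapping the step/ingredient data types onto BERT's two segment ids is an illustrative choice, not something specified in the application:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def node_feature(text: str, data_type: int) -> torch.Tensor:
    """Encode one step sentence or ingredient phrase into a node feature.
    BERT's two segment ids (0/1) stand in for the first/second data-type
    identifiers; position ids are supplied automatically by the model."""
    enc = tokenizer(text, return_tensors="pt")
    token_type_ids = torch.full_like(enc["input_ids"], data_type)
    out = bert(input_ids=enc["input_ids"],
               attention_mask=enc["attention_mask"],
               token_type_ids=token_type_ids)
    return out.last_hidden_state[:, 0, :]   # [CLS] vector as the d-dimensional node feature

step_feat = node_feature("peel and slice the mango", data_type=0)   # step sentence
ingredient_feat = node_feature("mango", data_type=1)                # ingredient phrase
```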
  • the second-category node features can be obtained by performing feature extraction on the target text data of the text to be searched.
  • the present application further performs temporal feature extraction and provides a method for extracting temporal features, which may include the following contents:
  • each second-category node feature and sequence information are input into a pre-trained temporal feature extraction model to obtain temporal information features.
  • the temporal feature extraction model can be a bidirectional long short-term memory neural network. Accordingly, based on the sequence between each second-category node feature, each second-category node feature can be input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second-category node feature; the temporal information features are determined according to the temporal coding features of each second-category node feature.
  • the temporal coding features can include forward coding features and reverse coding features.
  • the extracted temporal information features can be mapped to the text features through a fully connected layer.
  • the acquisition of forward coding features and reverse coding features can be carried out by the following method: the forward coding relationship can be called to perform forward coding on the current second-category node feature to obtain the forward coding feature; the forward coding relationship can be expressed as:
  • the reverse coding relation is called to encode the current second-category node feature in reverse order to obtain the reverse coding feature;
  • the reverse coding relation can be expressed as:
  • q ∈ [1, Q] indexes the output of the q-th unit in the forward encoding direction of the bidirectional long short-term memory neural network; Q is the total number of second-category node features; the forward and backward encoding functions are those of the bidirectional long short-term memory neural network.
  • this embodiment can also be implemented based on a long short-term memory neural network.
  • a recurrence relation can be called for q ∈ [1, Q] to obtain the time series feature information, where one term represents the output of the q-th unit of the LSTM and another represents the output of the (q-1)-th unit, that is, the output of the previous state.
  • the above embodiment does not limit how to generate text features based on the text heterogeneous graph network.
  • the extraction of text features is obtained through heterogeneous graph operations, and heterogeneous graph operations are also the process of updating the nodes of the text heterogeneous graph network.
  • This embodiment provides an optional implementation method, which may include the following contents:
  • the embodiment may stack multiple layers of the same structure.
  • each layer may be called a first graph attention network, and a first fully connected layer is also integrated after each layer of the first graph attention network; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node feature of the current text heterogeneous node is updated according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
  • the process of updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node has a connection relationship with the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes may include:
  • the initial weight values of the current text heterogeneous node and each target text heterogeneous node are calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;
  • the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
  • the process of calculating the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association relationship between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node may include:
  • the weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula can be:
  • z_qp is the initial weight value between the q-th text heterogeneous node and the p-th text heterogeneous node; LeakyReLU() is the activation function; W_a, W_b, and W_c are known dimensional matrices (involving a d × d dimensional real matrix and a real vector); the remaining symbols are the node features of the q-th and the p-th text heterogeneous nodes, respectively.
  • the node feature of the current text heterogeneous node is updated, including:
  • the initial update relation is called to update the node features of the current text heterogeneous nodes; the initial update relation can be expressed as:
  • the formula involves a hyperparameter; a_qp is the normalized weight of the q-th step node and the p-th component node; W_v is a known dimensional matrix; N_P is the total number of target text heterogeneous nodes.
  • the text to be searched is a recipe text
  • the recipe text includes cooking step data, which can be referred to as steps, and the cooking steps have a sequence.
  • the generation process of the entire text feature is described below:
  • text features are constructed into a graph structure, which includes nodes, node features and connection relationships.
  • Each text feature extracted from the first type of text data and each text feature extracted from the second type of text data is used as a node of the graph structure, and the connection relationships between the nodes (e_11, e_32, e_33) are the connection edges of the graph structure.
  • the image to be searched in this embodiment is a recipe step diagram.
  • a step diagram data set is generated through multiple recipe step sample diagrams, and the main components of some recipe step sample diagrams are annotated, such as flour, sugar, papaya, etc.
  • the ResNet50 network is trained using the annotated recipe step sample diagram to classify the main components of the image.
  • the image to be searched, that is, the recipe step diagram to be searched, is input into the trained ResNet50 network to obtain the main component information of the recipe step diagram to be searched, that is, the corresponding target recognition features.
  • the components and steps differ in both structure and nature, so they are called heterogeneous nodes.
  • each step is called a node, and similarly, each component is called a node.
  • a node is composed of a sentence or a phrase.
  • the Bert model can be used to extract the features of each sentence or each word, and the implementation method is as follows:
  • the recipe text and the principal component information extracted from it are concatenated and input through the text information input layer at the bottom, and the position information and data type accompanying the recipe text information and the principal component information are also input.
  • Position information means that if there are 5 words "peel and slice the mango" in a sentence, their position information is "1, 2, 3, 4, 5" respectively.
  • the data type means: if the input is step data, its data type is 1; if the input is component data, its data type is 2.
  • This feature is used to represent the node features, namely the component node features and the step node features.
  • the component node features and the step node features are both high-dimensional vectors of dimension d (d-dimensional real vectors).
  • the step information can be traversed through the text comparison method, each step text can be extracted, and then the principal component can be searched in turn. If the word in the principal component appears in the step, an edge is connected between the step and the principal component, that is, there is a connection relationship.
  • the connection relationship between the step node and the component node can be constructed, that is, the connection relationship of the heterogeneous graph.
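A small sketch of this edge-construction step, with hypothetical step sentences and ingredient words; any text comparison method would do:

```python
from typing import List, Tuple

def build_text_edges(steps: List[str], ingredients: List[str]) -> List[Tuple[int, int]]:
    """Connect step node q to ingredient node p whenever the ingredient
    word or phrase appears in the step sentence (the inclusion relationship)."""
    edges = []
    for q, step in enumerate(steps):
        for p, ingredient in enumerate(ingredients):
            if ingredient.lower() in step.lower():
                edges.append((q, p))
    return edges

steps = ["peel and slice the mango", "mix flour and sugar", "bake for 20 minutes"]
ingredients = ["mango", "flour", "sugar"]
print(build_text_edges(steps, ingredients))   # [(0, 0), (1, 1), (1, 2)]
```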
  • the heterogeneous graph information update can use the graph attention network to realize feature aggregation and update.
  • the update method is to traverse each heterogeneous node in turn for update.
  • the aggregation and extraction of text features are realized by heterogeneous graph operations.
  • the calculation method can be as follows:
  • to update the step nodes: one symbol denotes the node feature of the q-th step node and another denotes the feature of the p-th component node; if the q-th step node is connected by an edge to the p-th component node, the feature of the p-th component node is used to update the feature of the q-th step node.
  • the correlation between the nodes needs to be considered.
  • the correlation between the nodes can be represented by assigning weights.
  • the following relationship (1) can be used to calculate the correlation weight z_qp between the q-th step node and the p-th component node feature; for each step node, all component nodes that have edges connected to it are traversed (assuming there are N_P such nodes) to obtain the corresponding correlation weights z_qp.
  • W_a, W_b, and W_c are known dimensional matrices; applying them represents matrix multiplication, i.e., vector mapping.
  • the relevant weights of all component nodes of the edges connected to the step node can be normalized, that is, the following relationship (2) can be called to obtain the normalized relevant weight a qp :
  • a_qp represents the normalized weight of the q-th step node and the p-th component node; the summation index 1 represents the first component node; exp represents the exponential function, and exp(z_qp) is the exponential of z_qp; the denominator is the sum of the correlation weights of all component nodes connected by edges to the step node.
  • W_v is a known dimensional matrix, and the result is the new feature vector of the step node updated by the component nodes connected to it.
  • N_Q is the set of the N neighbor nodes of the q-th step node.
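Since relations (1)-(3) appear only as images in this text, the sketch below assumes a standard graph-attention form for the weight, the softmax normalization, and the residual update of a step node; the exact formulas in the application may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512
W_a = nn.Linear(2 * d, 1, bias=False)   # scoring vector (assumed role of W_a)
W_b = nn.Linear(d, d, bias=False)       # maps the step-node feature
W_c = nn.Linear(d, d, bias=False)       # maps the component-node feature
W_v = nn.Linear(d, d, bias=False)       # value projection used in the update

def update_step_node(e_q: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """e_q: (d,) feature of step node q; neighbors: (N_P, d) features of the
    component nodes connected to it by an edge."""
    q_rep = W_b(e_q).expand(neighbors.size(0), -1)                            # (N_P, d)
    p_rep = W_c(neighbors)                                                    # (N_P, d)
    z_qp = F.leaky_relu(W_a(torch.cat([q_rep, p_rep], dim=-1))).squeeze(-1)   # correlation weights
    a_qp = torch.softmax(z_qp, dim=0)                                         # normalization, cf. relation (2)
    aggregated = (a_qp.unsqueeze(-1) * W_v(neighbors)).sum(dim=0)             # weighted aggregation
    return e_q + aggregated   # sum of updated and original feature (residual)
```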
  • the network update of one layer of the graph attention network is completed.
  • T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network.
  • the update method of the node features of each layer is as above.
  • an integrated fully connected layer is added after each layer of the graph attention network to realize the re-encoding of node features (including component nodes and step nodes), as shown in the following relationship (6):
  • FFN stands for the fully connected layer, and the result represents the initial node features of the graph attention network at layer t+1.
  • the update of the node features is completed.
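A minimal sketch of the per-layer fully connected re-encoding described by relation (6), assuming a simple two-layer feed-forward block:

```python
import torch
import torch.nn as nn

d = 512
ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

def reencode(node_features: torch.Tensor) -> torch.Tensor:
    """Assumed form of relation (6): the updated node features of layer t are
    passed through a fully connected block to give the initial node features
    of the graph attention network at layer t+1."""
    return ffn(node_features)
```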
  • the step node integrates the ingredient node information
  • the ingredient node is updated through the graph neural network, and the relevant step node features are emphasized in the form of keywords.
  • the BiLSTM (Bi-directional Long Short-Term Memory) method can be used to further mine the temporal information of the step node, realize the induction and synthesis of the text node features, and package them into a vector.
  • the left and right arrows represent the direction of LSTM encoding, that is, the forward and reverse encoding of step node features.
  • the different directions of the arrows represent the BiLSTM encoding output obtained according to the different order of step node input.
  • the output of the entire text feature can be obtained by summing and averaging.
  • e rec represents the output of the text feature, which is used for the next step of retrieval.
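A small sketch of how the step-node temporal codes might be packaged into the final text vector e_rec by averaging, following the "summing and averaging" described above; the exact pooling used in the application is not reproduced here:

```python
import torch

def text_output(temporal_codes: torch.Tensor) -> torch.Tensor:
    """temporal_codes: (Q, 2*hidden) BiLSTM outputs of the step nodes.
    Averaging over the steps packages them into the single retrieval
    vector e_rec."""
    return temporal_codes.mean(dim=0)
```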
  • the image heterogeneous graph network may include multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node features of the current image heterogeneous node are updated according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image encoding features of the text to be searched are generated; the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • the image feature extraction model is used to extract the original image features of the image to be searched and the image sample, which can be extracted based on any existing image feature extraction model, which does not affect the implementation of this application.
  • the graph operation of the image heterogeneous graph network it can be implemented based on the graph operation method of the text heterogeneous graph network provided in the above embodiment, and it will not be repeated here.
  • the image targeted by this embodiment is an image containing a group of sub-images, and the image feature generation model is used to integrate all image features of the image to be searched.
  • this embodiment takes the image to be searched as a recipe step diagram as an example to illustrate the generation process of the entire image feature:
  • the ResNet backbone network can be used to extract the original image features of each recipe step diagram: the features of the ResNet network before the classification layer are taken as the features of each image and used to construct the image nodes of the image heterogeneous graph network. Ingredients are the components of a dish and are uniformly referred to as ingredients below.
  • the main ingredients of the dish in this embodiment are obtained by classifying the recipe step diagram to obtain category labels.
  • the dish has as many ingredients as the number of category labels obtained through image classification. For example, scrambled eggs with tomatoes includes labels such as tomatoes, eggs, and oil.
  • the image heterogeneous graph network contains two kinds of nodes and the relationships between them: the step-image nodes described above, and the component nodes, which correspond to the classification labels produced for the images by the image classification network; each category label (such as mango) corresponds to one component node.
  • the edges are likewise established through the classification network: if the classification result of a step image contains a given category, an edge is created between that step image feature and the corresponding component node. As shown in Figure 3, mango appears in all step images, so every step image establishes an edge with the mango component node. With the nodes and edges established as above (see the construction sketch below), the following describes how the image heterogeneous graph network is used for computation to obtain the corresponding image features:
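  • A minimal sketch of the node/edge construction just described; the label names, feature values and the shape of the classifier output are hypothetical placeholders:

```python
# Build the image heterogeneous graph: step-image nodes, ingredient (component)
# nodes, and an edge whenever the classifier predicts that ingredient for a step.
def build_image_graph(step_node_feats, ingredient_feats, predicted_labels):
    # step_node_feats:   list of per-image feature vectors (step-image nodes)
    # ingredient_feats:  dict {label: feature vector} (component nodes)
    # predicted_labels:  list of label lists, one per step image
    edges = []
    ingredient_index = {label: n for n, label in enumerate(ingredient_feats)}
    for m, labels in enumerate(predicted_labels):
        for label in labels:
            if label in ingredient_index:
                edges.append((m, ingredient_index[label]))   # step m <-> ingredient n
    return edges

# usage sketch: "mango" is predicted for every step image, so it connects to all steps
edges = build_image_graph(
    step_node_feats=[[0.1] * 4] * 3,
    ingredient_feats={"mango": [0.2] * 4, "sugar": [0.3] * 4},
    predicted_labels=[["mango"], ["mango", "sugar"], ["mango"]],
)
print(edges)   # [(0, 0), (1, 0), (1, 1), (2, 0)]
```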
  • to update the step nodes, let f_m denote the node feature of the m-th step-graph node and g_n denote the feature of the n-th component node. If the m-th step-graph node is connected by an edge to the n-th component node, the feature of the n-th component node is used to update the feature of the m-th step-graph node.
  • the correlation between the nodes needs to be considered. In this embodiment, the correlation between the nodes can be represented by assigning weights.
  • the following relationship (10) can be called to calculate the correlation weight z_mn between the m-th step-graph node feature and the n-th component node feature: the two node features are projected by the trainable matrices W_d and W_f (W_d being a trainable dimensional matrix and W_f being applied by matrix multiplication) and combined into the scalar weight z_mn, analogously to the weight calculation relationship used for the text heterogeneous graph network.
  • the relevant weights of all component nodes connected by edges to a step-graph node can then be normalized, that is, the following relationship (11) can be called to obtain the normalized relevant weights a_mn: a_mn = exp(z_mn) / Σ_{n'∈N_m} exp(z_mn'), where exp represents the exponential function and the denominator represents the sum of the relevant weights of all the component nodes connected by edges to the step-graph node. Finally, the node features of the step-graph node are updated with the normalized relevant weights, that is, the following relationship (12) is called: f̂_m = Σ_{n∈N_m} a_mn · W_v g_n, where W_v is a trainable dimensional matrix and f̂_m is the new feature vector of the step-graph node updated from the component nodes connected to it.
  • N_M represents the set of M step-graph nodes connected to a component node, and relationship (14) can be called to perform the same calculation and update on the component nodes: ĝ_n = Σ_{m∈N_M} a_mn · W_v f_m. Here a_mn represents the normalized weight between the m-th step-node feature and the n-th component-node feature (a_qp denotes the analogous normalized weight between the q-th and p-th node features), the product with W_v is a matrix multiplication that maps the node feature into the update space, and W_v represents the trainable weight matrix of the k-th layer network.
  • the network update of one layer of the graph attention network is completed.
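  • A simplified NumPy illustration of the one-layer update just described (attention-weighted aggregation over connected nodes); the dot-product form of the raw weight, the identity matrices and the residual addition are assumptions made for this sketch, while the exact weight form used by this application is the one given in relationships (10)-(12):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def update_step_nodes(step_feats, comp_feats, edges, W_d, W_f, W_v):
    """Sketch of one attention update: each step node aggregates the
    component nodes it is connected to, weighted by normalized scores."""
    new_feats = step_feats.copy()
    for m in range(len(step_feats)):
        neighbors = [n for (s, n) in edges if s == m]
        if not neighbors:
            continue
        # raw correlation weights z_mn from the two projections (cf. relationship (10))
        z = np.array([np.dot(W_d @ step_feats[m], W_f @ comp_feats[n])
                      for n in neighbors])
        a = softmax(z)                      # normalized weights a_mn (cf. relationship (11))
        # weighted aggregation of connected component nodes (cf. relationship (12))
        agg = sum(a_i * (W_v @ comp_feats[n]) for a_i, n in zip(a, neighbors))
        new_feats[m] = step_feats[m] + agg  # keep the original feature via a residual sum
    return new_feats

# usage sketch
d = 4
step_feats = np.random.randn(3, d)
comp_feats = np.random.randn(2, d)
edges = [(0, 0), (1, 0), (1, 1), (2, 0)]
W_d = W_f = W_v = np.eye(d)
updated = update_step_nodes(step_feats, comp_feats, edges, W_d, W_f, W_v)
```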
  • T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network.
  • the update method of the node features of each layer is as above.
  • an integrated fully connected layer is added after each layer of the graph attention network to re-encode the node features (including the component nodes and the step-graph nodes), as shown in relationship (15): e^(t+1) = FFN(ê^(t)), where FFN stands for the fully connected layer and e^(t+1) represents the initial node features of the graph attention network at layer t+1.
  • the image features can then be input into the long short-term memory neural network LSTM to obtain the overall feature of the recipe step images, that is, h_m = LSTM(f_m^(T), h_{m-1}), where LSTM represents each unit of the LSTM network, h_m represents the output of the m-th LSTM unit, f_m^(T) represents the recipe step-graph feature of the m-th image taken from the heterogeneous graph node features of the last layer, and m indexes the m-th image. Accordingly, the feature encoding output of the last LSTM unit is used as the feature output of the recipe step graph, i.e. e_csi = h_M.
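  • A short sketch of this step (PyTorch, with hypothetical dimensions): the last-layer step-graph node features are fed through an LSTM and the output of the last unit is taken as e_csi:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)

# final-layer heterogeneous-graph features of the M step images: (batch, M, 512)
step_graph_feats = torch.randn(1, 5, 512)
outputs, _ = lstm(step_graph_feats)
e_csi = outputs[:, -1, :]   # output of the last LSTM unit = recipe step-graph feature
```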
  • this embodiment further provides a training method for a bidirectional search model of image data and text data, see FIG4 , which may include the following contents:
  • S402 For each group of training samples in the training sample set, respectively obtain original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples.
  • the training sample set of this step includes multiple groups of training samples, each group of training samples includes a corresponding text sample and an image sample, that is, the text sample and the image sample are a set of sample data that match each other.
  • the number of training sample groups contained in the training sample set can be determined according to the actual training needs and the actual application scenarios, and this application does not impose any restrictions on this.
  • the text samples in the training sample set can be obtained from any existing database, and the image samples corresponding to the text samples can be obtained from the corresponding database. Of course, in order to expand the training sample set, the text sample or image sample can also be data obtained by processing the original text sample or original image sample through cropping, splicing, stretching, etc.
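  • For example, the cropping/stretching style of image-sample expansion mentioned above could be realized with standard image transforms; the specific transforms and parameters below are assumed only for illustration:

```python
from torchvision import transforms

# Hypothetical augmentation pipeline used to expand the image training samples.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),    # cropping
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.2)),   # stretching-like scaling
    transforms.ToTensor(),
])
```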
  • S405 Input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • the text feature information of a text sample corresponds to the image feature of an image sample.
  • a loss function is used to guide the training of the model, and the network parameters of the image-text bidirectional search model are then updated by methods such as gradient backpropagation until the model training conditions are met, for example, the set number of iterations is reached or good convergence is achieved.
  • the training process of the image-text bidirectional search model may include a forward propagation stage and a backpropagation stage.
  • the forward propagation stage is the stage in which data is propagated from a low level to a high level
  • the backpropagation stage is the stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not match the expectation.
  • the error is backpropagated back to the image-text bidirectional search model, and the backpropagation errors of each part of the image-text bidirectional search model such as the graph neural network layer, the fully connected layer, and the convolution layer are obtained in turn.
  • Each layer of the image-text bidirectional search model adjusts all weight coefficients of the image-text bidirectional search model according to the back propagation error of each layer to update the weight. Randomly select a new batch of image features and text feature information, and then repeat the above process to obtain the output value of the network forward propagation.
  • once the training conditions are met, the model training ends, and all layer parameters of the model at the end of training are used as the network parameters of the trained image-text bidirectional search model; a minimal training-loop sketch follows below.
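  • The forward-propagation / back-propagation cycle described above corresponds to a standard training loop; the model, data loader, loss function and optimizer settings below are placeholders, not the specific configuration of this application:

```python
import torch

def train(model, loader, loss_fn, epochs=10, lr=1e-4):
    """Sketch of the training stages: forward pass, loss, backpropagation,
    weight update, repeated over randomly drawn batches until training ends."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for image_feats, text_feats in loader:                  # a new random batch
            img_emb, txt_emb = model(image_feats, text_feats)   # forward propagation
            loss = loss_fn(img_emb, txt_emb)                    # mismatch with expectation
            optimizer.zero_grad()
            loss.backward()                                     # backpropagate the error
            optimizer.step()                                    # update the weight coefficients
    return model
```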
  • this embodiment also provides an optional implementation of the loss function, that is, based on the text features and corresponding image features of each group of training samples, a loss function is called to guide the training process of the image-text bidirectional search model; the loss function can be expressed as a bidirectional triplet loss of the form L = Σ_{i=1}^{N} [ max(0, d(e_csi^a, e_rec^p) − d(e_csi^a, e_rec^n) + α) + max(0, d(e_rec^a, e_csi^p) − d(e_rec^a, e_csi^n) + α) ], where d(·,·) denotes the distance between two feature encodings.
  • N is the number of training sample groups: the model training traverses the N image group features, and N represents the total number of paired samples in the batch.
  • during the traversal over the image group features (N in total), the selected image sample is taken as the anchor, where the superscript a represents the anchor sample.
  • the text feature encoding paired with the anchor sample is recorded as the positive sample, where the superscript p represents positive.
  • an unpaired text feature is recorded as the negative sample (superscript n); α is a margin hyperparameter, which is fixed during training, for example set to 0.3.
  • the same traversal operation is performed for the text features: the text feature selected in the traversal is the anchor, its corresponding image group feature is the positive sample, a non-corresponding image group feature is the negative sample, and α is the same hyperparameter.
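  • A minimal sketch of a bidirectional triplet loss of the kind described above; Euclidean distance and in-batch negatives are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.3):
    """img_emb[i] and txt_emb[i] form a matched (anchor, positive) pair;
    every other sample in the batch serves as a negative."""
    n = img_emb.size(0)
    dist = torch.cdist(img_emb, txt_emb)          # (n, n) pairwise distances
    pos = dist.diag().unsqueeze(1)                # distance to the paired sample
    mask = ~torch.eye(n, dtype=torch.bool, device=img_emb.device)
    # image as anchor vs. unpaired texts, and text as anchor vs. unpaired images
    loss_i2t = F.relu(pos - dist + margin)[mask].mean()
    loss_t2i = F.relu(pos.t() - dist + margin)[mask].mean()
    return loss_i2t + loss_t2i

# usage sketch
loss = bidirectional_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
```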
  • the embodiment of the present application also provides a corresponding device for the image-text bidirectional search method and the image-text matching model training method, which further makes the method more practical.
  • the device can be described from the perspective of functional modules and hardware.
  • the image-text bidirectional search device and the image-text matching model training device provided in the embodiment of the present application are introduced below.
  • the image-text bidirectional search device and the image-text matching model training device described below can be referenced to each other with the image-text bidirectional search method and the image-text matching model training method described above.
  • FIG. 5 is a structural diagram of a bidirectional image-text search device provided in an embodiment of the present application in one implementation manner.
  • the device may include:
  • the image recognition module 501 is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain the target recognition features of the target image block contained in each sub-image of the image to be searched;
  • a text feature extraction module 502 is used to obtain text features of a text to be searched that only contains one type of target text data based on a text heterogeneous graph network of the image-text bidirectional search model; the target text features corresponding to the target text data include target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • An image feature extraction module 503 is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;
  • the bidirectional search module 504 is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results;
  • the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
  • the above-mentioned text feature extraction module 502 can also be used to: obtain text features of the text to be searched that only contains one type of target text data, including: responding to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences; inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-category node features; inputting each text sentence into the text feature extraction model to obtain multiple second-category node features.
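  • A sketch of this splitting-and-encoding step, using a BERT-style encoder as the text feature extraction model; the tokenizer/model names, the example texts and the use of the [CLS] vector are assumptions made for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    """Encode a list of phrases/words or sentences into node features
    (the [CLS] vector of each item is used as its node feature)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]         # (len(texts), hidden)

ingredients = ["tomato", "egg", "cooking oil"]                   # target recognition features
steps = ["Beat the eggs.", "Fry the tomatoes with the eggs."]    # target text data sentences

first_class_nodes = encode(ingredients)   # first-category node features
second_class_nodes = encode(steps)        # second-category node features
```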
  • the above text feature extraction module 502 may also include a feature extraction unit for building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a transformer-based bidirectional encoder; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as the text feature extraction model.
  • the above text feature extraction module 502 may also include a position input unit for inputting the position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence into the text feature extraction model.
  • the above text feature extraction module 502 may also include an identification processing unit for obtaining the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model; the data type includes a first identification for identifying the target recognition feature, and a second identification for identifying the target text data.
  • the above text feature extraction module 502 may further include an edge connection determination unit, which is used to traverse each text sentence of the target text data in turn for each text phrase or text word in the target recognition feature; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text phrase have a connection relationship; if the target word contained in the current text sentence is the same as the current text word, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text word have a connection relationship.
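  • The inclusion-based edge determination can be sketched as follows; simple case-insensitive substring matching and the example texts are assumptions, while the actual matching rule is the one described above:

```python
def build_text_edges(phrases_or_words, sentences):
    """Connect a first-category node (phrase/word) to a second-category node
    (sentence) when the sentence contains that phrase/word."""
    edges = []
    for i, term in enumerate(phrases_or_words):
        for j, sentence in enumerate(sentences):
            if term.lower() in sentence.lower():
                edges.append((i, j))
    return edges

edges = build_text_edges(
    ["egg", "tomato"],
    ["Beat the eggs.", "Fry the tomatoes with the eggs."],
)
print(edges)   # [(0, 0), (0, 1), (1, 1)]
```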
  • the above image recognition module 501 can also be used to pre-train an image recognition network using a target training sample set that annotates corresponding target recognition features in an image sample containing multiple sub-images; input the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • the target recognition network structure includes an input layer, a convolution structure, a pooling layer and a classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation components;
  • the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
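  • A compact ResNet-style sketch of such a recognition network; the channel sizes, block counts and label count below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: convolution layers whose output is added to the input."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class TargetRecognitionNet(nn.Module):
    def __init__(self, num_labels=100):
        super().__init__()
        # basic operation component: convolution -> regularization -> activation -> max pooling
        self.basic = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # residual operation component: several connected residual blocks
        self.residual = nn.Sequential(*[ResidualBlock(64) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool2d(1)          # pooling layer -> target feature vector
        self.classifier = nn.Linear(64, num_labels)  # outputs category label probabilities

    def forward(self, x):
        x = self.residual(self.basic(x))
        x = self.pool(x).flatten(1)
        return self.classifier(x).softmax(dim=-1)

probs = TargetRecognitionNet()(torch.randn(1, 3, 224, 224))  # (1, num_labels)
```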
  • the above-mentioned text feature extraction module 502 may also include a graph operation unit, which is used for a text heterogeneous graph network including multiple layers of first graph attention networks, and each layer of the first graph attention network is also integrated with a first fully connected layer; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes, the node feature of the current text heterogeneous node is updated; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
  • the above graph operation unit can also be used to: determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type; calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association between the node features of the current text heterogeneous node and the node features of each target text heterogeneous node, and determine the weight value of the current text heterogeneous node according to each initial weight value; update the node feature of the current text heterogeneous node based on the weight value and each target text heterogeneous node, and use the sum of the node feature of the current text heterogeneous node after the update and the node feature before the update as the node feature of the current text heterogeneous node.
  • the above graph operation unit can be further used to: call the weight calculation relationship to calculate the initial weight value of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation relationship is z_qp = LeakyReLU(W_a [W_b e_q || W_c e_p]), where z_qp is the initial weight value between the q-th text heterogeneous node and the p-th text heterogeneous node, LeakyReLU() is the activation function, || denotes feature concatenation, and W_a, W_b and W_c are known (trainable) dimensional matrices.
  • the above graph operation unit can be further used to: call the initial update relational expression to update the node features of the current text heterogeneous node; the initial update relational expression is ê_q = λ · Σ_{p=1}^{N_P} a_qp · W_v e_p, where λ is a hyperparameter, a_qp is the normalized weight between the q-th node feature (step-type node) and the p-th node feature (component-type node), W_v is a known dimensional matrix, and N_P is the total number of target text heterogeneous nodes.
  • the above-mentioned text feature extraction module 502 may further include a timing feature extraction unit, which is used, when there is a sequential execution order between the second-category node features, to input each second-category node feature and the sequence information into a pre-trained timing feature extraction model to obtain timing information features; the timing information features are mapped to the text features through a fully connected layer.
  • the above-mentioned timing feature extraction unit can be further used to: based on the sequence between each second-category node feature, input each second-category node feature into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the timing coding feature of each second-category node feature; determine the timing information feature according to the timing coding feature of each second-category node feature.
  • the above time series feature extraction unit can be further used to: for each second-category node feature, call the forward coding relationship to encode the current second-category node feature in forward order, obtaining the forward coding feature h_q^f = LSTM_f(e_q, h_{q−1}^f); and call the reverse coding relationship to encode the current second-category node feature in reverse order, obtaining the reverse coding feature h_q^b = LSTM_b(e_q, h_{q+1}^b).
  • the forward coding feature and the reverse coding feature together serve as the temporal coding features of the current second-category node feature.
  • here q ∈ [1, Q] indexes the output of the q-th unit in the corresponding encoding direction of the bidirectional long short-term memory neural network, Q is the total number of second-category node features, LSTM_f is the forward encoding function and LSTM_b is the backward encoding function of the bidirectional long short-term memory neural network.
  • the above-mentioned image feature extraction module 503 can also be used for: the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is also integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, the node features of the current image heterogeneous node are updated; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image coding features of the image to be searched are generated; the image coding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • FIG. 6 is a structural diagram of a training device for an image-text matching model provided in an embodiment of the present application in one implementation manner, and the device may include:
  • the feature extraction module 601 is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set; the target text features include the target recognition features; the image samples include a group of sub-images;
  • Model building module 602 used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining connecting edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining connecting edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
  • the model training module 603 is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • the functions of the various functional modules of the image-text bidirectional search device and the image-text matching model training device in the embodiment of the present application can be implemented according to the method in the above-mentioned method embodiment.
  • the implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.
  • FIG. 7 is a structural schematic diagram of the image-text bidirectional search device provided in an embodiment of the present application under one implementation.
  • the image-text bidirectional search device may include a memory 70 for storing computer programs; a processor 71 for implementing the steps of the image-text bidirectional search method and the image-text matching model training method mentioned in any of the above embodiments when executing the computer program.
  • the human-computer interaction component 72 is used to receive the training sample set selection request, model training request, search request input by the user through the information input/information output interface, and to display the image-text search results to the user;
  • the communication component 73 is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task.
  • the processor 71 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • the processor 71 may also be a controller, a microcontroller, a microprocessor or other data processing chip.
  • the processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 71 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the awake state, also known as CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 71 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 70 may include one or more computer-readable storage media, and the computer non-volatile readable storage media may be non-transitory.
  • the memory 70 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices.
  • the memory 70 may be an internal storage unit of the image-text bidirectional search device, such as a hard disk of a server.
  • the memory 70 may also be an external storage device of the image-text bidirectional search device, such as a plug-in hard disk equipped on a server, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the memory 70 may also include both an internal storage unit and an external storage device of the image-text bidirectional search device.
  • the memory 70 may not only be used to store application software and various types of data installed in the image-text bidirectional search device, such as: the code of the program used and generated in the process of executing the image-text bidirectional search and the training process of the image-text matching model, but also be used to temporarily store data that has been output or is to be output.
  • the memory 70 is at least used to store the following computer program 701, wherein, after the computer program is loaded and executed by the processor 71, it can implement the relevant steps of the image-text bidirectional search method and the image-text matching model training method disclosed in any of the aforementioned embodiments.
  • the resources stored in the memory 70 may also include an operating system 702 and data 703, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 702 may include Windows, Unix, Linux, etc.
  • the data 703 may include but is not limited to the data generated during the image-text bidirectional search process and the image-text matching model training process, as well as the data corresponding to the bidirectional search results, etc.
  • the human-computer interaction component 72 may include a display screen, an information input/output interface such as a keyboard or a mouse.
  • the display screen and the information input/output interface belong to the user interface.
  • the optional user interface may also include a standard wired interface, a wireless interface, etc.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display may also be appropriately referred to as a display screen or a display unit, which is used to display information processed in the mutual retrieval device and to display a visual user interface.
  • the communication component 73 may include a communication interface or a network interface, a communication bus, etc.
  • the communication interface may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the image and text two-way search device and other devices.
  • the communication bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • in FIG. 7 the bus is represented by only one thick line, but this does not mean that there is only one bus or only one type of bus.
  • the mutual search device may also include a power supply 74 and a sensor 75 for implementing various functions.
  • FIG. 7 does not constitute a limitation on the image-text bidirectional search device, which may include more or fewer components than shown in the figure.
  • the present embodiment does not limit the number of image-text bidirectional search devices; the image-text bidirectional search method and/or the image-text matching model training method may also be completed jointly by multiple image-text bidirectional search devices.
  • Figure 8 is a schematic diagram of a hardware composition framework applicable to another method for training an image-text bidirectional search model and/or a method for training an image-text matching model provided in an embodiment of the present application.
  • the hardware composition framework may include: a first image-text bidirectional search device 81 and a second image-text bidirectional search device 82, which are connected via a network.
  • the hardware structure of the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can refer to the electronic device in FIG7. That is, it can be understood that there are two electronic devices in this embodiment, and the two exchange data.
  • the trained image-text bidirectional search model shown in FIG9 can be pre-deployed in any device.
  • the embodiment of the present application does not limit the form of the network, that is, the network can be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.
  • the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can be the same electronic device, such as the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 are both servers; they can also be different types of electronic devices, for example, the first image-text bidirectional search device 81 can be a smart phone or other smart terminal, and the second image-text bidirectional search device 82 can be a server.
  • the model training process and the trained image-text bidirectional search model can be pre-deployed on the end with high computing performance.
  • a server with strong computing power can be used as the second image-text bidirectional search device 82 to improve data processing efficiency and reliability, thereby improving the processing efficiency of model training and/or image-text bidirectional retrieval.
  • a low-cost and widely used smart phone is used as the first image-text bidirectional search device 81 to realize the interaction between the second image-text bidirectional search device 82 and the user.
  • the interaction process can be, for example, that the smart phone obtains a training sample set from the server, obtains the labels of the training sample set, sends these labels to the server, and the server uses the obtained labels to perform subsequent model training steps.
  • after generating the image-text bidirectional search model, the server obtains the search request sent by the smart phone.
  • the search request is issued by the user and carries the data to be searched.
  • the server determines the data to be searched by parsing the search request, and calls the image-text bidirectional search model to perform corresponding processing on the data to be searched, obtains the corresponding search results, and feeds back the search results to the first image-text bidirectional search device 81.
  • if the image-text bidirectional search method in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium and is used to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned non-volatile storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, register, hard disk, multimedia card, card-type memory (such as SD or DX memory, etc.), magnetic memory, removable disk, CD-ROM, disk or optical disk, etc.
  • an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method described in any of the above embodiments are performed.
  • each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
  • the same or similar parts between the embodiments can be referred to each other.
  • the hardware disclosed in the embodiments including devices and equipment, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part description.


Abstract

Disclosed are an image-text bidirectional search method, apparatus and device, and a non-volatile readable storage medium, which are applied to the technical field of information retrieval. The method comprises: pre-training an image-text bidirectional search model, which comprises a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network; calling the image recognition network to acquire target recognition features of an image to be searched; acquiring, on the basis of the text heterogeneous graph network, text features and target text features of a text to be searched, the text heterogeneous graph network being constructed with the target text features and the target recognition features as nodes; acquiring image features of said image on the basis of the image heterogeneous graph network, the image heterogeneous graph network being constructed with original image features of said image and the target recognition features as nodes; and inputting the image features and the text features into the image-text bidirectional search model so as to obtain an image-text search result.
PCT/CN2022/142513 2022-11-08 2022-12-27 Procédé, appareil et dispositif de recherche bidirectionnelle d'image-texte, et support de stockage lisible non volatil WO2024098533A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211388778.5A CN115438215B (zh) 2022-11-08 2022-11-08 图文双向搜索及匹配模型训练方法、装置、设备及介质
CN202211388778.5 2022-11-08

Publications (1)

Publication Number Publication Date
WO2024098533A1 true WO2024098533A1 (fr) 2024-05-16

Family

ID=84252309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142513 WO2024098533A1 (fr) 2022-11-08 2022-12-27 Procédé, appareil et dispositif de recherche bidirectionnelle d'image-texte, et support de stockage lisible non volatil

Country Status (2)

Country Link
CN (1) CN115438215B (fr)
WO (1) WO2024098533A1 (fr)



Also Published As

Publication number Publication date
CN115438215A (zh) 2022-12-06
CN115438215B (zh) 2023-04-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22964997

Country of ref document: EP

Kind code of ref document: A1