WO2024098533A1 - Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium - Google Patents

Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium

Info

Publication number
WO2024098533A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
features
node
heterogeneous
Application number
PCT/CN2022/142513
Other languages
French (fr)
Chinese (zh)
Inventor
李仁刚
王立
范宝余
郭振华
Original Assignee
苏州元脑智能科技有限公司
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024098533A1


Classifications

    • G06F 16/5846 - Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content, using extracted text
    • G06F 16/953 - Retrieval from the web; querying, e.g. by the use of web search engines
    • G06N 3/049 - Computing arrangements based on biological models; neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods
    • G06V 30/19007 - Character recognition; recognition using electronic means; matching; proximity measures
    • G06V 30/19147 - Character recognition; design or setup of recognition systems or techniques; obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 30/19173 - Character recognition; design or setup of recognition systems or techniques; classification techniques
    • G06V 30/41 - Document-oriented image-based pattern recognition; analysis of document content
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of information retrieval technology, and in particular to a method, device, equipment and non-volatile readable storage medium for bidirectional search of images and texts.
  • the present application provides a method, device, equipment and non-volatile readable storage medium for bidirectional search between image data and text data, which effectively improves the accuracy of bidirectional search between image data and text data.
  • a first aspect of an embodiment of the present application provides a method for bidirectional search of images and texts, including:
  • the bidirectional image-text search model includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network;
  • text features of the text to be searched that only contains one type of target text data are obtained;
  • the target text features corresponding to the target text data include target recognition features;
  • the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • image features of the image to be searched including a group of sub-images are obtained; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;
  • the image features and text features are input into the image-text bidirectional search model to obtain the image-text search results.
  • after pre-training the image-text bidirectional search model, the method further includes:
  • the target recognition feature is split into a plurality of text phrases and/or text words, and the target text data is split into a plurality of text sentences;
  • Each text sentence is input into the text feature extraction model to obtain multiple second-category node features.
  • optionally, the method further includes:
  • the language representation model includes a text information input layer, a feature extraction layer, and a text feature output layer;
  • the feature extraction layer is a transformer-based bidirectional encoder;
  • the language representation model is trained using a natural language text sample dataset, and the trained language representation model is used as a text feature extraction model.
  • each text sentence is input into a text feature extraction model, including:
  • each text sentence, together with the position information of each phrase and each word it contains within the current text sentence, is input into the text feature extraction model.
  • before inputting each text phrase and/or text word into a pre-built text feature extraction model to obtain a plurality of first-category node features, and before inputting each text sentence into the text feature extraction model to obtain a plurality of second-category node features, the method further includes:
  • the data type includes a first identifier for identifying a target identification feature and a second identifier for identifying target text data.
  • connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition feature and the target text feature, including: for each text phrase and/or text word, traversing each text sentence of the target text data in turn;
  • if the target phrase contained in the current text sentence is the same as the current text phrase, the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase;
  • if the target word contained in the current text sentence is the same as the current text word, the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word.
  • obtaining target recognition features of target image blocks contained in each sub-image of the image to be searched includes:
  • an image recognition network is trained in advance using a target training sample set in which corresponding target recognition features are annotated in image samples containing a plurality of sub-images;
  • the image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • before training the image recognition network using the target training sample set in which corresponding target recognition features are annotated in the image samples including the plurality of sub-images, the method further includes:
  • Pre-build the target recognition network structure which includes input layer, convolution structure, pooling layer and classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes multiple connected residual blocks, each residual block includes multiple convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;
  • the pooling layer is used to convert the output features of the convolution structure into target feature vectors and transmit them to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
  • the text heterogeneous graph network includes multiple layers of first graph attention networks, and each layer of the first graph attention network is further integrated with a first fully connected layer; obtaining text features of the text to be searched that only contains one type of target text data includes:
  • the node feature of the current text heterogeneous node is updated;
  • the node feature of the current text heterogeneous node is updated, including:
  • the initial weight values of the current text heterogeneous node and each target text heterogeneous node are calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;
  • the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
  • the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated, including:
  • the weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula is:

    $z_{qp} = \mathrm{LeakyReLU}\left(W_a^{\mathrm{T}}\left[W_b f_q \,\|\, W_c f_p\right]\right)$

  • where $z_{qp}$ is the initial weight value of the qth text heterogeneous node and the pth text heterogeneous node, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, $W_b, W_c \in \mathbb{R}^{d \times d}$ and $W_a \in \mathbb{R}^{2d}$ are known-dimension matrices, $f_q$ and $f_p$ are the node features of the qth and pth text heterogeneous nodes, and $\|$ denotes concatenation.
  • the node feature of the current text heterogeneous node is updated, including: calling the initial update relation

    $\tilde{f}_q = \lambda \sum_{p=1}^{N_P} a_{qp}\, W_v f_p$

  • where $\lambda$ is a hyperparameter, $a_{qp}$ is the normalized weight of the qth step node and the pth component node, $W_v$ is a known-dimension matrix, and $N_P$ is the total number of target text heterogeneous nodes.
  • the second-class node features corresponding to the target text data have a sequential execution order, and based on the text heterogeneous graph network, after obtaining the text features of the text to be searched that only contains one class of target text data, the method further includes:
  • the time series information features are mapped to the text features through the fully connected layer.
  • each second-category node feature and sequence information is input into a pre-trained time series feature extraction model to obtain time series information features, including:
  • the features of each second type of node are input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second type of node feature;
  • the temporal information feature is determined according to the temporal coding feature of each second-category node feature.
  • each second-category node feature is input into a bidirectional long short-term memory neural network in order and in reverse order to obtain a temporal coding feature of each second-category node feature, including:
  • the forward encoding relation is called to perform forward encoding on the current second-category node feature to obtain the forward coding feature; the forward encoding relation is:

    $\overrightarrow{h}_q = \overrightarrow{\mathrm{LSTM}}\left(f_q, \overrightarrow{h}_{q-1}\right),\quad q \in [1, Q]$

  • the reverse encoding relation is called to perform reverse encoding on the current second-category node feature to obtain the reverse coding feature; the reverse encoding relation is:

    $\overleftarrow{h}_q = \overleftarrow{\mathrm{LSTM}}\left(f_q, \overleftarrow{h}_{q+1}\right),\quad q \in [1, Q]$

  • the forward coding feature and the reverse coding feature are used together as the temporal coding feature of the current second-category node feature;
  • where $\overrightarrow{h}_q$ is the output of the qth unit in the forward encoding direction of the bidirectional long short-term memory neural network, Q is the total number of second-category node features, $\overleftarrow{\mathrm{LSTM}}(\cdot)$ is the backward encoding function and $\overrightarrow{\mathrm{LSTM}}(\cdot)$ is the forward encoding function of the bidirectional long short-term memory neural network.
  • the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; obtaining image features of an image to be searched including a group of sub-images, including:
  • for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node feature of the current image heterogeneous node is updated according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes;
  • the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • a second aspect of the embodiment of the present application provides a device for bidirectional search of images and texts, including:
  • An image recognition module is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain target recognition features of the target image block contained in each sub-image of the image to be searched;
  • a text feature extraction module is used for obtaining text features of a text to be searched that contains only one type of target text data based on a text heterogeneous graph network of a bidirectional search model for text and images; target text features corresponding to the target text data include target recognition features; target recognition features and target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • An image feature extraction module is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image feature;
  • the bidirectional search module is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results;
  • the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
  • a third aspect of the present application embodiment provides a method for training an image-text matching model, comprising:
  • a text heterogeneous graph network of the image-text bidirectional search model is constructed;
  • an image heterogeneous graph network of the image-text bidirectional search model is constructed;
  • the image features of each group of training samples are input into the image heterogeneous graph network, and the text features are input into the text heterogeneous graph network to train the image-text bidirectional search model.
  • a fourth aspect of the embodiments of the present application provides a training device for an image-text matching model, comprising:
  • the feature extraction module is used to obtain, for each group of training samples in the training sample set, the original image features, target recognition features and image features of the image samples in the current group of training samples, as well as the target text features and text features of the text samples;
  • the target text features include the target recognition features;
  • the image samples include a group of sub-images;
  • a model building module is used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining the connection edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining the connection edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
  • the model training module is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • a fifth aspect of the embodiment of the present application further provides a bidirectional image-text search device, including a processor, a memory, a human-computer interaction component, and a communication component;
  • the human-computer interaction component is used to receive training sample set selection requests, model training requests, and search requests input by users through the information input/information output interface, and to display graphic and text search results to users;
  • the communication component is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task;
  • the processor is used to execute the computer program stored in the memory to implement the steps of any of the above-mentioned image-text bidirectional search methods and/or the above-mentioned image-text matching model training method.
  • the sixth aspect of the embodiment of the present application further provides a non-volatile readable storage medium on which a computer program is stored;
  • when the computer program is executed by a processor, the steps of any of the preceding image-text bidirectional search methods and/or the preceding image-text matching model training method are implemented.
  • a graph neural network for extracting corresponding features is constructed based on the data contained in a text containing only one type of text data and an image containing a group of sub-images and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation relationships in the real world, and image features that reflect the images and their internal correlation relationships in the real world.
  • Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation relationship between the fine-grained features of images and texts, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.
  • the embodiments of the present application also provide, for the image-text bidirectional search method, a training method for an image-text matching model together with a corresponding implementation apparatus, an image-text bidirectional search device and a non-volatile readable storage medium, which further makes the image-text bidirectional search method more practical; the method, apparatus, device and non-volatile readable storage medium have corresponding advantages.
  • FIG1 is a schematic diagram of a flow chart of a method for bidirectional image and text search provided by an embodiment of the present application
  • FIG2 is a schematic diagram of a text heterogeneous graph network structure provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of an image heterogeneous graph network structure provided in an embodiment of the present application.
  • FIG4 is a flow chart of a method for training an image-text matching model provided in an embodiment of the present application.
  • FIG5 is a structural diagram of an implementation of a cross-media retrieval device provided in an embodiment of the present application.
  • FIG6 is a structural diagram of an implementation of a training device for an image-text matching model provided in an embodiment of the present application.
  • FIG7 is a structural diagram of an implementation of a bidirectional image-text search device provided in an embodiment of the present application.
  • FIG8 is a structural diagram of another implementation of a bidirectional image-text search device provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application.
  • FIG. 1 is a flow chart of a method for bidirectional image-text search provided by an embodiment of the present application.
  • the embodiment of the present application may include the following contents:
  • the bidirectional image-text search model of this embodiment is used to perform bidirectional search tasks between text data and image data; that is, image data matching the text data to be searched can be determined from a known image database, and text data matching the image data to be searched can likewise be determined from a known text database.
  • the bidirectional image-text search model of this embodiment includes a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network; the text heterogeneous graph network is used to process input text data such as text samples or text to be searched and finally output text features corresponding to the text data, and the image heterogeneous graph network is used to process input image data such as image samples or images to be searched, and output the final image features of the image data.
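  • For orientation only, the composition of these three networks can be sketched as follows. This is a minimal structural sketch in PyTorch, not the patent's implementation; all module and argument names (text_graph_net, image_graph_net, recognition_net) are hypothetical placeholders:

    import torch.nn as nn

    class BidirectionalSearchModel(nn.Module):
        """Hypothetical composition of the three networks described above."""
        def __init__(self, text_graph_net: nn.Module,
                     image_graph_net: nn.Module,
                     recognition_net: nn.Module):
            super().__init__()
            self.text_graph_net = text_graph_net    # text heterogeneous graph network
            self.image_graph_net = image_graph_net  # image heterogeneous graph network
            self.recognition_net = recognition_net  # image recognition network

        def forward(self, text_inputs, image_inputs):
            # Target recognition features come from the recognition network (S102)
            target_feats = self.recognition_net(image_inputs)
            text_feat = self.text_graph_net(text_inputs, target_feats)     # S103
            image_feat = self.image_graph_net(image_inputs, target_feats)  # S104
            return text_feat, image_feat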
  • the text heterogeneous graph network and the image heterogeneous graph network can be built based on any graph structure in any related technology, which does not affect the implementation of this application.
  • the image recognition network is used to identify the category information of a certain type of image block in an image such as an image to be searched and an image sample used in the training model process, that is, the final output is the recognition label information corresponding to the specified recognition target included in the input image, which is called the target recognition feature for the convenience of description.
  • S102 calling an image recognition network to obtain target recognition features of a target image block contained in each sub-image of the image to be searched.
  • the image to be searched and the subsequent image samples of this embodiment include a group of sub-images, that is, a group of sub-images together constitute the image to be searched. For example, the image to be searched may be a recipe step image, where each step corresponds to a sub-image and the recipe step image includes the sub-images corresponding to each step.
  • the image blocks containing a certain type of designated information of the corresponding text data in the image to be searched are called target image blocks, and the recognition information of these target image blocks constitutes the target recognition features; that is, the target recognition feature is the label information of the target image block in the image to be searched or the image sample, and the label information belongs to this type of designated information.
  • for example, in a recipe scenario the designated information can be the recipe ingredients, the target image block is the image block that identifies the recipe ingredients, and the target recognition feature is the recipe ingredient information to which each target image block belongs;
  • in an electronic device manual scenario, the designated information is the product structure of the electronic device, the target image block is the image block that identifies the product structure, and the target recognition feature is the identification information that the target image block belongs to a certain type of product structure, such as a switch or indicator light.
  • S103 Based on the text heterogeneous graph network, obtain text features of the text to be searched that only contains one type of target text data.
  • the text of the present application includes the text to be searched and the text samples in the training sample set used in the subsequent model training process, which only contain one type of text data.
  • the so-called one type of text data refers to the data in the text being of the same type.
  • the recipe text may include three types of text data: dish name, recipe ingredients, and cooking steps.
  • the text to be searched and the text samples of the present application contain only one of these types of text data, for example only the cooking steps;
  • taking a server manual as another example, this type of text may include two types of text data, namely server structure composition and working principle;
  • the text to be searched and the text samples then likewise contain only one type of text data, for example only the working principle of the server.
  • the corresponding text features are obtained by calculating the text heterogeneous graph network based on the text to be searched.
  • the text features of this embodiment refer to the features obtained after performing graph structure operations on the text heterogeneous graph network, and the target text features are the data obtained by directly extracting the text to be searched using the text feature extraction method.
  • the target text feature corresponding to the target text data includes the target recognition feature.
  • the so-called inclusion relationship means that the target recognition feature exists in the target text feature corresponding to the target text data.
  • taking the recipe text as an example, the target recognition feature represents the recipe ingredients and the target text feature represents the cooking steps; taking the electronic device manual as an example, the target recognition feature can be the product structure of the electronic device and the target text feature can be the instruction manual. There is an inclusion relationship between the target text feature and the target recognition feature of this embodiment.
  • the target recognition feature is composed of the recognition features corresponding to multiple target image blocks of each sub-image.
  • the recognition feature of each target image block of each sub-image can be called a first-class node feature
  • the target text feature is composed of multiple text features, each of which is called a second-class node feature.
  • for a specified first-class node feature, if it is included in a second-class node feature, then the first-class node feature has an association relationship with that second-class node feature. After obtaining the target text features of the text to be searched and the target recognition features of the image to be searched, by analyzing each second-class node feature of the target text features, it is determined whether it contains one or several first-class node features of the target recognition features, and the association relationship between the target recognition features and the target text features can thereby be determined.
  • these two different types of features are used as heterogeneous node features of the graph structure network, and the connection edges of the graph structure network can be determined according to whether there is an inclusion relationship between different node features, that is, the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features.
  • the features corresponding to the graph structure can be extracted by performing graph structure operations, and this type of features is used as the text features in this step.
  • the image heterogeneous graph network of this step also includes nodes and connecting edges.
  • the nodes of the image heterogeneous graph network of this embodiment are heterogeneous nodes, that is, there are at least two features with different properties and structures.
  • the extracted image features alone can serve as only one kind of node feature; since the image features and text features have an associated corresponding relationship, the target recognition features extracted in S102 can also be used as node features of the image heterogeneous graph network.
  • the first-class node feature can be used as the heterogeneous node feature of the image heterogeneous graph network, that is, the original image features of the image to be searched and the target recognition features are used as the node features of the image heterogeneous graph network.
  • the connecting edges of the image heterogeneous graph network are determined by the association between the target recognition features and the original image features.
  • the original image features refer to the image features extracted directly using image feature extraction methods such as convolutional neural networks, VGG16 (Visual Geometry Group network) or ResNet (deep residual network).
  • the image features in this step are obtained by substituting the image features of each sub-image of the image to be searched into the image heterogeneous graph network and performing graph structure operations on the image heterogeneous graph network.
  • S105 Input the image features and text features into the image-text bidirectional search model to obtain image-text search results.
  • the image-text search result of this embodiment refers to the matching degree of the text features extracted in step S103 and the image features extracted in step S104. That is, after the text features and the image features are input into the image-text bidirectional search model, the model can determine whether the features are close by calculating a vector distance such as the Euclidean distance, as sketched below. If they are close, the image to be searched and the text to be searched match, that is, they are a set of mutually corresponding data; if they are not close, the image to be searched and the text to be searched do not match.
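  • A minimal sketch of this matching step, assuming plain Euclidean distance over d-dimensional feature vectors; the function name match_score and the ranking policy are illustrative assumptions, not the patent's prescribed procedure:

    import torch

    def match_score(text_feat: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        """Euclidean distance between one text feature and each candidate image feature.

        Smaller distance means closer features, hence a better image-text match,
        as described for S105."""
        # text_feat: (d,); image_feats: (N, d) for N candidate images
        return torch.cdist(text_feat.unsqueeze(0), image_feats).squeeze(0)  # (N,)

    # Usage: rank a gallery of image features against one query text feature.
    # distances = match_score(e_rec, gallery)   # e_rec from the text branch
    # best = distances.argmin()                 # most closely matching image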
  • graph neural networks for extracting corresponding features are constructed based on the data contained in the text and image and their internal relationships, which is conducive to extracting text features that can reflect the text and its internal correlation in the real world, and image features that reflect the image and its internal correlation in the real world.
  • Model training is performed based on the extracted text features and image features, which is conducive to fully exploring the correlation between the fine-grained features of the image and the text, thereby obtaining a high-precision image-text bidirectional retrieval model, effectively improving the mutual retrieval accuracy of image data and text data.
  • the present application also provides an optional extraction implementation method of the target identification features, which may include:
  • An image recognition network is trained in advance using a target training sample set in which corresponding target recognition features are annotated in an image sample containing multiple sub-images; the image to be searched is input into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • the image recognition network is used to identify the category information of the target image block in the image to be searched, and the target training sample set contains multiple images marked with target features, that is, each image sample contained in the target training sample set carries a category label.
  • Each image can be an image directly obtained from the original database, or it can be an image obtained by flipping, resizing, stretching, etc. the original image, which does not affect the implementation of the present application.
  • the image recognition network can be built based on any existing model structure that can recognize image categories, such as convolutional neural networks, artificial neural networks, etc., and the present application does not impose any restrictions on this.
  • the target recognition network structure may include an input layer, a convolution structure, a pooling layer and a classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;
  • the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
  • the present application takes recipe text and recipe image as examples to illustrate the implementation process of the present embodiment, that is, the process of classifying the main components of each recipe image through an image classification network and constructing component nodes with the classified category information may include:
  • a step diagram dataset is generated through multiple recipe step diagrams, and the main components of some recipe step diagrams are annotated, such as flour, sugar, papaya, etc.
  • the annotated recipe step diagrams are used to train the ResNet50 network to classify the main components of the image.
  • the ResNet50 network structure can include seven parts. The first part does not contain residual blocks, and mainly performs convolution, regularization, activation function, and maximum pooling calculations on the input. The second, third, fourth, and fifth parts of the structure all contain residual blocks. Each residual block contains three layers of convolution. After the convolution calculation of the first five parts, the pooling layer converts it into a feature vector. Finally, the classifier calculates this feature vector and outputs the category probability.
  • the trained ResNet50 network can obtain the main component information of the input image very well.
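  • A hedged sketch of such an ingredient classifier built on torchvision's standard ResNet50, whose stem (convolution, regularization, activation, max pooling) and four residual stages match the seven-part description above. The label vocabulary size NUM_INGREDIENTS and the multi-label BCE setup are assumptions; the patent only states that the classifier outputs category-label probabilities:

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_INGREDIENTS = 1000  # assumption: size of the ingredient label vocabulary

    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    net.fc = nn.Linear(net.fc.in_features, NUM_INGREDIENTS)  # replace classifier

    # Multi-label setup is an assumption: one step image may show several
    # ingredients, so each label gets an independent probability.
    criterion = nn.BCEWithLogitsLoss()

    def train_step(images: torch.Tensor, labels: torch.Tensor,
                   opt: torch.optim.Optimizer) -> float:
        """images: (B, 3, H, W); labels: (B, NUM_INGREDIENTS) multi-hot."""
        opt.zero_grad()
        logits = net(images)
        loss = criterion(logits, labels.float())
        loss.backward()
        opt.step()
        return loss.item()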
  • going from the text to be searched to the target text features, that is, the second type of text features, requires a text feature extraction operation.
  • the present application also provides an optional implementation of text features, which may include the following contents:
  • the target recognition feature is split into multiple text phrases and/or text words, and the target text data is split into multiple text sentences; each text phrase and/or text word is input into a pre-trained text feature extraction model to obtain multiple first-category node features; each text sentence is input into the text feature extraction model to obtain multiple second-category node features.
  • the text splitting instruction is used to split the text to be searched into multiple text sentences, and the target recognition feature is split into multiple text phrases or text words, and any text data splitting algorithm can be used.
  • the method for determining each connection edge in the text heterogeneous graph network can be: for each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in turn; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase; if the target word contained in the current text sentence is the same as the current text word, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text word.
  • the text feature extraction model of this embodiment is used to extract text features from input text data or target recognition features.
  • the training process of the text feature extraction model is: building a language representation model, where the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer, and the feature extraction layer is a transformer-based bidirectional encoder; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as the text feature extraction model.
  • the language representation model may be, for example, Bert (Bidirectional Encoder Representation from Transformers, a pre-trained language representation model) or word2vec (word to vector, a word vector model), which does not affect the implementation of the present application.
  • the data type may also be set for the text data at the same time, and the data type includes a first identifier for identifying the target recognition feature and a second identifier for identifying the target text data or the target text feature.
  • the data type of the data to be input into the text feature extraction model at the next moment is obtained, and the position information of each text sentence and each phrase and word contained in each text sentence in the current text sentence may also be input into the text feature extraction model.
  • the data type is input into the text feature extraction model together with the corresponding data.
  • second-category text features can be obtained by extracting target text data from the text to be searched.
  • the present application further performs temporal feature extraction and provides a method for extracting temporal features, which may include the following contents:
  • each second-category node feature and sequence information are input into a pre-trained temporal feature extraction model to obtain temporal information features.
  • the temporal feature extraction model can be a bidirectional long short-term memory neural network. Accordingly, based on the sequence between each second-category node feature, each second-category node feature can be input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second-category node feature; the temporal information features are determined according to the temporal coding features of each second-category node feature.
  • the temporal coding features can include forward coding features and reverse coding features.
  • the extracted temporal information features can be mapped to the text features through a fully connected layer.
  • the acquisition of the forward coding features and reverse coding features can be carried out as follows: the forward encoding relation is called to perform forward encoding on the current second-category node feature to obtain the forward coding feature; the forward encoding relation can be expressed as:

    $\overrightarrow{h}_q = \overrightarrow{\mathrm{LSTM}}\left(f_q, \overrightarrow{h}_{q-1}\right),\quad q \in [1, Q]$

  • the reverse encoding relation is called to perform reverse encoding on the current second-category node feature to obtain the reverse coding feature; the reverse encoding relation can be expressed as:

    $\overleftarrow{h}_q = \overleftarrow{\mathrm{LSTM}}\left(f_q, \overleftarrow{h}_{q+1}\right),\quad q \in [1, Q]$

  • where $\overrightarrow{h}_q$ is the output of the qth unit in the forward encoding direction of the bidirectional long short-term memory neural network, Q is the total number of second-category node features, $\overleftarrow{\mathrm{LSTM}}(\cdot)$ is the backward encoding function and $\overrightarrow{\mathrm{LSTM}}(\cdot)$ is the forward encoding function.
  • this embodiment can also be implemented based on a long short-term memory neural network.
  • the relation $h_q = \mathrm{LSTM}\left(f_q, h_{q-1}\right),\ q \in [1, Q]$ can be called to obtain the time series feature information, where $h_q$ represents the output of the qth unit of the LSTM and $h_{q-1}$ represents the output of the (q-1)th unit, that is, the output of the previous state.
  • the above embodiment does not limit how to generate text features based on the text heterogeneous graph network.
  • the extraction of text features is obtained through heterogeneous graph operations, and heterogeneous graph operations are also the process of updating the nodes of the text heterogeneous graph network.
  • This embodiment provides an optional implementation method, which may include the following contents:
  • the embodiment may stack multiple layers of the same structure.
  • each layer may be called a first graph attention network, and a first fully connected layer is also integrated after each layer of the first graph attention network; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node feature of the current text heterogeneous node is updated according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
  • the process of updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node has a connection relationship with the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes may include:
  • the initial weight values of the current text heterogeneous node and each target text heterogeneous node are calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;
  • the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
  • the process of calculating the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association relationship between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node may include:
  • the weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula can be:

    $z_{qp} = \mathrm{LeakyReLU}\left(W_a^{\mathrm{T}}\left[W_b f_q \,\|\, W_c f_p\right]\right)$

  • where $z_{qp}$ is the initial weight value of the qth text heterogeneous node and the pth text heterogeneous node, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, $W_b, W_c \in \mathbb{R}^{d \times d}$ are known-dimension matrices and $W_a \in \mathbb{R}^{2d}$ is a known-dimension real vector, $f_q$ is the node feature of the qth text heterogeneous node, and $f_p$ is the node feature of the pth text heterogeneous node.
  • the node feature of the current text heterogeneous node is updated, including: calling the initial update relation to update the node features of the current text heterogeneous node; the initial update relation can be expressed as:

    $\tilde{f}_q = \lambda \sum_{p=1}^{N_P} a_{qp}\, W_v f_p$

  • where $\lambda$ is a hyperparameter, $a_{qp}$ is the normalized weight of the qth step node and the pth component node, $W_v$ is a known-dimension matrix, and $N_P$ is the total number of target text heterogeneous nodes.
  • the text to be searched is a recipe text
  • the recipe text includes cooking step data, which can be referred to as steps, and the cooking steps have a sequence.
  • the generation process of the entire text feature is described below:
  • text features are constructed into a graph structure, which includes nodes, node features and connection relationships.
  • each text feature extracted from the first type of text data and each text feature extracted from the second type of text data are used as nodes of the graph structure, and the connection relationships between the nodes, e.g. $e_{11}$, $e_{32}$, $e_{33}$ in FIG. 2, are the connection edges of the graph structure.
  • the image to be searched in this embodiment is a recipe step diagram.
  • a step diagram data set is generated through multiple recipe step sample diagrams, and the main components of some recipe step sample diagrams are annotated, such as flour, sugar, papaya, etc.
  • the ResNet50 network is trained using the annotated recipe step sample diagram to classify the main components of the image.
  • the image to be searched, that is, the recipe step diagram to be searched, is input into the trained ResNet50 network to obtain the main component information of the recipe step diagram, that is, the corresponding target recognition feature.
  • the components and steps differ in both structure and nature, so they are called heterogeneous nodes.
  • each step is called a node, and similarly, each component is called a node.
  • a node is composed of a sentence or a phrase.
  • the Bert model can be used to extract the features of each sentence or each word, and the implementation method is as follows:
  • the recipe text, concatenated with the principal component information extracted from it, is input at the bottom text information layer, and the position information and data type accompanying the recipe text information and the principal component information are also input.
  • Position information means that if there are 5 words "peel and slice the mango" in a sentence, their position information is "1, 2, 3, 4, 5" respectively.
  • the data type means: if the input is step data, its data type is 1; if the input is component data, its data type is 2.
  • This feature is used to represent the node features, namely the component node features and the step node features.
  • the component node features and the step node features are both high-dimensional vectors, namely d-dimensional real vectors, as sketched below.
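  • A minimal sketch of this node-feature extraction with the HuggingFace transformers BERT implementation. BERT's built-in position embeddings play the role of the position information; mapping the patent's data types 1 (step) and 2 (component) onto BERT's token_type_ids 0/1, and the mean pooling of token embeddings, are assumptions:

    import torch
    from transformers import BertModel, BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")

    def node_feature(text: str, is_component: bool) -> torch.Tensor:
        """Encode one step sentence or one ingredient phrase as a node feature."""
        enc = tok(text, return_tensors="pt")
        # Data type: 0 for step data, 1 for component data (assumed mapping)
        enc["token_type_ids"] = torch.full_like(enc["token_type_ids"],
                                                int(is_component))
        out = bert(**enc)
        # Mean-pool token embeddings into one d-dimensional node vector (d = 768)
        return out.last_hidden_state.mean(dim=1).squeeze(0)

    step_node = node_feature("peel and slice the mango", is_component=False)
    comp_node = node_feature("mango", is_component=True)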
  • the step information can be traversed through the text comparison method, each step text can be extracted, and then the principal component can be searched in turn. If the word in the principal component appears in the step, an edge is connected between the step and the principal component, that is, there is a connection relationship.
  • the connection relationship between the step node and the component node can be constructed, that is, the connection relationship of the heterogeneous graph.
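  • A minimal sketch of this edge-construction step; the plain substring match is an illustrative simplification of the text comparison described above (a real system may need tokenization or lemmatization):

    from typing import List, Tuple

    def build_edges(steps: List[str], components: List[str]) -> List[Tuple[int, int]]:
        """Connect step node q to component node p when the component word
        appears in the step text, mirroring the traversal described above."""
        edges = []
        for q, step in enumerate(steps):
            for p, comp in enumerate(components):
                if comp in step:
                    edges.append((q, p))
        return edges

    # Example: "mango" appears in the step, so edge (0, 0) is created.
    print(build_edges(["peel and slice the mango"], ["mango", "flour"]))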
  • the heterogeneous graph information update can use the graph attention network to realize feature aggregation and update.
  • the update method is to traverse each heterogeneous node in turn for update.
  • the aggregation and extraction of text features are realized by heterogeneous graph operations.
  • the calculation method can be as follows:
  • to update a step node: let $f_q$ be the node feature of the qth step node and $f_p$ the feature of the pth component node. If the qth step node is connected by an edge to the pth component node, the feature of the pth component node is used to update the feature of the qth step node.
  • the correlation between the nodes needs to be considered.
  • the correlation between the nodes can be represented by assigning weights.
  • the following relation (1) can be used to calculate the correlation weight $z_{qp}$ between the qth step node and the pth component node feature. For each step node $f_q$, all component nodes that have edges connected to it are traversed, assuming there are $N_P$ such nodes, and the corresponding correlation weights $z_{qp}$ are obtained:

    $z_{qp} = \mathrm{LeakyReLU}\left(W_a^{\mathrm{T}}\left[W_b f_q \,\|\, W_c f_p\right]\right) \quad (1)$

  • where $W_a$, $W_b$ and $W_c$ are known-dimension matrices and $W_b f_q$ denotes matrix multiplication, i.e. vector mapping.
  • the correlation weights of all component nodes connected by edges to the step node can be normalized, that is, the following relation (2) can be called to obtain the normalized correlation weight $a_{qp}$:

    $a_{qp} = \frac{\exp(z_{qp})}{\sum_{p'=1}^{N_P} \exp(z_{qp'})} \quad (2)$

  • where $a_{qp}$ represents the normalized weight of the qth step node and the pth component node, $\exp$ represents the exponential function, $\exp(z_{qp})$ is the exponential of $z_{qp}$, the denominator is the sum of the correlation weights of all component nodes connected by edges to the step node, and $p' = 1$ denotes the first component node.
  • finally, the node feature of the step node is updated with the normalized correlation weights, that is, the following relation (3) is called:

    $\tilde{f}_q = \sum_{p \in N_q} a_{qp}\, W_v f_p \quad (3)$

  • where $W_v$ is a $d \times d$ known-dimension matrix, $\tilde{f}_q$ is the new feature vector of the qth step node updated from the component nodes connected to it, and $N_q$ is the set of neighbor component nodes of the qth step node.
  • the network update of one layer of the graph attention network is completed.
  • T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network.
  • the update method of the node features of each layer is as above.
  • an integrated fully connected layer is added after each layer of the graph attention network to re-encode the node features (including component nodes and step nodes), as shown in the following relation (6):

    $f^{(t+1)} = \mathrm{FFN}\left(\tilde{f}^{(t)}\right) \quad (6)$

  • where FFN denotes the fully connected layer and $f^{(t+1)}$ represents the initial node features of the graph attention network at layer t+1.
  • the update of the node features is completed.
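  • A minimal sketch of one such layer, implementing relations (1) to (3) together with the residual sum and the relation-(6) FFN; the tensor layouts, the masked softmax over non-neighbors, and all names are assumptions rather than the patent's exact formulation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextHeteroGATLayer(nn.Module):
        """Sketch of relations (1)-(3) plus residual sum and relation (6)."""
        def __init__(self, d: int):
            super().__init__()
            self.W_b = nn.Linear(d, d, bias=False)      # maps step node f_q
            self.W_c = nn.Linear(d, d, bias=False)      # maps component node f_p
            self.w_a = nn.Linear(2 * d, 1, bias=False)  # attention vector W_a
            self.W_v = nn.Linear(d, d, bias=False)      # value mapping
            self.ffn = nn.Linear(d, d)                  # relation (6) re-encoding

        def forward(self, f_q: torch.Tensor, f_p: torch.Tensor,
                    adj: torch.Tensor) -> torch.Tensor:
            # f_q: (Q, d) step nodes; f_p: (P, d) component nodes;
            # adj: (Q, P) 0/1 tensor of connection edges
            pair = torch.cat(
                [self.W_b(f_q).unsqueeze(1).expand(-1, f_p.size(0), -1),
                 self.W_c(f_p).unsqueeze(0).expand(f_q.size(0), -1, -1)], dim=-1)
            z = F.leaky_relu(self.w_a(pair).squeeze(-1))   # relation (1): (Q, P)
            z = z.masked_fill(adj == 0, float("-inf"))     # keep only neighbors
            a = torch.softmax(z, dim=-1)                   # relation (2)
            a = torch.nan_to_num(a)                        # nodes with no edges
            upd = a @ self.W_v(f_p)                        # relation (3): (Q, d)
            return self.ffn(f_q + upd)                     # residual + FFN (6)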
  • the step node integrates the ingredient node information
  • the ingredient node is updated through the graph neural network, and the relevant step node features are emphasized in the form of keywords.
  • the BiLSTM (Bi-directional Long Short-Term Memory) method can be used to further mine the temporal information of the step node, realize the induction and synthesis of the text node features, and package them into a vector.
  • the left and right arrows represent the direction of LSTM encoding, that is, the forward and reverse encoding of step node features.
  • the different directions of the arrows represent the BiLSTM encoding output obtained according to the different order of step node input.
  • the output of the entire text feature can be obtained by summing and averaging the temporal coding features of the step nodes, as sketched below;
  • $e_{rec}$ represents the output text feature, which is used for the subsequent retrieval step.
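  • A minimal sketch of this temporal packaging step with a standard bidirectional LSTM; the feature dimension d, the mean pooling, and the fully connected mapping layer are assumptions consistent with the description above:

    import torch
    import torch.nn as nn

    d = 768  # assumed node feature dimension

    bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True,
                     batch_first=True)
    fc = nn.Linear(2 * d, d)  # maps the temporal coding back to dimension d

    def text_output(step_nodes: torch.Tensor) -> torch.Tensor:
        """step_nodes: (Q, d) updated step node features in cooking order.

        The forward/backward LSTM passes realize the forward and reverse
        encodings; averaging over the Q outputs packages the sequence into a
        single text feature e_rec (pooling choice assumed)."""
        h, _ = bilstm(step_nodes.unsqueeze(0))  # (1, Q, 2d)
        e_rec = fc(h.mean(dim=1)).squeeze(0)    # average over steps -> (d,)
        return e_rec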
  • the image heterogeneous graph network may include multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node features of the current image heterogeneous node are updated according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image encoding features of the image to be searched are generated; the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • the image feature extraction model is used to extract the original image features of the image to be searched and the image sample, which can be extracted based on any existing image feature extraction model, which does not affect the implementation of this application.
  • as for the graph operation of the image heterogeneous graph network, it can be implemented based on the graph operation method of the text heterogeneous graph network provided in the above embodiment, and it will not be repeated here.
  • the image targeted by this embodiment is an image containing a group of sub-images, and the image feature generation model is used to integrate all image features of the image to be searched.
  • this embodiment takes the image to be searched as a recipe step diagram as an example to illustrate the generation process of the entire image feature:
  • the ResNet backbone network can be used to extract the original image features of each recipe step diagram; the features of the ResNet network before the classification layer are taken as the features of each image and used to construct the image nodes of the image heterogeneous graph network, denoted as $f_m$, as sketched below. The components are the ingredients of a dish and are uniformly referred to as ingredients below.
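  • A minimal sketch of extracting these pre-classification-layer features with torchvision's ResNet50; the 2048-dimensional output is the standard ResNet50 pooled feature size, and the node symbol f_m follows the notation above:

    import torch
    from torchvision import models

    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc layer

    def step_image_node(img: torch.Tensor) -> torch.Tensor:
        """img: (3, H, W) preprocessed step image -> 2048-d node feature f_m,
        i.e. the ResNet feature taken before the classification layer."""
        with torch.no_grad():
            feat = backbone(img.unsqueeze(0))  # (1, 2048, 1, 1)
        return feat.flatten(1).squeeze(0)      # (2048,)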
  • the main ingredients of the dish in this embodiment are obtained by classifying the recipe step diagram to obtain category labels.
  • the dish has as many ingredients as the number of category labels obtained through image classification. For example, scrambled eggs with tomatoes includes labels such as tomatoes, eggs, and oil.
  • the image heterogeneous graph network contains nodes and relationships (edges). The step graph nodes are built from the step image features, and the component nodes represent the classification labels output for the images by the image classification network.
  • each category label, such as mango, corresponds to one component node.
  • the relationships are likewise established through the classification network: if the classification result of a step image contains a category, an edge is established between that step image feature and the corresponding component. As shown in Figure 3, mango appears in all step images, so all step images establish edges with it. With the nodes and edges established as above, the following describes how the image heterogeneous graph network is used for calculation to obtain the corresponding image features:
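The node/edge construction just described can be sketched as follows; `classify` and the label list are assumed helpers for illustration, not part of the patent.

```python
from typing import Callable, List, Set

def build_edges(step_images: list,
                labels: List[str],
                classify: Callable[[object], Set[str]]) -> List[List[int]]:
    """adj[m][n] == 1 iff step image m's classification contains label n."""
    adj = [[0] * len(labels) for _ in step_images]
    for m, img in enumerate(step_images):
        predicted = classify(img)          # e.g. {"mango", "sugar"}
        for n, label in enumerate(labels):
            if label in predicted:
                adj[m][n] = 1              # edge between step node m and component node n
    return adj
```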
  • to update the step nodes, let s_m denote the node feature of the mth step graph node and c_n the node feature of the nth component node. If the mth step graph node is connected by an edge to the nth component node, the feature of the nth component node is used to update the feature of the mth step graph node.
  • the correlation between the nodes needs to be considered. In this embodiment, the correlation between the nodes can be represented by assigning weights.
  • the following relationship (10) can be called to calculate the correlation weight z_mn between the mth step graph node feature and the nth component node feature; in relationship (10), W_d is a trainable dimensional matrix and W_f is applied through matrix multiplication.
  • the correlation weights of all component nodes connected by edges to the step graph node can be normalized, that is, the following relationship (11) can be called to obtain the normalized correlation weight a_mn:
  • a_mn = exp(z_mn) / Σ_{n'∈N_m} exp(z_mn'), where exp represents the exponential function and the denominator is the sum over all component nodes connected by edges to the step graph node. Finally, the node features of the step graph node are updated with the normalized correlation weights, that is, the following relation (12) is called for calculation:
  • s̃_m = Σ_{n∈N_m} a_mn (W_v c_n), where W_v is a trainable dimensional matrix and s̃_m is the new feature vector of the step graph node updated by the component nodes connected to it.
  • N_M represents the M step graph nodes connected to the component node, and relationship (14) can be called to perform the same calculation and update on the component nodes: c̃_n = Σ_{m∈N_M} a_nm (W_v s_m).
  • a_mn represents the normalized weight between the mth step node feature and the nth component node feature, and a_qp likewise represents the normalized weight between the qth step node feature and the pth component node feature; W_v represents the trainable weight matrix of the current layer of the network, applied through matrix multiplication.
  • the network update of one layer of the graph attention network is completed.
  • T layers of graph attention networks can be stacked, with t representing the t-th layer of the graph attention network.
  • the update method of the node features of each layer is as above.
  • an integrated fully connected layer is added after each layer of the graph attention network to re-encode the node features (including component nodes and step graph nodes), as shown in the following relationship (15):
  • e^{t+1,0} = FFN(e^t), where FFN stands for the fully connected layer and e^{t+1,0} represents the initial node features of the graph attention network at layer t+1.
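A compact sketch of one step-node update of this heterogeneous graph attention layer (relationships (10)-(12) and (15)); since the original equations are given only by reference, the exact attention-score form used here is an assumption in the spirit of standard graph attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepNodeUpdate(nn.Module):
    """One step-node update: masked attention over connected component nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_d = nn.Linear(dim, dim, bias=False)   # projects step features
        self.W_f = nn.Linear(dim, dim, bias=False)   # projects component features
        self.W_v = nn.Linear(dim, dim, bias=False)   # value projection
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, s: torch.Tensor, c: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # s: (M, D) step nodes, c: (N, D) component nodes, adj: (M, N) 0/1 edges
        z = self.W_d(s) @ self.W_f(c).t()            # (M, N) raw correlation weights z_mn
        z = z.masked_fill(adj == 0, float('-inf'))   # only nodes connected by an edge
        a = F.softmax(z, dim=1)                      # relationship (11): normalized a_mn
        a = torch.nan_to_num(a)                      # steps with no edges contribute zero
        s_new = a @ self.W_v(c)                      # relationship (12): weighted update
        return self.ffn(s_new + s)                   # residual sum + FFN re-encode (15)
```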
  • the image features can be input into the long short-term memory neural network LSTM to obtain the overall feature of the recipe step diagrams, that is, the feature obtained through the relationship h_m = LSTM(e_m, h_{m-1}).
  • LSTM represents each unit of the LSTM network, h_m represents the output of the mth LSTM unit, and e_m represents the recipe step graph feature, which comes from the heterogeneous graph node features of the last layer, with m indexing the mth image. Accordingly, the feature encoding output of the last LSTM unit is used as the feature output of the recipe step graph, that is, e_csi = h_M.
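As a quick hedged sketch (PyTorch; the feature size and step count are illustrative):

```python
import torch
import torch.nn as nn

# Run the final-layer image node features through an LSTM and take the
# last unit's output as the overall image feature e_csi.
lstm = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
node_feats = torch.randn(1, 7, 512)   # 7 step-image node features
out, _ = lstm(node_feats)             # out[:, m] is the mth LSTM unit output
e_csi = out[:, -1].squeeze(0)         # feature output of the recipe step graph
```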
  • this embodiment further provides a training method for a bidirectional search model of image data and text data, see FIG4 , which may include the following contents:
  • S402: For each group of training samples in the training sample set, respectively obtain the original image features, target recognition features and image features of the image samples in the current group of training samples, and the target text features and text features of the text samples.
  • the training sample set of this step includes multiple groups of training samples, each group of training samples includes a corresponding text sample and an image sample, that is, the text sample and the image sample are a set of sample data that match each other.
  • the number of training sample groups contained in the training sample set can be determined according to the actual training needs and the actual application scenarios, and this application does not impose any restrictions on this.
  • the text samples in the training sample set can be obtained from any existing database, and the image samples corresponding to the text samples can be obtained from the corresponding database.
  • of course, in order to expand the training sample set, the text samples or image samples can also be data obtained by processing the original text samples or original image samples through cropping, splicing, stretching, etc.
  • S405 Input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • the text feature information of a text sample corresponds to the image feature of an image sample.
  • a loss function is used to guide the training of the model, and the network parameters of the image-text bidirectional search model are then updated by methods such as gradient backpropagation until the model training conditions are met, such as reaching the set number of iterations or achieving good convergence.
  • the training process of the image-text bidirectional search model may include a forward propagation stage and a backpropagation stage.
  • the forward propagation stage is the stage in which data is propagated from a low level to a high level
  • the backpropagation stage is the stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not match the expectation.
  • the error is backpropagated back to the image-text bidirectional search model, and the backpropagation errors of each part of the image-text bidirectional search model such as the graph neural network layer, the fully connected layer, and the convolution layer are obtained in turn.
  • Each layer of the image-text bidirectional search model adjusts all weight coefficients of the image-text bidirectional search model according to the back propagation error of each layer to update the weight. Randomly select a new batch of image features and text feature information, and then repeat the above process to obtain the output value of the network forward propagation.
  • the model training ends. All layer parameters of the model corresponding to the end of model training are used as the network parameters of the trained image-text bidirectional search model.
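As an illustration of the two stages, a hedged training-loop sketch follows (PyTorch); `model` returning paired image/text embeddings and the `triplet_loss` sketched after the loss description below are assumptions for illustration.

```python
import torch

def train(model, loader, epochs: int = 10, lr: float = 1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts in loader:                  # a new random batch each step
            img_emb, txt_emb = model(images, texts)   # forward propagation stage
            loss = triplet_loss(img_emb, txt_emb)     # mismatch with expectation
            opt.zero_grad()
            loss.backward()                           # backpropagate errors layer by layer
            opt.step()                                # update all weight coefficients
    return model
```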
  • this embodiment also provides an optional implementation of the loss function, that is, based on the text features and corresponding image features of each group of training samples, a loss function is called to guide the training process of the image-text bidirectional search model; the loss function can be expressed as a bidirectional triplet loss (with d(·,·) a feature distance):
  • L = Σ_a max(0, d(e_csi^a, e_rec^p) − d(e_csi^a, e_rec^n) + α) + Σ_a max(0, d(e_rec^a, e_csi^p) − d(e_rec^a, e_csi^n) + α)
  • N is the number of training sample groups; the model training traverses all N paired samples in the batch.
  • the image group features e_csi are traversed (N in total), and the image sample selected in each step is called the anchor sample, denoted e_csi^a, where a stands for anchor.
  • the text feature encoding paired with the anchor sample is recorded as e_rec^p, where p stands for positive.
  • the unpaired text features are recorded as e_rec^n; α is a hyperparameter (the margin), which is fixed during training, for example, set to 0.3.
  • the same traversal operation is performed for the text features: e_rec^a represents the text sample selected in the traversal, its corresponding positive image group feature sample is recorded as e_csi^p, the non-corresponding one as e_csi^n, and α is the same hyperparameter.
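A sketch of this bidirectional triplet loss, assuming Euclidean distance and averaging over all unpaired negatives in the batch; the function name and these two choices are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                 margin: float = 0.3) -> torch.Tensor:
    # img_emb, txt_emb: (N, D); row i of each is a matched image-text pair
    d = torch.cdist(img_emb, txt_emb)               # (N, N) pairwise distances
    pos = d.diag()                                  # d(anchor, positive)
    neg_mask = ~torch.eye(len(d), dtype=torch.bool) # off-diagonal = unpaired samples
    # image anchor vs. unpaired texts, and text anchor vs. unpaired images
    i2t = F.relu(pos.unsqueeze(1) - d + margin)[neg_mask].mean()
    t2i = F.relu(pos.unsqueeze(0) - d + margin)[neg_mask].mean()
    return i2t + t2i
```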
  • the embodiment of the present application also provides a corresponding device for the image-text bidirectional search method and the image-text matching model training method, which further makes the method more practical.
  • the device can be described from the perspective of functional modules and hardware.
  • the image-text bidirectional search device and the image-text matching model training device provided in the embodiment of the present application are introduced below.
  • the image-text bidirectional search device and the image-text matching model training device described below can be referenced to each other with the image-text bidirectional search method and the image-text matching model training method described above.
  • FIG. 5 is a structural diagram of a bidirectional image-text search device provided in an embodiment of the present application in one implementation manner.
  • the device may include:
  • the image recognition module 501 is used to call the image recognition network of the pre-trained image-text bidirectional search model to obtain the target recognition features of the target image block contained in each sub-image of the image to be searched;
  • a text feature extraction module 502 is used to obtain text features of a text to be searched that contains only one type of target text data based on the text heterogeneous graph network of the image-text bidirectional search model; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
  • An image feature extraction module 503 is used to obtain image features of an image to be searched including a group of sub-images based on an image heterogeneous graph network of an image-text bidirectional search model; the original image features and target recognition features of the image to be searched are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;
  • the bidirectional search module 504 is used to input image features and text features into a pre-trained image-text bidirectional search model to obtain image-text search results;
  • the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
  • the above-mentioned text feature extraction module 502 can also be used to: obtain text features of the text to be searched that only contains one type of target text data, including: responding to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences; inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-category node features; inputting each text sentence into the text feature extraction model to obtain multiple second-category node features.
  • the above text feature extraction module 502 may also include a feature extraction unit for building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a transformer-based bidirectional encoder; the language representation model is trained using a natural language text sample data set, and the trained language representation model is used as the text feature extraction model.
  • the above text feature extraction module 502 may also include a position input unit for inputting the position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence into the text feature extraction model.
  • the above text feature extraction module 502 may also include an identification processing unit for obtaining the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model; the data type includes a first identification for identifying the target recognition feature, and a second identification for identifying the target text data.
  • the above text feature extraction module 502 may further include an edge connection determination unit, which is used to traverse each text sentence of the target text data in turn for each text phrase or text word in the target recognition feature; if the target phrase contained in the current text sentence is the same as the current text phrase, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text phrase have a connection relationship; if the target word contained in the current text sentence is the same as the current text word, then the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text word have a connection relationship.
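The containment rule in this unit can be sketched as a simple string-matching pass; the helper name and example strings are illustrative only.

```python
from typing import List, Tuple

def text_edges(phrases: List[str], sentences: List[str]) -> List[Tuple[int, int]]:
    """Connect a phrase/word node and a sentence node iff the sentence contains it."""
    edges = []
    for i, phrase in enumerate(phrases):        # first-category nodes
        for j, sent in enumerate(sentences):    # second-category nodes
            if phrase in sent:                  # containment relationship
                edges.append((i, j))
    return edges

# e.g. text_edges(["mango", "sugar"], ["Dice the mango.", "Add sugar."])
# -> [(0, 0), (1, 1)]
```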
  • the above image recognition module 501 can also be used to pre-train an image recognition network using a target training sample set that annotates corresponding target recognition features in an image sample containing multiple sub-images; input the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
  • the target recognition network structure includes an input layer, a convolution structure, a pooling layer and a classifier;
  • the convolution structure includes a basic operation component and a residual operation component;
  • the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence;
  • the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple layers of convolution layers, which are used to perform convolution calculations on the output features of the basic operation components;
  • the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
  • the classifier is used to calculate the target feature vector and output the probability of the category label.
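An illustrative sketch of this recognition structure; the channel counts, block counts and class count are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class BasicComponent(nn.Module):
    """Convolution -> regularization -> activation -> max pooling, in sequence."""
    def __init__(self, out_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, out_ch, 7, stride=2, padding=3),  # convolution
            nn.BatchNorm2d(out_ch),                        # regularization
            nn.ReLU(),                                     # activation function
            nn.MaxPool2d(3, stride=2, padding=1),          # max pooling
        )

    def forward(self, x):
        return self.net(x)

class ResidualBlock(nn.Module):
    """Multiple convolution layers with a residual connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.conv(x) + x)

recognizer = nn.Sequential(
    BasicComponent(64),
    ResidualBlock(64), ResidualBlock(64),   # residual operation component
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # pooling -> target feature vector
    nn.Linear(64, 10), nn.Softmax(dim=1),   # classifier: category label probabilities
)
probs = recognizer(torch.randn(2, 3, 224, 224))
```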
  • the above-mentioned text feature extraction module 502 may also include a graph operation unit, which is used for a text heterogeneous graph network including multiple layers of first graph attention networks, each layer of the first graph attention network being further integrated with a first fully connected layer. For each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node feature of the current text heterogeneous node is updated according to whether the current text heterogeneous node is connected to the remaining text heterogeneous nodes and according to the association relationships between the text heterogeneous nodes; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
  • the above graph operation unit can also be used to: determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type; calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node based on the association between the node features of the current text heterogeneous node and the node features of each target text heterogeneous node, and determine the weight value of the current text heterogeneous node according to each initial weight value; update the node feature of the current text heterogeneous node based on the weight value and each target text heterogeneous node, and use the sum of the node feature of the current text heterogeneous node after the update and the node feature before the update as the node feature of the current text heterogeneous node.
  • the above graph operation unit can be further used to: call the weight calculation relation to calculate the initial weight value of the current text heterogeneous node with each target text heterogeneous node respectively; in the weight calculation relation:
  • z_qp is the initial weight value between the qth text heterogeneous node and the pth text heterogeneous node;
  • LeakyReLU() is the activation function;
  • W_a, W_b and W_c are known dimensional matrices.
  • the above graph operation unit can be further used to: call the initial update relation to update the node feature of the current text heterogeneous node; in the initial update relation:
  • σ is a hyperparameter;
  • a_qp is the normalized weight between the qth step node feature and the pth component node feature;
  • W_v is a known dimensional matrix;
  • N_P is the total number of target text heterogeneous nodes.
  • the above-mentioned text feature extraction module 502 may further include a timing feature extraction unit, which is used, when the second-category node features have a sequential execution order, to input each second-category node feature together with the sequence information into a pre-trained timing feature extraction model to obtain timing information features; the timing information features are mapped into the text features through a fully connected layer.
  • the above-mentioned timing feature extraction unit can be further used to: based on the sequence between each second-category node feature, input each second-category node feature into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the timing coding feature of each second-category node feature; determine the timing information feature according to the timing coding feature of each second-category node feature.
  • the above time series feature extraction unit can be further used to: for each second-category node feature, call the forward-order coding relation to encode the current second-category node feature in forward order and obtain a forward-order coding feature;
  • the reverse-order coding relation is then called to encode the current second-category node feature in reverse order to obtain the reverse-order coding feature;
  • the forward-order coding feature and the reverse-order coding feature are used as the temporal coding features of the current second-category node feature;
  • in these relations, q ∈ [1, Q] indexes the output of the qth unit of the bidirectional long short-term memory neural network in the forward encoding direction, Q is the total number of second-category node features, and the backward and forward encoding functions are those of the bidirectional long short-term memory neural network.
  • the above-mentioned image feature extraction module 503 can also be used as follows: the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is further integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node features of the current image heterogeneous node are updated according to whether the current image heterogeneous node is connected to the remaining image heterogeneous nodes and according to the association relationships between the image heterogeneous nodes; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image encoding features of the image to be searched are generated; the image encoding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  • FIG. 6 is a structural diagram of a training device for an image-text matching model provided in an embodiment of the present application in one implementation manner, and the device may include:
  • the feature extraction module 601 is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set; the target text features include the target recognition features; the image samples include a group of sub-images;
  • Model building module 602 used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining connecting edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining connecting edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
  • the model training module 603 is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  • the functions of the various functional modules of the image-text bidirectional search device and the image-text matching model training device in the embodiment of the present application can be implemented according to the method in the above-mentioned method embodiment.
  • the implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.
  • FIG. 7 is a structural schematic diagram of the image-text bidirectional search device provided in an embodiment of the present application under one implementation.
  • the image-text bidirectional search device may include a memory 70 for storing computer programs; a processor 71 for implementing the steps of the image-text bidirectional search method and the image-text matching model training method mentioned in any of the above embodiments when executing the computer program.
  • the human-computer interaction component 72 is used to receive the training sample set selection request, model training request, search request input by the user through the information input/information output interface, and to display the image-text search results to the user;
  • the communication component 73 is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task.
  • the processor 71 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • the processor 71 may also be a controller, a microcontroller, a microprocessor or other data processing chip.
  • the processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 71 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the awake state, also known as CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 71 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 70 may include one or more computer-readable storage media, and the computer non-volatile readable storage media may be non-transitory.
  • the memory 70 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices.
  • the memory 70 may be an internal storage unit of the image-text bidirectional search device, such as a hard disk of a server.
  • the memory 70 may also be an external storage device of the image-text bidirectional search device, such as a plug-in hard disk equipped on a server, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the memory 70 may also include both an internal storage unit and an external storage device of the image-text bidirectional search device.
  • the memory 70 may not only be used to store application software and various types of data installed in the image-text bidirectional search device, such as: the code of the program used and generated in the process of executing the image-text bidirectional search and the training process of the image-text matching model, but also be used to temporarily store data that has been output or is to be output.
  • the memory 70 is at least used to store the following computer program 701, wherein, after the computer program is loaded and executed by the processor 71, it can implement the relevant steps of the image-text bidirectional search method and the image-text matching model training method disclosed in any of the aforementioned embodiments.
  • the resources stored in the memory 70 may also include an operating system 702 and data 703, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 702 may include Windows, Unix, Linux, etc.
  • the data 703 may include but is not limited to the data generated during the image-text bidirectional search process and the image-text matching model training process, as well as the data corresponding to the bidirectional search results, etc.
  • the human-computer interaction component 72 may include a display screen, an information input/output interface such as a keyboard or a mouse.
  • the display screen and the information input/output interface belong to the user interface.
  • the optional user interface may also include a standard wired interface, a wireless interface, etc.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display may also be appropriately referred to as a display screen or a display unit, which is used to display information processed in the mutual retrieval device and to display a visual user interface.
  • the communication component 73 may include a communication interface or a network interface, a communication bus, etc.
  • the communication interface may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the image and text two-way search device and other devices.
  • the communication bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • the bus is represented in FIG. 7 by only one thick line, but this does not mean that there is only one bus or only one type of bus.
  • the mutual search device may also include a power supply 74 and a sensor 75 for implementing various functions.
  • the structure shown in FIG. 7 does not constitute a limitation on the image-text bidirectional search device, which may include more or fewer components than shown in the figure.
  • this embodiment does not limit the number of image-text bidirectional search devices; the image-text bidirectional search method and/or the image-text matching model training method may be jointly completed by multiple image-text bidirectional search devices.
  • Figure 8 is a schematic diagram of a hardware composition framework applicable to another method for training an image-text bidirectional search model and/or a method for training an image-text matching model provided in an embodiment of the present application.
  • the hardware composition framework may include: a first image-text bidirectional search device 81 and a second image-text bidirectional search device 82, which are connected via a network.
  • the hardware structure of the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can refer to the electronic device in FIG7. That is, it can be understood that there are two electronic devices in this embodiment, and the two exchange data.
  • the trained image-text bidirectional search model shown in FIG9 can be pre-deployed in any device.
  • the embodiment of the present application does not limit the form of the network, that is, the network can be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.
  • the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can be the same electronic device, such as the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 are both servers; they can also be different types of electronic devices, for example, the first image-text bidirectional search device 81 can be a smart phone or other smart terminal, and the second image-text bidirectional search device 82 can be a server.
  • the model training process and the trained image-text bidirectional search model can be pre-deployed on the end with high computing performance.
  • a server with strong computing power can be used as the second image-text bidirectional search device 82 to improve data processing efficiency and reliability, thereby improving the processing efficiency of model training and/or image-text bidirectional retrieval.
  • a low-cost and widely used smart phone is used as the first image-text bidirectional search device 81 to realize the interaction between the second image-text bidirectional search device 82 and the user.
  • the interaction process can be, for example, that the smart phone obtains a training sample set from the server, obtains the labels of the training sample set, sends these labels to the server, and the server uses the obtained labels to perform subsequent model training steps.
  • after generating the image-text bidirectional search model, the server obtains the search request sent by the smart phone.
  • the search request is issued by the user and carries the data to be searched.
  • the server determines the data to be searched by parsing the search request, and calls the image-text bidirectional search model to perform corresponding processing on the data to be searched, obtains the corresponding search results, and feeds back the search results to the first image-text bidirectional search device 81.
  • if the image-text bidirectional search method in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium and executes all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned non-volatile storage medium includes various media that can store program codes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a multimedia card, a card-type memory (such as SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk or an optical disk, etc.
  • an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method described in any of the above embodiments are performed.
  • each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
  • the same or similar parts between the embodiments can be referred to each other.
  • as for the devices and equipment disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and for the relevant parts reference can be made to the description of the method part.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are an image-text bidirectional search method, apparatus and device, and a non-volatile readable storage medium, which are applied to the technical field of information retrieval. The method comprises: pre-training an image-text bidirectional search model, which comprises a text heterogeneous graph network, an image heterogeneous graph network, and an image recognition network; calling the image recognition network to acquire target recognition features of an image to be searched for; on the basis of the text heterogeneous graph network, acquiring text features and target text features of text to be searched for, wherein the text heterogeneous graph network is constructed by taking the target text features and the target recognition features as nodes; acquiring image features of said image on the basis of the image heterogeneous graph network, wherein the image heterogeneous graph network is constructed by taking original image features and the target recognition features of said image as nodes; and inputting the image features and the text features into the image-text bidirectional search model, so as to obtain an image-text search result.

Description

Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the China Patent Office on November 8, 2022, with application number 202211388778.5 and entitled "Image-text bidirectional search and matching model training method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of information retrieval technology, and in particular to an image-text bidirectional search method, apparatus, device and non-volatile readable storage medium.
Background Art
As computer technology and network technology are widely used in daily work and life, the amount and variety of data are increasing day by day. Information expressing the same object circulates in different media and exists in different data formats, such as image data, text data, audio data and video data. For example, for the same server, its physical parameters and performance information can be described with text data and published on a web page, or described directly in video form and published on a video website. Accordingly, users may want to retrieve all relevant data in different formats based on a target search term such as a server model, or to retrieve, based on data of one format, corresponding data of other types, that is, bidirectional search between different types of data.
Related technologies usually implement image-text mutual retrieval based on the attention mechanism, which uses attention to weight the extracted image features into the text features, reconstructs the text features, and enhances the similarity between text and image. Although this method can use attention to reconstruct electronic text features, it simply applies one-way attention from natural images to electronic text when reconstructing the electronic text features. Since there is a correspondence between natural images and electronic text, the mutually corresponding high-order features influence each other; reconstructing only the electronic text features while ignoring the natural image features means the natural image features cannot accurately correspond to the electronic text features, which affects image-text mutual retrieval. Moreover, it cannot obtain the joint features produced when features of different modalities interact; for data involving sequence or dependency relationships, such as step-based retrieval tasks, this leads to low retrieval accuracy between images and text.
In view of this, how to improve the accuracy of bidirectional search between image data and text data is a technical problem that those skilled in the art need to solve.
Summary of the Invention
The present application provides an image-text bidirectional search method, apparatus, device and non-volatile readable storage medium, which effectively improve the accuracy of bidirectional search between image data and text data.
To solve the above technical problems, the embodiments of the present application provide the following technical solutions:
A first aspect of the embodiments of the present application provides an image-text bidirectional search method, including:
pre-training an image-text bidirectional search model, the image-text bidirectional search model including a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network;
calling the image recognition network to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched;
based on the text heterogeneous graph network, obtaining the text features of the text to be searched that contains only one type of target text data; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
based on the image heterogeneous graph network, obtaining the image features of the image to be searched that includes a group of sub-images; the original image features and target recognition features of the image to be searched serve as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image features;
inputting the image features and the text features into the image-text bidirectional search model to obtain the image-text search results.
Optionally, after pre-training the image-text bidirectional search model, the method further includes:
in response to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences;
inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-category node features;
inputting each text sentence into the text feature extraction model to obtain multiple second-category node features.
Optionally, before obtaining the text features of the text to be searched that contains only one type of target text data, the method further includes:
building a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a transformer-based bidirectional encoder;
training the language representation model with a natural language text sample data set, and using the trained language representation model as the text feature extraction model.
Optionally, inputting each text sentence into the text feature extraction model includes:
inputting each text sentence, together with the position information of each phrase and each word contained in each text sentence within the current text sentence, into the text feature extraction model.
Optionally, before inputting each text phrase and/or text word into the pre-built text feature extraction model to obtain multiple first-category node features, and before inputting each text sentence into the text feature extraction model to obtain multiple second-category node features, the method further includes:
obtaining the data type of the data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model;
the data type includes a first identifier for identifying the target recognition features and a second identifier for identifying the target text data.
Optionally, the connection edges of the text heterogeneous graph network being determined by the inclusion relationship between the target recognition features and the target text features includes:
for each text phrase or text word in the target recognition features, traversing each text sentence of the target text data in turn;
if a target phrase contained in the current text sentence is the same as the current text phrase, the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text phrase have a connection relationship;
if a target word contained in the current text sentence is the same as the current text word, the second-category node feature corresponding to the current text sentence and the first-category node feature corresponding to the current text word have a connection relationship.
Optionally, obtaining the target recognition features of the target image blocks contained in each sub-image of the image to be searched includes:
pre-training an image recognition network with a target training sample set in which the corresponding target recognition features are annotated in image samples containing multiple sub-images;
inputting the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
Optionally, before training the image recognition network with the target training sample set in which the corresponding target recognition features are annotated in image samples containing multiple sub-images, the method further includes:
pre-building a target recognition network structure, the target recognition network structure including an input layer, a convolution structure, a pooling layer and a classifier;
the convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes multiple connected residual blocks, each residual block including multiple convolution layers, which are used to perform convolution calculations on the output features of the basic operation component;
the pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
the classifier is used to calculate the target feature vector and output the probability of the category label to which it belongs.
Optionally, the text heterogeneous graph network includes multiple layers of first graph attention networks, and a first fully connected layer is further integrated after each layer of the first graph attention network; obtaining the text features of the text to be searched that contains only one type of target text data includes:
for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node is connected to the remaining text heterogeneous nodes and according to the association relationships between the text heterogeneous nodes;
generating the text features of the text to be searched based on the node features of each text heterogeneous node of the updated text heterogeneous graph network.
Optionally, updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node is connected to the remaining text heterogeneous nodes and the association relationships between the text heterogeneous nodes includes:
determining target text heterogeneous nodes that are connected to the current text heterogeneous node and are not of the same node type;
calculating the initial weight value of the current text heterogeneous node with each target text heterogeneous node based on the association between the node feature of the current text heterogeneous node and the node features of the target text heterogeneous nodes, and determining the weight value of the current text heterogeneous node according to each initial weight value;
updating the node feature of the current text heterogeneous node based on the weight value and each target text heterogeneous node, and taking the sum of the node feature of the current text heterogeneous node after the update and the node feature before the update as the node feature of the current text heterogeneous node.
Optionally, calculating the initial weight value of the current text heterogeneous node with each target text heterogeneous node based on the association between the node feature of the current text heterogeneous node and the node features of the target text heterogeneous nodes includes:
calling the weight calculation relation to calculate the initial weight value of the current text heterogeneous node with each target text heterogeneous node respectively; the weight calculation relation is:
$$ z_{qp} = \mathrm{LeakyReLU}\big( W_a \left[ W_b e_q \,\|\, W_c e_p \right] \big) $$
where $z_{qp}$ is the initial weight value between the qth text heterogeneous node and the pth text heterogeneous node, LeakyReLU() is the activation function, $W_a$, $W_b$ and $W_c$ are known dimensional matrices, $e_q$ is the node feature of the qth text heterogeneous node, and $e_p$ is the node feature of the pth text heterogeneous node.
Optionally, updating the node feature of the current text heterogeneous node based on the weight value and each target text heterogeneous node includes:
calling the initial update relation to update the node feature of the current text heterogeneous node; the initial update relation is:
$$ \tilde{e}_q = \sigma \Big( \sum_{p=1}^{N_P} a_{qp} \, W_v \, e_p \Big) $$
where $\tilde{e}_q$ is the updated node feature of the qth text heterogeneous node, $\sigma$ is a hyperparameter, $a_{qp}$ is the normalized weight between the qth step node feature and the pth component node feature, $W_v$ is a known dimensional matrix, $e_p$ is the node feature of the pth text heterogeneous node, and $N_P$ is the total number of target text heterogeneous nodes.
Optionally, the second-category node features corresponding to the target text data have a sequential execution order; after obtaining, based on the text heterogeneous graph network, the text features of the text to be searched that contains only one type of target text data, the method further includes:
inputting each second-category node feature and the sequence information into a pre-trained timing feature extraction model to obtain timing information features;
mapping the timing information features into the text features through a fully connected layer.
Optionally, inputting each second-category node feature and the sequence information into the pre-trained timing feature extraction model to obtain the timing information features includes:
based on the sequence between the second-category node features, inputting the second-category node features into a bidirectional long short-term memory neural network in order and in reverse order to obtain the temporal coding feature of each second-category node feature;
determining the timing information features according to the temporal coding feature of each second-category node feature.
Optionally, inputting the second-class node features into the bidirectional LSTM network in forward order and in reverse order in turn to obtain the temporal encoding feature of each second-class node feature includes:

for each second-class node feature, calling a forward encoding relation to encode the current second-class node feature in forward order, obtaining a forward encoding feature; the forward encoding relation is:

$$\overrightarrow{h}_q = \overrightarrow{\mathrm{LSTM}}\big(s_q^{(T)},\ \overrightarrow{h}_{q-1}\big)$$

calling a reverse encoding relation to encode the current second-class node feature in reverse order, obtaining a reverse encoding feature; the reverse encoding relation is:

$$\overleftarrow{h}_q = \overleftarrow{\mathrm{LSTM}}\big(s_q^{(T)},\ \overleftarrow{h}_{q+1}\big)$$

taking the forward encoding feature and the reverse encoding feature as the temporal encoding feature of the current second-class node feature;

where $q\in[1,Q]$, $\overrightarrow{h}_q$ is the output of the $q$-th unit in the forward encoding direction of the bidirectional LSTM network, $s_q^{(T)}$ is the $q$-th second-class node feature of the $T$-th graph attention layer in the text heterogeneous graph network, $\overrightarrow{h}_{q-1}$ is the output of the $(q-1)$-th unit in the forward encoding direction, $Q$ is the total number of second-class node features, $\overleftarrow{h}_q$ is the output of the $q$-th unit in the reverse encoding direction, $\overleftarrow{h}_{q+1}$ is the output of the $(q+1)$-th unit in the reverse encoding direction, $\overleftarrow{\mathrm{LSTM}}(\cdot)$ is the reverse encoding function of the bidirectional LSTM network, and $\overrightarrow{\mathrm{LSTM}}(\cdot)$ is the forward encoding function of the bidirectional LSTM network.
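A sketch of this bidirectional encoding in PyTorch follows: nn.LSTM with bidirectional=True runs exactly the two passes defined above, the forward pass consuming the node features in execution order and the backward pass in reverse order. Dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 128                               # second-class node feature dimension (assumed)
Q = 6                                 # number of second-class node features (assumed)
bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)

s = torch.randn(1, Q, d)              # node features s_q in execution order
h, _ = bilstm(s)                      # h: (1, Q, 2*d)
h_fwd, h_bwd = h[..., :d], h[..., d:] # forward and reverse encoding features per unit
```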
Optionally, the image heterogeneous graph network includes multiple layers of second graph attention networks, and a second fully connected layer is further integrated after each layer of the second graph attention network; obtaining the image features of the image to be searched, which includes a group of sub-images, includes:

inputting the image to be searched into a pre-trained image feature extraction model to obtain the original image features of the image to be searched;

for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, updating the node feature of the current image heterogeneous node according to whether the current image heterogeneous node is connected to each of the remaining image heterogeneous nodes and the association relationships among the image heterogeneous nodes;

generating image encoding features of the image to be searched based on the node features of each image heterogeneous node of the updated image heterogeneous graph network;

inputting the image encoding features into a pre-trained image feature generation model to obtain the image features of the image to be searched.
A second aspect of the embodiments of the present application provides an image-text bidirectional search apparatus, including:

an image recognition module, configured to call the image recognition network of a pre-trained image-text bidirectional search model to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched;

a text feature extraction module, configured to obtain, based on the text heterogeneous graph network of the image-text bidirectional search model, the text features of the text to be searched that contains only one class of target text data; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features serve as node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;

an image feature extraction module, configured to obtain, based on the image heterogeneous graph network of the image-text bidirectional search model, the image features of the image to be searched that includes a group of sub-images; the original image features and the target recognition features of the image to be searched serve as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between each target recognition feature and the original image features;

a bidirectional search module, configured to input the image features and the text features into the pre-trained image-text bidirectional search model to obtain image-text search results; the image-text bidirectional search model includes the text heterogeneous graph network, the image heterogeneous graph network and the image recognition network.
A third aspect of the embodiments of the present application provides a training method for an image-text matching model, including:

building an image-text bidirectional search model in advance;

for each group of training samples in the training sample set, separately obtaining the original image features, target recognition features and image features of the image samples, and the target text features and text features of the text samples in the current group of training samples; the target text features include the target recognition features; each image sample includes a group of sub-images;

constructing the text heterogeneous graph network of the image-text bidirectional search model by taking the target recognition features and the target text features as text heterogeneous node features and determining connection edges according to the inclusion relationship between the target recognition features and the target text features;

constructing the image heterogeneous graph network of the image-text bidirectional search model by taking the original image features and the target recognition features as image heterogeneous node features and determining connection edges according to the association relationship between each target recognition feature and the original image features;

inputting the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network, to train the image-text bidirectional search model.
A fourth aspect of the embodiments of the present application provides a training apparatus for an image-text matching model, including:

a feature extraction module, configured to obtain, for each group of training samples in the training sample set, the original image features, target recognition features and image features of the image samples and the target text features and text features of the text samples in the current group of training samples; the target text features include the target recognition features; each image sample includes a group of sub-images;

a model building module, configured to build an image-text bidirectional search model in advance; construct the text heterogeneous graph network of the image-text bidirectional search model by taking the target recognition features and the target text features as text heterogeneous node features and determining connection edges according to the inclusion relationship between the target recognition features and the target text features; and construct the image heterogeneous graph network of the image-text bidirectional search model by taking the original image features and the target recognition features as image heterogeneous node features and determining connection edges according to the association relationship between each target recognition feature and the original image features;

a model training module, configured to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network, to train the image-text bidirectional search model.
A fifth aspect of the embodiments of the present application further provides an image-text bidirectional search device, including a processor, a memory, a human-computer interaction component and a communication component;

the human-computer interaction component is configured to receive, through an information input/information output interface, the training sample set selection requests, model training requests and search requests input by users, and to present the image-text search results to users;

the communication component is configured to transmit data and instructions during the training of the image-text matching model and during the execution of the image-text bidirectional search task;

the processor is configured to implement, when executing the computer program stored in the memory, the steps of any one of the foregoing image-text bidirectional search methods and/or the foregoing training method for an image-text matching model.

A sixth aspect of the embodiments of the present application further provides a non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the foregoing image-text bidirectional search methods and/or the foregoing training method for an image-text matching model are implemented.
The advantage of the technical solutions provided by the present application is that graph neural networks for extracting the corresponding features are constructed based on the data contained in a text that contains only one class of text data and in an image that contains a group of sub-images, together with their internal relationships. This facilitates extracting text features that reflect real-world text and its internal associations, and image features that reflect real-world images and their internal associations. Performing model training on the extracted text features and image features helps fully mine the associations between fine-grained image and text features, thereby obtaining a high-precision image-text bidirectional retrieval model and effectively improving the mutual retrieval accuracy of image data and text data.

In addition, for the image-text bidirectional search method, the embodiments of the present application further provide a training method for the image-text matching model and a corresponding implementation apparatus, an image-text bidirectional search device and a non-volatile readable storage medium, which further make the image-text bidirectional search method more practical; the method, apparatus, device and non-volatile readable storage medium have corresponding advantages.

It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions of the embodiments of the present application or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of an image-text bidirectional search method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a text heterogeneous graph network structure provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an image heterogeneous graph network structure provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a training method for an image-text matching model provided by an embodiment of the present application;

FIG. 5 is a structural diagram of an implementation of a cross-media retrieval apparatus provided by an embodiment of the present application;

FIG. 6 is a structural diagram of an implementation of a training apparatus for an image-text matching model provided by an embodiment of the present application;

FIG. 7 is a structural diagram of an implementation of an image-text bidirectional search device provided by an embodiment of the present application;

FIG. 8 is a structural diagram of another implementation of an image-text bidirectional search device provided by an embodiment of the present application;

FIG. 9 is a schematic framework diagram of an exemplary application scenario provided by an embodiment of the present application.
DETAILED DESCRIPTION

To enable those skilled in the art to better understand the solutions of the present application, the present application is further described in detail below with reference to the drawings and implementations. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

The terms "first", "second", "third", "fourth" and the like in the specification, the claims and the above drawings of the present application are used to distinguish different objects rather than to describe a specific order. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.

Having introduced the technical solutions of the embodiments of the present application, various non-limiting implementations of the present application are described in detail below.
Referring first to FIG. 1, FIG. 1 is a schematic flowchart of an image-text bidirectional search method provided by an embodiment of the present application; the embodiment of the present application may include the following contents:

S101: pre-train an image-text bidirectional search model.
The image-text bidirectional search model of this embodiment is used to perform bidirectional search tasks between text data and image data: matching image data can be determined from a known image database based on the text data to be searched, and matching text data can likewise be determined from a known text database based on the image data to be searched. The image-text bidirectional search model of this embodiment includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network. The text heterogeneous graph network processes input text data, such as text samples or the text to be searched, and finally outputs the text features corresponding to that text data; the image heterogeneous graph network processes input image data, such as image samples or the image to be searched, and outputs the final image features of that image data. The text heterogeneous graph network and the image heterogeneous graph network may be built on any graph structure in any technology, which does not affect the implementation of the present application. The image recognition network is used to identify the category information of certain image blocks in an image, such as the image to be searched or the image samples used during model training; that is, it finally outputs the recognition label information corresponding to the specified recognition targets contained in the input image, which, for ease of description, is referred to as the target recognition features.
S102: call the image recognition network to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched.

The image to be searched of this embodiment, like the subsequent image samples, includes a group of sub-images; that is, a group of sub-images together constitutes the image to be searched. Taking a recipe-step image as the image to be searched, each step corresponds to one sub-image, and the recipe-step image is composed of the sub-images corresponding to the steps. The image blocks in the image to be searched that contain a certain class of specified information of the corresponding text data are called target image blocks, and the recognition information of these target image blocks constitutes the target recognition features; that is, the target recognition features are the label information of the target image blocks in the image to be searched or in the image samples, and the label information belongs to that class of specified information. Taking recipe cooking-step text and recipe-step images as an example, the specified information may be recipe ingredients: the target image blocks are the image blocks identifying the recipe ingredients, and the target recognition features identify the ingredient information to which each target image block belongs. Taking an electronic-device manual and its illustration images as an example, the specified information is the product structure of the electronic device: the target image blocks are the image blocks identifying the product structure, and the target recognition features are the recognition information indicating that a target image block belongs to a certain class of product structure, such as a power button or an indicator light.
S103: based on the text heterogeneous graph network, obtain the text features of the text to be searched that contains only one class of target text data.

The texts of the present application, including the text to be searched and the text samples in the training sample set used in the subsequent model training process, contain only one class of text data. One class of text data means that all data in the text is of the same type. Taking recipe text as an example, a recipe text may include three classes of text data, namely dish name, recipe ingredients and cooking steps; the text to be searched and the text samples of the present application may contain only one of these classes. Taking a server working-principle document as an example, such text may include two classes of text data, namely the structural composition of the server and its working principle; the text to be searched and the text samples may contain only one of them, for example only the working principle of the server. After the trained model is obtained in the previous step, the corresponding text features are obtained by computing the text heterogeneous graph network based on the text to be searched. The text features of this embodiment are the features obtained by performing graph-structure operations on the text heterogeneous graph network, whereas the target text features are the data obtained by directly extracting features from the text to be searched with a text feature extraction method. There is an inclusion relationship between the target text features of this step and the target recognition features obtained in the previous step. For ease of description, the target text features corresponding to the target text data are defined to include the target recognition features, where inclusion means that the target recognition features all exist within the target text features corresponding to the target text data. Taking recipe text as an example, the target recognition features represent recipe ingredients and the target text features represent cooking steps; taking an electronic-device manual as an example, the target recognition features may be the product structure of the electronic device and the target text features may be the instructions for use.

The target recognition features are composed of the recognition features corresponding to the multiple target image blocks of each sub-image. For ease of description and without ambiguity, in the process of constructing the text heterogeneous graph network, the recognition feature of each target image block of each sub-image may be called a first-class node feature; the target text features are composed of multiple text features, each of which is called a second-class node feature. For a given first-class node feature, if it is included in some second-class node feature, then that first-class node feature and that second-class node feature have an association relationship. After the target text features of the text to be searched and the target recognition features of the image to be searched are obtained, by analyzing each second-class node feature of the target text features and judging whether it contains one or several first-class node features of the target recognition features, the association relationship between the target recognition features and the target text features can be determined. After the target text features and the target recognition features are obtained, these two different types of features serve as the heterogeneous node features of the graph-structure network, and the connection edges of the graph-structure network can be determined according to whether an inclusion relationship exists between different node features; that is, the target recognition features and the target text features are the node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features. After the text feature information of the text to be searched and the image recognition information of the image to be searched are substituted into the text heterogeneous graph network, the features corresponding to the graph structure can be extracted by performing graph-structure operations; these features serve as the text features of this step.
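To illustrate how the inclusion relationship determines the connection edges, the following is a small sketch; the ingredient and step strings are hypothetical, and simple whole-word matching stands in for whatever text comparison method is actually used.

```python
# Link each first-class node (recognized phrase/word) to every second-class
# node (sentence) whose text contains it; each pair becomes a connection edge.
ingredients = ["mango", "sugar"]                       # first-class nodes
steps = ["peel and slice the mango",                   # second-class nodes
         "sprinkle sugar over the mango slices",
         "chill before serving"]

edges = [(p, q) for p, ing in enumerate(ingredients)
                for q, step in enumerate(steps)
                if ing in step.split()]
print(edges)   # [(0, 0), (0, 1), (1, 1)]
```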
S104: based on the image heterogeneous graph network, obtain the image features of the image to be searched that includes a group of sub-images.

The image heterogeneous graph network of this step likewise includes nodes and connection edges. Its nodes are heterogeneous nodes; that is, there exist at least two kinds of features that differ in nature and structure. For an image, the extracted image features alone can serve as only one kind of node feature; since image features and text features have an associated correspondence, the target recognition features extracted in S102 can serve as node features of the image heterogeneous graph network. Considering that each first-class node feature of the target recognition features is included in the second-class node features of the target text features, the first-class node features can serve as heterogeneous node features of the image heterogeneous graph network; that is, the original image features of the image to be searched and the target recognition features serve as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features. Original image features are image features extracted directly with an image feature method, such as a convolutional neural network, VGG16 (Visual Geometry Group network) or ResNet (deep residual network). The image features of this step are the features obtained by substituting the image features of each sub-image of the image to be searched into the image heterogeneous graph network and performing graph-structure operations on the image heterogeneous graph network.
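A brief sketch of assembling the image heterogeneous graph may help: one node per sub-image carrying its original image feature, one node per recognized ingredient carrying its target recognition feature, and an edge linking each ingredient node to the sub-image in which it was recognized. All shapes, counts and indices below are illustrative assumptions.

```python
import torch

d = 128
sub_image_feats = torch.randn(4, d)          # original features of 4 sub-images
ingredient_feats = torch.randn(3, d)         # recognition features of 3 labels
recognized_in = {0: [0], 1: [0, 2], 2: [3]}  # ingredient index -> sub-image indices

# Stack both node types into one heterogeneous node set; edges follow the
# association between recognition features and original image features.
nodes = torch.cat([sub_image_feats, ingredient_feats], dim=0)
edges = [(4 + ing, img) for ing, imgs in recognized_in.items() for img in imgs]
print(nodes.shape, edges)  # torch.Size([7, 128]) [(4, 0), (5, 0), (5, 2), (6, 3)]
```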
S105: input the image features and the text features into the image-text bidirectional search model to obtain the image-text search results.

The image-text search result of this embodiment refers to the degree of matching between the text features extracted in S103 and the image features extracted in S104. That is, after the text features and the image features are input into the image-text bidirectional search model, the model can determine whether the features are close to each other by computing a vector distance, such as the Euclidean distance. If they are close, the image to be searched and the text to be searched match, i.e., they are a pair of mutually corresponding data; if they are not close, the image to be searched and the text to be searched do not match.
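As a sketch of this matching decision, the two feature vectors can be compared by Euclidean distance and the pair accepted when the distance falls below a threshold; the threshold value here is an illustrative assumption, and for retrieval the database candidates would simply be ranked by this distance.

```python
import torch

def is_match(image_feat: torch.Tensor, text_feat: torch.Tensor,
             threshold: float = 1.0) -> bool:
    # Close in feature space -> the image and the text are treated as a
    # mutually corresponding pair; far apart -> they do not match.
    return torch.dist(image_feat, text_feat, p=2).item() < threshold
```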
In the technical solutions provided by the embodiments of the present application, graph neural networks for extracting the corresponding features are constructed based on the data contained in the text and in the image, together with their internal relationships. This facilitates extracting text features that reflect real-world text and its internal associations, and image features that reflect real-world images and their internal associations. Performing model training on the extracted text features and image features helps fully mine the associations between fine-grained image and text features, thereby obtaining a high-precision image-text bidirectional retrieval model and effectively improving the mutual retrieval accuracy of image data and text data.
The above embodiments place no limitation on how the target recognition features are extracted. Based on the above embodiments, the present application further provides an optional extraction implementation of the target recognition features, which may include:

training an image recognition network in advance using a target training sample set in which the corresponding target recognition features are annotated on image samples each containing multiple sub-images; and inputting the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
In this embodiment, the image recognition network is used to identify the category information of the target image blocks in the image to be searched. The target training sample set contains multiple images annotated with target features; that is, every image sample in the target training sample set carries a category label. Each image may be obtained directly from the original database, or may be obtained by flipping, cropping, stretching or otherwise transforming an original image, none of which affects the implementation of the present application. The image recognition network may be built on any existing model structure capable of recognizing image categories, such as a convolutional neural network or an artificial neural network, which the present application does not limit. As an optional implementation, the target recognition network structure may include an input layer, a convolution structure, a pooling layer and a classifier. The convolution structure includes a basic operation component and a residual operation component: the basic operation component sequentially performs convolution, regularization, activation-function and max-pooling processing on the input image; the residual operation component includes multiple connected residual blocks, each including multiple convolution layers, for performing convolution calculations on the output features of the basic operation component. The pooling layer converts the output features of the convolution structure into a target feature vector and feeds it to the classifier; the classifier computes on the target feature vector and outputs the probability of the category label to which it belongs.
To make the technical solutions of the present application clearer to those skilled in the art, the present application takes recipe text and recipe images as an example to describe the implementation process of this embodiment; that is, the process of classifying the main ingredients of each recipe image through an image classification network and constructing ingredient nodes from the classified category information may include:

First, a step-image dataset is generated from multiple recipe step images, and the main ingredients of some recipe step images are annotated, for example flour, sugar and papaya. The annotated recipe step images are used to train a ResNet50 network to classify the main ingredients of the images. The ResNet50 network structure may include seven parts: the first part contains no residual block and mainly performs convolution, regularization, activation-function and max-pooling computation on the input; the second, third, fourth and fifth parts all contain residual blocks, each residual block containing three convolution layers. After the convolution computation of the first five parts, the pooling layer converts the result into a feature vector, and finally the classifier computes on this feature vector and outputs the category probabilities. The trained ResNet50 network can obtain the main-ingredient information of the input image well.
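The classifier just described can be sketched with torchvision's stock ResNet50, replacing the final fully connected layer with an ingredient classification head; the class count and input size are assumptions, and fine-tuning details are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

num_ingredients = 50                          # assumed number of ingredient labels
model = models.resnet50(weights=None)         # stem + residual stages + pooling
model.fc = nn.Linear(model.fc.in_features, num_ingredients)  # classifier head

x = torch.randn(1, 3, 224, 224)               # one recipe step image
probs = torch.softmax(model(x), dim=1)        # probability per ingredient class
```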
It can be understood that obtaining the second-class text features of the target text features from the text to be searched requires a text feature extraction operation. The above embodiments place no limitation on how text features are extracted from the text to be searched. Based on the above embodiments, the present application further provides an optional implementation of text feature extraction, which may include the following contents:

in response to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences; inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-class node features; and inputting each text sentence into the text feature extraction model to obtain multiple second-class node features.
The text splitting instruction is used to split the text to be searched into multiple text sentences and the target recognition features into multiple text phrases or text words; any text data splitting algorithm may be used. For this implementation, correspondingly, each connection edge in the text heterogeneous graph network may be determined as follows: for each text phrase or text word in the target recognition features, traverse every text sentence of the target text data in turn; if the target phrase contained in the current text sentence is the same as the current text phrase, the second-class node feature corresponding to the current text sentence and the first-class node feature corresponding to the current text phrase are connected; likewise, if the target word contained in the current text sentence is the same as the current text word, the second-class node feature corresponding to the current text sentence and the first-class node feature corresponding to the current text word are connected. The text feature extraction model of this embodiment is used to extract text features from the input text data or the target recognition features. As an optional implementation, the training process of the text feature extraction model is as follows: build a language representation model including a text information input layer, a feature extraction layer and a text feature output layer, where the feature extraction layer is a transformer-based bidirectional encoder; train the language representation model on a natural-language text sample dataset, and use the trained language representation model as the text feature extraction model. The language representation model may be, for example, BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) or word2vec (word to vector, a word vector model), neither of which affects the implementation of the present application. After the trained text feature extraction model is obtained, to further improve text feature extraction accuracy, a data type may also be set for the text data, including a first identifier marking the target recognition features and a second identifier marking the target text data, i.e., the target text features. While inputting the text to be searched into the text feature extraction model, the data type of the data to be input into the model at the next moment is obtained; each text sentence, together with the position information of each phrase and each word within its current text sentence, may also be input into the text feature extraction model. The data type is input into the text feature extraction model together with the corresponding data.
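A sketch of extracting node features with a pre-trained BERT encoder follows, assuming the HuggingFace transformers package. The data type described above is carried in token_type_ids (mapped to 0/1 here because BERT's token-type vocabulary has two entries), position information is added by the model itself, and the [CLS] vector serves as the node feature.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def node_feature(text: str, data_type: int) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")
    # Mark every token with the data type (step vs. ingredient/recognition).
    enc["token_type_ids"] = torch.full_like(enc["token_type_ids"], data_type)
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0]        # [CLS] encoding as the node feature

step_node = node_feature("peel and slice the mango", data_type=1)
ingredient_node = node_feature("mango", data_type=0)
```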
It can be understood that extracting the target text data of the text to be searched yields multiple second-class text features. For second-class text features that have a sequential execution order, or for scenarios in which the second-class text features have sequential dependencies, in order to further extract text features that fit the actual text, the present application additionally performs temporal feature extraction and provides a temporal feature extraction method, which may include the following contents:

If the second-class node features have a sequential execution order, input the second-class node features together with the order information into a pre-trained temporal feature extraction model to obtain the temporal information features. Optionally, the temporal feature extraction model may be a bidirectional LSTM network; correspondingly, based on the order among the second-class node features, the second-class node features may be input into the bidirectional LSTM network in forward order and in reverse order in turn, to obtain the temporal encoding feature of each second-class node feature, and the temporal information features are determined according to the temporal encoding feature of each second-class node feature. Optionally, for each second-class node feature, the temporal encoding feature may include a forward encoding feature and a reverse encoding feature; to integrate the temporal features into the finally generated text features, the extracted temporal information features may be mapped into the text features through a fully connected layer. The forward and reverse encoding features may be obtained as follows: call the forward encoding relation to encode the current second-class node feature in forward order, obtaining the forward encoding feature; the forward encoding relation may be expressed as:
$$\overrightarrow{h}_q = \overrightarrow{\mathrm{LSTM}}\big(s_q^{(T)},\ \overrightarrow{h}_{q-1}\big)$$

Then the reverse encoding relation is called to encode the current second-class node feature in reverse order, obtaining the reverse encoding feature; the reverse encoding relation may be expressed as:

$$\overleftarrow{h}_q = \overleftarrow{\mathrm{LSTM}}\big(s_q^{(T)},\ \overleftarrow{h}_{q+1}\big)$$
where $q\in[1,Q]$, $\overrightarrow{h}_q$ is the output of the $q$-th unit in the forward encoding direction of the bidirectional LSTM network, $s_q^{(T)}$ is the $q$-th second-class node feature of the $T$-th graph attention layer in the text heterogeneous graph network, $\overrightarrow{h}_{q-1}$ is the output of the $(q-1)$-th unit in the forward encoding direction, $Q$ is the total number of second-class node features, $\overleftarrow{h}_q$ is the output of the $q$-th unit in the reverse encoding direction, $\overleftarrow{h}_{q+1}$ is the output of the $(q+1)$-th unit in the reverse encoding direction, $\overleftarrow{\mathrm{LSTM}}(\cdot)$ is the reverse encoding function of the bidirectional LSTM network, and $\overrightarrow{\mathrm{LSTM}}(\cdot)$ is the forward encoding function of the bidirectional LSTM network.
Of course, temporal feature extraction in this embodiment may also be implemented on the basis of a (unidirectional) long short-term memory network. After the second-class text features are obtained, the relation

$$h_q = \mathrm{LSTM}\big(s_q^{(T)},\ h_{q-1}\big),\qquad q\in[1,Q]$$

may be called to obtain the temporal feature information, where $h_q$ represents the output of the $q$-th unit in the LSTM and $h_{q-1}$ represents the output of the $(q-1)$-th unit, i.e., the output of the previous state.
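Whichever variant is used, the temporal information feature is folded into the final text feature through the fully connected layer mentioned above; a minimal sketch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

d = 128
h_fwd_last = torch.randn(1, d)    # output of the last forward-direction unit
h_bwd_last = torch.randn(1, d)    # output of the last reverse-direction unit
fc = nn.Linear(2 * d, d)

# Concatenate both directions and map the result into the text feature space.
temporal_feat = fc(torch.cat([h_fwd_last, h_bwd_last], dim=1))
```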
The above embodiments place no limitation on how the text features are generated based on the text heterogeneous graph network. The text features are obtained through heterogeneous-graph operations, i.e., the process of updating the nodes of the text heterogeneous graph network. This embodiment provides an optional implementation, which may include the following contents:
To improve the model accuracy of the text heterogeneous graph network, this embodiment may stack multiple layers of the same structure. For ease of description, each layer is called a first graph attention network, and a first fully connected layer is further integrated after each first graph attention network. For each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node feature of the current text heterogeneous node is updated according to whether the current text heterogeneous node is connected to each of the remaining text heterogeneous nodes and the association relationships among the text heterogeneous nodes; based on the node features of each text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
The process of updating the node feature of the current text heterogeneous node according to whether the current text heterogeneous node is connected to each of the remaining text heterogeneous nodes and the association relationships among the text heterogeneous nodes may include:

determining the target text heterogeneous nodes that are connected to the current text heterogeneous node and are not of the same node type;

based on the association relationship between the node feature of the current text heterogeneous node and the node features of the target text heterogeneous nodes, computing the initial weight value between the current text heterogeneous node and each target text heterogeneous node, and determining the weight value of the current text heterogeneous node according to the initial weight values;

updating the node feature of the current text heterogeneous node based on the weight value and the target text heterogeneous nodes, and taking the sum of the updated node feature and the pre-update node feature of the current text heterogeneous node as the node feature of the current text heterogeneous node.
The process of computing the initial weight value between the current text heterogeneous node and each target text heterogeneous node based on the association relationship between their node features may include:

calling the weight calculation relation to compute the initial weight value between the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation relation may be:
$$z_{qp} = \mathrm{LeakyReLU}\big((W_a\, s_q)^{\top}\, W_c\, (W_b\, c_p)\big)$$

where $z_{qp}$ is the initial weight value between the $q$-th text heterogeneous node and the $p$-th text heterogeneous node, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, $W_a$, $W_b$ and $W_c$ are known matrices in $\mathbb{R}^{d\times d}$, with $\mathbb{R}^{d\times d}$ denoting the set of $d\times d$ real matrices and $\mathbb{R}^{d}$ denoting a $d$-dimensional real vector, $s_q\in\mathbb{R}^{d}$ is the node feature of the $q$-th text heterogeneous node, and $c_p\in\mathbb{R}^{d}$ is the node feature of the $p$-th text heterogeneous node.
Updating the node feature of the current text heterogeneous node based on the weight value and the target text heterogeneous nodes includes:

calling the initial update relation to update the node feature of the current text heterogeneous node; the initial update relation may be expressed as:
$$\tilde{s}_q = \sigma\Big(\sum_{p=1}^{N_P} a_{qp}\, W_v\, c_p\Big)$$

where $\tilde{s}_q$ is the updated node feature of the $q$-th text heterogeneous node, $\sigma$ is a hyperparameter, $a_{qp}$ is the normalized weight between the $q$-th step node and the $p$-th component node, $W_v$ is a known $\mathbb{R}^{d\times d}$ matrix, $c_p$ is the node feature of the $p$-th text heterogeneous node, and $N_P$ is the total number of target text heterogeneous nodes.
To make the technical solutions of the present application clearer to those skilled in the art, the present application takes the text to be searched as recipe text, where the recipe text includes cooking-step data (referred to simply as steps) and the cooking steps have a sequential order; the generation process of the whole text feature is described below:
This embodiment constructs the text features into a graph structure, which includes nodes, node features and connection relationships. As shown in FIG. 2, the text features extracted from the first class of text data are $c_i$, $i=1,2,3,4$, and the text features extracted from the second class of text data are $s_i$, $i=1,2,3,4$. The text features extracted from the two classes of text data serve as the nodes of the graph structure, and the connection relationships between the nodes, e.g. $e_{11}$, $e_{32}$, $e_{33}$, are the connection relationships of the graph structure. Since the text to be searched contains only one class of text data, only one type of text feature is obtained; to construct a heterogeneous graph network, the present application may extract features from the image to be searched as the other class of node features. The image to be searched in this embodiment is a recipe step image. First, a step-image dataset is generated from multiple recipe step sample images, and the main ingredients of some recipe step sample images are annotated, for example flour, sugar and papaya. The annotated recipe step sample images are used to train a ResNet50 network to classify the main ingredients of the images. The image to be searched, i.e., the recipe step image to be searched, is input into the trained ResNet50 network to obtain the main-ingredient information of that recipe step image, i.e., the corresponding target recognition features. Ingredients and steps differ in both structure and nature, so they are called heterogeneous nodes. In this embodiment each step is one node, and likewise each ingredient is one node. A node consists of one sentence or one phrase; this embodiment may use the BERT model to extract the feature of each sentence or each word, implemented as follows:
All the recipe text, together with the extracted main-ingredient information, is fed in as the bottom text input, along with the position information and the data type that accompany the recipe text information and the main-ingredient information. Position information means that if a sentence contains five words, e.g. "peel and slice the mango", their position information is "1, 2, 3, 4, 5" respectively. Data type means that if the input is step data its data type is 1, and if the input is ingredient data its data type is 2. Through the BERT model, the encoding feature of every sentence and every word can be obtained; these features represent the node features, namely the ingredient-node features and the step-node features, both of which are high-dimensional vectors of dimension $d$ (d-dimensional real vectors). After the node features have been determined, if a main ingredient occurs in an operation step, the corresponding ingredient node and step node need to be connected by an edge, i.e. the two nodes have a connection relationship. Optionally, by text comparison, the step information can be traversed, each step text extracted, and the main ingredients looked up in turn: if a word of a main ingredient appears in a step, an edge is created between that step and that main ingredient, i.e. they have a connection relationship. By traversing all step texts, the connection relationships between the step nodes and the ingredient nodes, i.e. the connection relationships of the heterogeneous graph, can be constructed. After the heterogeneous graph has been established, its information can be updated with a graph attention network that realizes feature aggregation and update, the update method being to traverse each heterogeneous node in turn. The aggregation and extraction of the text features are thus realized through heterogeneous-graph operations, described below.
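The following sketch illustrates one plausible way to realize the described BERT node encoding and edge construction with the HuggingFace transformers API; the CLS-vector pooling, the reuse of token_type_ids to carry the step/ingredient data type (BERT's type vocabulary has only two slots, so the types 1 and 2 are folded into {0, 1}), and the substring edge test are assumptions.

```python
# Hedged sketch of node-feature extraction and edge construction.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_node(text: str, type_id: int) -> torch.Tensor:
    """Encode one step sentence (type_id=1) or one ingredient phrase (type_id=2).
    BERT adds position ids itself; token_type_ids carry the data type."""
    enc = tokenizer(text, return_tensors="pt")
    enc["token_type_ids"] = torch.full_like(enc["input_ids"], type_id % 2)
    out = bert(**enc)
    return out.last_hidden_state[:, 0]  # (1, d) CLS vector as the node feature

def build_edges(steps: list, ingredients: list):
    """Connect step q to ingredient p when the ingredient word occurs in the step."""
    return [(q, p) for q, s in enumerate(steps)
                   for p, ing in enumerate(ingredients) if ing.lower() in s.lower()]
```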
First, the step nodes are updated. Let $h^{stp}_q$ be the node feature of the q-th step node and $h^{ing}_p$ the feature of the p-th ingredient node. If the q-th step node is connected to the p-th ingredient node by an edge, the feature of the p-th ingredient node is used to update the feature of the q-th step node. During the update, the correlations between nodes need to be considered; in this embodiment the correlation between nodes can be represented by assigned weights. Optionally, the following relation (1) can be called to compute the correlation weight $z_{qp}$ between the q-th step node and the p-th ingredient-node feature. For each step node $h^{stp}_q$, all ingredient nodes connected to it by an edge, assumed to be $N_P$ in number, are traversed, each yielding its corresponding correlation weight $z_{qp}$:
$z_{qp}=\mathrm{LeakyReLU}\big(W_a\otimes\big[\,W_b\otimes h^{stp}_q \,\Vert\, W_c\otimes h^{ing}_p\,\big]\big)$  (1)

where $W_a$, $W_b$ and $W_c$ are known $d\times d$ matrices, $\otimes$ denotes matrix multiplication, i.e. vector mapping, and $\Vert$ denotes feature concatenation in the standard graph-attention form.
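A possible PyTorch rendering of relation (1) follows; since the text does not fully pin down how $W_a$ combines the two mapped features, $W_a$ is modeled here as a scoring map over the concatenated pair, which is an assumption.

```python
# Hedged sketch of the relation-(1) attention score.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768  # node-feature dimension (assumed)

W_a = nn.Linear(2 * d, 1, bias=False)  # scores the concatenated pair (assumed shape)
W_b = nn.Linear(d, d, bias=False)      # maps the step-node feature
W_c = nn.Linear(d, d, bias=False)      # maps the ingredient-node feature

def attention_score(h_step_q: torch.Tensor, h_ing_p: torch.Tensor) -> torch.Tensor:
    """z_qp = LeakyReLU(W_a [W_b h_q || W_c h_p]); inputs are (d,) vectors."""
    pair = torch.cat([W_b(h_step_q), W_c(h_ing_p)], dim=-1)
    return F.leaky_relu(W_a(pair)).squeeze(-1)
```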
After the correlation weights of a step node have been obtained, the correlation weights over all ingredient nodes connected to that step node by an edge can be normalized, i.e. the normalized correlation weight $a_{qp}$ is obtained by calling the following relation (2):

$a_{qp}=\dfrac{\exp(z_{qp})}{\sum_{l=1}^{N_P}\exp(z_{ql})}$  (2)

where $a_{qp}$ denotes the normalized weight between the q-th step node and the p-th ingredient-node feature, $l$ indexes the ingredient nodes starting from the first one, $\exp$ denotes the exponential function, $\exp(z_{qp})$ denotes the exponential of $z_{qp}$, and the denominator sums over the correlation weights of all ingredient nodes connected to the step node by an edge. Finally, the node feature of the step node is updated with the normalized correlation weights, i.e. by calling the following relation (3):
$\hat h^{stp}_q=\sigma\sum_{p=1}^{N_P} a_{qp}\,W_v\otimes h^{ing}_p$  (3)

where $\sigma$ denotes a hyperparameter in the interval $[0,1]$, $W_v$ is a $d\times d$ matrix, and $\hat h^{stp}_q$ is the new feature vector of the step node after being updated by the ingredient nodes connected to it.
Further, based on the idea of residual networks, the following relation (4) can be called to add the updated $\hat h^{stp}_q$ to the initial feature $h^{stp}_q$ before the update:

$h^{stp}_q \leftarrow \hat h^{stp}_q + h^{stp}_q$  (4)
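Relations (2) to (4) for a single step node can be sketched as follows; the value of sigma and the dense tensor layout are assumptions.

```python
# Hedged sketch of one attention update over a step node: softmax-normalize
# the scores, aggregate the connected ingredient features through W_v, scale
# by the sigma hyperparameter, and add the residual.
import torch
import torch.nn as nn

d = 768
W_v = nn.Linear(d, d, bias=False)  # the value mapping of relation (3)

def update_step_node(h_q, neighbor_feats, z_scores, sigma=0.5):
    """h_q: (d,) step-node feature; neighbor_feats: (N_P, d) features of the
    connected ingredient nodes; z_scores: (N_P,) relation-(1) scores;
    sigma: the [0, 1] hyperparameter (value assumed)."""
    a = torch.softmax(z_scores, dim=0)                             # relation (2)
    h_hat = sigma * (a.unsqueeze(1) * W_v(neighbor_feats)).sum(0)  # relation (3)
    return h_hat + h_q                                             # relation (4)
```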
Similarly, relation (5) can be called to perform the same calculation and update for the ingredient nodes, giving the updated feature $h^{ing,(k+1)}_p$ of the ingredient node $h^{ing,(k)}_p$:

$h^{ing,(k+1)}_p = h^{ing,(k)}_p + \sigma\sum_{q\in N_Q} a^{(k)}_{qp}\,W^{(k)}_v\otimes h^{stp,(k)}_q$  (5)

where $a^{(k)}_{qp}$ is the normalized weight between the q-th step node and the p-th ingredient-node feature in the k-th network layer, $W^{(k)}_v$ is the trainable weight matrix of the k-th network layer, and $N_Q$ is the set of neighboring step nodes connected to the ingredient node.
Once all ingredient nodes and step nodes have been traversed, one layer of the graph attention network has been updated. Typically, T layers of graph attention networks can be stacked, with t denoting the t-th layer; the node features of every layer are updated as above. An integrated fully connected layer is usually added after each graph attention layer to re-encode the node features (including the ingredient nodes and the step nodes), as shown in the following relation (6):

$h^{(t+1,\,0)}=\mathrm{FFN}\big(h^{(t)}\big)$  (6)

where FFN denotes the fully connected layer and $h^{(t+1,\,0)}$ denotes the initialized node features for the graph attention network at layer t+1.
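A compact sketch of stacking T attention layers with the relation-(6) re-encoding follows; `layer_update` stands for the per-layer heterogeneous update sketched above and is passed in as a callable, which keeps the sketch self-contained.

```python
# Hedged sketch: T stacked attention layers, each followed by an FFN that
# re-encodes the node features for the next layer (relation (6)).
import torch.nn as nn

class HeteroGraphEncoder(nn.Module):
    def __init__(self, d: int, num_layers: int, layer_update):
        super().__init__()
        self.layer_update = layer_update  # per-layer heterogeneous update (callable)
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
            for _ in range(num_layers))

    def forward(self, step_feats, ing_feats, edges):
        for ffn in self.ffn:
            step_feats, ing_feats = self.layer_update(step_feats, ing_feats, edges)
            step_feats, ing_feats = ffn(step_feats), ffn(ing_feats)  # relation (6)
        return step_feats, ing_feats
```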
As described above, the update of the node features is complete. In order to perform retrieval against recipe images, the features of all text nodes, such as the operation steps and the ingredient information, still need to be summarized and synthesized. Since the step nodes have fused the ingredient-node information, and the ingredient nodes, updated through the graph neural network, emphasize the related step-node features in the form of keywords, a BiLSTM (Bi-directional Long Short-Term Memory) network can be used, once the text features have been obtained, to further mine the temporal information of the step nodes, summarizing the text-node features and packing them into one vector.
This embodiment can call the following relations (7) and (8) to extract the temporal information features of all step nodes:
$\overrightarrow{h}_q=\overrightarrow{\mathrm{LSTM}}\big(h^{stp,(T)}_q,\ \overrightarrow{h}_{q-1}\big)$  (7)

$\overleftarrow{h}_q=\overleftarrow{\mathrm{LSTM}}\big(h^{stp,(T)}_q,\ \overleftarrow{h}_{q+1}\big)$  (8)

Here the left and right arrows denote the direction of the LSTM encoding, i.e. forward-order and reverse-order encoding of the step-node features. $\overrightarrow{h}_q$ (resp. $\overleftarrow{h}_q$) denotes the output of the q-th BiLSTM unit, the arrow direction indicating the BiLSTM encoding obtained under the corresponding input order of the step nodes. Likewise, $\overrightarrow{h}_{q-1}$ denotes the output of the (q-1)-th BiLSTM unit, i.e. the output of the previous state. Assuming the recipe has Q steps, $\overrightarrow{h}_0$ is 0, and $h^{stp,(T)}_q$ denotes the feature of the q-th step node in the T-th layer of the graph neural network. The step features are input into the corresponding BiLSTM networks in step order and in reverse order, and finally the BiLSTM encodings of all step nodes are obtained, as shown in the following relation (9):

$e_q=\big[\overrightarrow{h}_q,\ \overleftarrow{h}_q\big],\quad q=1,\dots,Q$  (9)
After the outputs of all BiLSTM units have been obtained, the output of the whole text feature is obtained by summing them and taking the average, i.e. $e_{rec}=\frac{1}{Q}\sum_{q=1}^{Q}e_q$, where $e_{rec}$ denotes the output text feature used for the subsequent retrieval. The $e_{rec}$ feature is then fused with the dish-name (title) feature as $e_{rec}=[e_{rec}, e_{ttl}]$, where $[\ ]$ denotes feature concatenation, i.e. the features are joined end to end. Finally, $e_{rec}$ passes through a fully connected layer for feature mapping, i.e. $e_{rec}=fc(e_{rec})$, yielding a vector of a new dimension, namely the text feature information of the recipe text, which is used for matching against the encoded features of recipe images.
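The step-sequence encoding and the final fusion can be sketched as follows; the output dimension of the fully connected layer and the use of nn.LSTM's bidirectional mode to realize relations (7) to (9) are assumptions.

```python
# Hedged sketch of relations (7)-(9) and the final text encoding: run the
# step-node features through a BiLSTM, mean-pool the unit outputs,
# concatenate the title feature, and map through a fully connected layer.
import torch
import torch.nn as nn

d = 768
bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
fc = nn.Linear(2 * d + d, d)  # output dimension d is an assumption

def encode_recipe(step_feats: torch.Tensor, e_ttl: torch.Tensor) -> torch.Tensor:
    """step_feats: (Q, d) last-layer step-node features in step order;
    e_ttl: (d,) dish-title feature."""
    out, _ = bilstm(step_feats.unsqueeze(0))   # (1, Q, 2d): forward||backward per step
    e_rec = out.mean(dim=1).squeeze(0)         # sum-and-average over the Q units
    e_rec = torch.cat([e_rec, e_ttl], dim=-1)  # e_rec = [e_rec, e_ttl]
    return fc(e_rec)                           # e_rec = fc(e_rec)
```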
The above embodiments do not limit how step S103 is performed. Based on the above embodiments, the present application further provides an optional implementation, including the following contents:
Similarly, in order to improve model performance, the image heterogeneous graph network may include multiple layers of second graph attention networks, each layer of which is followed by an integrated second fully connected layer. The image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched. For each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, the node features of the current image heterogeneous node are updated according to whether the current image heterogeneous node is connected to each of the remaining image heterogeneous nodes and according to the association relationships between the image heterogeneous nodes. Based on the node features of every image heterogeneous node of the updated image heterogeneous graph network, the image encoding features of the image to be searched are generated; the image encoding features are then input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
Here, the image feature extraction model is used to extract the original image features of the image to be searched and of the image samples; it can be based on any existing image feature extraction model, which does not affect the implementation of the present application. As for the graph operations of the image heterogeneous graph network, they can be implemented based on the graph operation method of the text heterogeneous graph network provided in the above embodiments and are not repeated here. The image targeted by this embodiment is an image comprising a group of sub-images, and the image feature generation model is used to integrate all the image features of the image to be searched.
Similarly, in order to make the technical solution of the present application clearer to those skilled in the art, this embodiment takes a set of recipe step images as the image to be searched and describes the whole image-feature generation process:
First, a ResNet backbone network can be used to extract the original image features of each recipe step image: the features of the layer immediately before the ResNet classification layer are taken as the feature of each image and used to construct the image nodes of the image heterogeneous graph network, denoted $h^{img}_m$. "Ingredients" are the ingredients of a dish and are referred to uniformly as ingredients below. In this embodiment the main ingredients of a dish are obtained by classifying the recipe step images into category labels; the dish has as many ingredients as the number of category labels obtained through image classification. For example, scrambled eggs with tomatoes yields labels such as tomato, egg and oil. As shown in FIG. 3, the image heterogeneous graph network consists of nodes and relationships. The bottom row, $h^{ing}_n$, represents the ingredient nodes, which come from the classification labels that the image classification network assigns to the images. Each category label, for example mango, is input into the BERT network model to obtain the encoding feature of the category word or phrase, which serves as the node feature. The relationships are likewise established through the classification network: if a category appears in the classification result of an image, an edge is created between that step-image feature and that ingredient. As shown in FIG. 3, mango appears in all step images, so every step image establishes an edge with it. With the nodes and edges established, the image heterogeneous graph network is then used for computation to obtain the corresponding image features, as described below.
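One plausible way to assemble the image-side heterogeneous graph from the pieces just described is sketched below; the projection to a shared node dimension and the parameter names are assumptions.

```python
# Hedged sketch: step-image nodes from the ResNet penultimate features,
# ingredient nodes from BERT encodings of the predicted labels, and an edge
# wherever a label fires for an image.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-1])  # keep up to global avg-pool
proj = nn.Linear(2048, 768)  # project to the shared node dimension (assumed)

def build_image_graph(step_images, label_sets, label_texts, encode_node):
    """step_images: (M, 3, 224, 224) tensor; label_sets[m]: predicted label
    indices for image m; label_texts[n]: word/phrase of ingredient label n;
    encode_node: the BERT encoder sketched earlier (type id 2 for ingredients)."""
    with torch.no_grad():
        img_nodes = proj(backbone(step_images).flatten(1))               # (M, 768)
        ing_nodes = torch.cat([encode_node(t, 2) for t in label_texts])  # (N, 768)
    # an edge (m, n) whenever label n fires in the classification of image m
    edges = [(m, n) for m, labels in enumerate(label_sets) for n in labels]
    return img_nodes, ing_nodes, edges
```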
First, the step-image nodes are updated. Let $h^{img}_m$ be the node feature of the m-th step-image node and $h^{ing}_n$ the feature of the n-th ingredient node. If the m-th step-image node is connected to the n-th ingredient node by an edge, the feature of the n-th ingredient node is used to update the feature of the m-th step-image node. During the update, the correlations between nodes need to be considered; in this embodiment the correlation between nodes can be represented by assigned weights. Optionally, the following relation (10) can be called to compute the correlation weight $z_{mn}$ between the m-th step-image node and the n-th ingredient-node feature. For each step-image node $h^{img}_m$, all ingredient nodes connected to it by an edge, assumed to be $N_N$ in number, are traversed, each yielding its corresponding correlation weight $z_{mn}$:
$z_{mn}=\mathrm{LeakyReLU}\big(W_d\otimes\big[\,W_e\otimes h^{img}_m \,\Vert\, W_f\otimes h^{ing}_n\,\big]\big)$  (10)

where $W_d$, $W_e$ and $W_f$ are known $d\times d$ matrices, $\otimes$ denotes matrix multiplication, i.e. vector mapping, and $\Vert$ denotes feature concatenation, in the same form as relation (1).
After the correlation weights of a step-image node have been obtained, the correlation weights over all ingredient nodes connected to that node by an edge can be normalized, i.e. the normalized weight $a_{mn}$ is obtained by calling the following relation (11):

$a_{mn}=\dfrac{\exp(z_{mn})}{\sum_{l=1}^{N_N}\exp(z_{ml})}$  (11)

where $\exp$ denotes the exponential function and the denominator sums over the correlation weights of all ingredient nodes connected to the step-image node by an edge. Finally, the node feature of the step-image node is updated with the normalized correlation weights, i.e. by calling the following relation (12):
$\hat h^{img}_m=\sigma\sum_{n=1}^{N_N} a_{mn}\,W_v\otimes h^{ing}_n$  (12)

where $\hat h^{img}_m$ denotes the updated node feature of the step-image node, i.e. the new feature vector after being updated by the ingredient nodes connected to it, $\sigma$ denotes a hyperparameter in the interval $[0,1]$, and $W_v$ is a $d\times d$ matrix.
Further, based on the idea of residual networks, the following relation (13) can be called to add the updated $\hat h^{img}_m$ to the initial feature $h^{img}_m$ before the update:

$h^{img}_m \leftarrow \hat h^{img}_m + h^{img}_m$  (13)
Similarly, with $N_M$ denoting the set of step-image nodes connected to the given ingredient node (M step-image nodes in total), relation (14) can be called to perform the same calculation and update for the ingredient nodes:

$h^{ing,(k+1)}_n = h^{ing,(k)}_n + \sigma\sum_{m\in N_M} a^{(k)}_{mn}\,W^{(k)}_v\otimes h^{img,(k)}_m$  (14)

where $a_{mn}$ denotes the normalized weight between the m-th step-image node and the n-th ingredient-node feature (and $a_{qp}$, on the text side, the normalized weight between the q-th step node and the p-th ingredient-node feature), $h^{ing,(k)}_n$ denotes the initial feature before the update, $h^{ing,(k+1)}_n$ the updated feature, $W^{(k)}_v$ the trainable weight matrix of the k-th network layer, and $\otimes$ matrix multiplication, i.e. the mapping of the node feature $h^{img,(k)}_m$ through $W^{(k)}_v$.
Once all ingredient nodes and step-image nodes have been traversed, one layer of the graph attention network has been updated. Typically, T layers of graph attention networks can be stacked, with t denoting the t-th layer; the node features of every layer are updated as above. An integrated fully connected layer is usually added after each graph attention layer to re-encode the node features (including the ingredient nodes and the step-image nodes), as shown in the following relation (15):

$h^{(t+1,\,0)}=\mathrm{FFN}\big(h^{(t)}\big)$  (15)

where FFN denotes the fully connected layer and $h^{(t+1,\,0)}$ denotes the initialized node features for the graph attention network at layer t+1.
After the image heterogeneous graph network has produced the image features of the recipe step images, these features can be input into a long short-term memory (LSTM) network to obtain the overall feature of the recipe step images, i.e. via the relation $h^{lstm}_m=\mathrm{LSTM}\big(h^{img,(T)}_m,\ h^{lstm}_{m-1}\big)$, where LSTM denotes one unit of the LSTM network, $h^{lstm}_m$ denotes the output of the m-th LSTM unit, and $h^{img,(T)}_m$ denotes the recipe step-image feature, taken from the heterogeneous-graph node features of the last layer, with m indexing the m-th image. Accordingly, the feature encoding output of the last LSTM unit serves as the feature output $e_{csi}$ of the recipe step images, i.e. $e_{csi}=h^{lstm}_M$.
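A minimal sketch of this image-side aggregation follows; the feature dimension is an assumption.

```python
# Hedged sketch: feed the last-layer step-image node features through an LSTM
# in step order and keep the last unit's output as the encoding e_csi.
import torch
import torch.nn as nn

d = 768
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

def encode_step_images(img_node_feats: torch.Tensor) -> torch.Tensor:
    """img_node_feats: (M, d) heterogeneous-graph image-node features, in order."""
    out, _ = lstm(img_node_feats.unsqueeze(0))  # (1, M, d)
    return out[0, -1]                           # e_csi: output of the last LSTM unit
```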
Based on the above embodiments, this embodiment further provides a training method for the bidirectional search model between image data and text data; referring to FIG. 4, it may include the following contents:
S401: pre-building an image-text bidirectional search model;
S402: for each group of training samples in the training sample set, respectively obtaining the original image features, target recognition features and image features of the image sample in the current group of training samples and the target text features and text features of the text sample.
The training sample set of this step includes multiple groups of training samples, each group including a corresponding text sample and image sample; that is, the text sample and the image sample are a matched group of sample data. The number of training sample groups in the training sample set can be determined according to actual training requirements and actual application scenarios, which the present application does not limit in any way. The text samples in the training sample set can be obtained from any existing database, and the image samples corresponding to the text samples can be obtained from the corresponding database. Of course, in order to expand the training sample set, a text sample or image sample can also be data obtained by cropping, splicing, stretching or otherwise processing an original text sample or image sample.
S403: constructing the text heterogeneous graph network of the image-text bidirectional search model by taking the target recognition features and the target text features respectively as text heterogeneous node features and determining the connecting edges according to the inclusion relationships between the target recognition features and the target text features;
S404: constructing the image heterogeneous graph network of the image-text bidirectional search model by taking the original image features and the target recognition features respectively as image heterogeneous node features and determining the connecting edges according to the association relationships between the target recognition features and the original image features;
S405: inputting the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network, and training the image-text bidirectional search model.
In this embodiment, the text feature information of one text sample corresponds to the image features of one image sample. During model training, a loss function is used to guide the training, and the network parameters of the image-text bidirectional search model are updated by means such as gradient backpropagation until the training conditions are met, for example a target number of iterations is reached or convergence is satisfactory. The training process of the image-text bidirectional search model may, for example, include a forward propagation stage and a backpropagation stage: the forward propagation stage propagates data from lower layers to higher layers, while the backpropagation stage propagates the error from higher layers to lower layers whenever the forward result does not match expectations. First, all network-layer weights are initialized, e.g. randomly; then the input image features and text feature information are forward-propagated through the graph neural network, convolutional, downsampling and fully connected layers to obtain an output value. The model output value of the image-text bidirectional search model is computed, and its loss value is computed based on the loss function. The error is propagated back into the image-text bidirectional search model, and the backpropagation errors of its parts, such as the graph neural network layers, fully connected layers and convolutional layers, are obtained in turn. Each layer of the model adjusts all of its weight coefficients according to its backpropagation error, thereby updating the weights. A new batch of image features and text feature information is then randomly selected and the above process is performed again to obtain a new forward-propagation output. Iterating in this way, the training ends when the error between the computed model output value and the target value (i.e. the label) is smaller than a preset threshold, or when the number of iterations exceeds a preset number. All layer parameters of the model at the end of training are taken as the network parameters of the trained image-text bidirectional search model.
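The training loop just described might be sketched as follows; the model interface, the optimizer choice and the batch format are assumptions.

```python
# Hedged sketch of the training loop: forward both branches, score the batch
# with the loss below, backpropagate, and stop on an iteration budget.
import torch

def train(model, loader, loss_fn, epochs=50, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts, labels in loader:
            e_img, e_rec = model(images, texts)  # image / text feature branches
            loss = loss_fn(e_img, e_rec, labels)
            opt.zero_grad()
            loss.backward()  # errors propagate back layer by layer
            opt.step()       # weight update
```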
Here, in order to improve the training accuracy of the model, this embodiment further provides an optional implementation of the loss function: based on the text features and the corresponding image features of each group of training samples, the loss function is called to guide the training process of the image-text bidirectional search model. The loss function can be expressed as:

$\mathcal{L}=\sum_{a=1}^{N}\Big[d\big(e^{img}_a,\,e^{rec,p}_a\big)-\min_{n:\,y_n\neq y_a} d\big(e^{img}_a,\,e^{rec}_n\big)+\alpha\Big]_{+}+\sum_{a=1}^{N}\Big[d\big(e^{rec}_a,\,e^{img,p}_a\big)-\min_{n:\,y_n\neq y_a} d\big(e^{rec}_a,\,e^{img}_n\big)+\alpha\Big]_{+}$

where $\mathcal{L}$ is the loss function, $[x]_{+}=\max(0,x)$, $d(\cdot,\cdot)$ is a distance and $\min d(\cdot)$ denotes taking the minimum of the computed distances, $y_n$ is the category label of $e^{img}_n$ and $e^{rec}_n$, $y_a$ is the category label of $e^{img}_a$ and $e^{rec}_a$, and N is the number of training sample groups: the training traverses N times, N being the number of paired samples in the current batch. First the image-group features $e^{img}$ are traversed (N in total); the image sample selected by the traversal is denoted $e^{img}_a$, a standing for anchor. The text feature encoding paired with the anchor sample is denoted $e^{rec,p}_a$, p standing for positive. Likewise, a text feature in the batch that is not paired with $e^{img}_a$ is denoted $e^{rec}_n$, and $\alpha$ is a hyperparameter that is fixed during training, for example set to 0.3. In the same way, the same traversal is performed over the text features: $e^{rec}_a$ denotes the sample selected in the traversal, its corresponding positive image-group feature sample is denoted $e^{img,p}_a$, the non-corresponding ones are denoted $e^{img}_n$, and $\alpha$ is a hyperparameter.
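A hedged reading of this loss as a bidirectional triplet objective with hard negatives is sketched below; the Euclidean distance and the exact masking rule are assumptions.

```python
# Hedged sketch of the batch loss: bidirectional triplet objective with hard
# negatives and margin alpha (the text fixes alpha, e.g. 0.3).
import torch

def bidirectional_triplet_loss(e_img, e_rec, labels, alpha=0.3):
    """e_img, e_rec: (N, d) paired batch features; labels: (N,) category labels."""
    dist = torch.cdist(e_img, e_rec)                   # (N, N) pairwise distances
    pos = dist.diag()                                  # d(anchor, positive)
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-category pairs are excluded
    masked = dist.masked_fill(mask, float("inf"))
    hard_i2t = masked.min(dim=1).values                # nearest non-matching text per image
    hard_t2i = masked.min(dim=0).values                # nearest non-matching image per text
    loss = torch.relu(pos - hard_i2t + alpha) + torch.relu(pos - hard_t2i + alpha)
    return loss.mean()
```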
For steps of this embodiment that are the same as or similar to those of the above embodiments, reference may be made to the implementations described in the above embodiments, which are not repeated here.
It should be noted that there is no strict order of execution between the steps in the present application; as long as the logical order is respected, these steps may be executed simultaneously or in some preset order. FIG. 1 and FIG. 4 are only schematic and do not imply that only such an execution order is possible.
The embodiments of the present application also provide corresponding apparatuses for the image-text bidirectional search method and the training method of the image-text matching model, further making the methods more practical. The apparatuses can be described from the perspective of functional modules and from the perspective of hardware. The image-text bidirectional search apparatus and the training apparatus of the image-text matching model introduced below may be read in correspondence with the image-text bidirectional search method and the training method of the image-text matching model described above.
From the perspective of functional modules, referring first to FIG. 5, FIG. 5 is a structural diagram of the image-text bidirectional search apparatus provided by an embodiment of the present application in one implementation; the apparatus may include:
an image recognition module 501, configured to call the image recognition network of the pre-trained image-text bidirectional search model to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched;
a text feature extraction module 502, configured to obtain, based on the text heterogeneous graph network of the image-text bidirectional search model, the text features of the text to be searched that contains only one type of target text data; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connecting edges of the text heterogeneous graph network are determined by the inclusion relationships between the target recognition features and the target text features;
an image feature extraction module 503, configured to obtain, based on the image heterogeneous graph network of the image-text bidirectional search model, the image features of the image to be searched that includes a group of sub-images; the original image features and the target recognition features of the image to be searched serve as node features of the image heterogeneous graph network, and the connecting edges of the image heterogeneous graph network are determined by the association relationships between the target recognition features and the original image features;
a bidirectional search module 504, configured to input the image features and the text features into the pre-trained image-text bidirectional search model to obtain the image-text search results; the image-text bidirectional search model includes the text heterogeneous graph network, the image heterogeneous graph network and the image recognition network.
Optionally, in some implementations of this embodiment, the text feature extraction module 502 may further be configured to obtain the text features of the text to be searched that contains only one type of target text data, including: in response to a text splitting instruction, splitting the target recognition features into multiple text phrases and/or text words, and splitting the target text data into multiple text sentences; inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain multiple first-class node features; and inputting each text sentence into the text feature extraction model to obtain multiple second-class node features.
As an optional implementation of the above embodiment, the text feature extraction module 502 may further include a feature extraction unit, configured to build a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer, the feature extraction layer being a transformer-based bidirectional encoder; the language representation model is trained with a natural-language text sample data set, and the trained language representation model serves as the text feature extraction model.
As another optional implementation of the above embodiment, the text feature extraction module 502 may further include a position input unit, configured to input each text sentence, together with the position information, within the current text sentence, of each phrase and each word contained in it, into the text feature extraction model.
As yet another optional implementation of the above embodiment, the text feature extraction module 502 may further include an identification processing unit, configured to obtain the data type of the data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model; the data type includes a first identifier for identifying the target recognition features and a second identifier for identifying the target text data.
As yet another optional implementation of the above embodiment, the text feature extraction module 502 may further include an edge-connection determination unit, configured to traverse, for each text phrase or text word in the target recognition features, each text sentence of the target text data in turn; if the target phrase contained in the current text sentence is the same as the current text phrase, the second-class node feature corresponding to the current text sentence and the first-class node feature corresponding to the current text phrase have a connection relationship; if the target word contained in the current text sentence is the same as the current text word, the second-class node feature corresponding to the current text sentence and the first-class node feature corresponding to the current text word have a connection relationship.
Optionally, as an optional implementation of the above embodiment, the image recognition module 501 may further be configured to train the image recognition network in advance with a target training sample set in which the corresponding target recognition features are annotated in image samples containing multiple sub-images, and to input the image to be searched into the image recognition network to obtain the target recognition features contained in each sub-image of the image to be searched.
As an optional implementation of the above embodiment, the target recognition network structure includes an input layer, a convolution structure, a pooling layer and a classifier. The convolution structure includes a basic operation component and a residual operation component: the basic operation component sequentially performs convolution, normalization, activation-function processing and max pooling on the input image; the residual operation component includes multiple connected residual blocks, each of which includes multiple convolutional layers and performs convolution on the output features of the basic operation component. The pooling layer converts the output features of the convolution structure into a target feature vector and delivers it to the classifier, and the classifier computes on the target feature vector and outputs the probability of the category label to which it belongs.
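The described structure maps naturally onto a small PyTorch module; the channel widths, block count and softmax output below are illustrative assumptions rather than the patent's exact configuration.

```python
# Hedged sketch of the described target recognition network: a stem (conv,
# norm, activation, max-pool), stacked residual blocks, global pooling, and
# a classifier that outputs category-label probabilities.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(self.body(x) + x)  # residual connection

class TargetRecognitionNet(nn.Module):
    def __init__(self, num_classes, ch=64, blocks=4):
        super().__init__()
        self.stem = nn.Sequential(            # conv + norm + activation + max-pool
            nn.Conv2d(3, ch, 7, stride=2, padding=3),
            nn.BatchNorm2d(ch), nn.ReLU(), nn.MaxPool2d(3, stride=2, padding=1))
        self.res = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool2d(1)   # pooling layer -> target feature vector
        self.cls = nn.Linear(ch, num_classes) # classifier
    def forward(self, x):
        x = self.pool(self.res(self.stem(x))).flatten(1)
        return torch.softmax(self.cls(x), dim=1)  # category-label probabilities
```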
Optionally, in other implementations of this embodiment, the text feature extraction module 502 may further include a graph operation unit, configured such that the text heterogeneous graph network includes multiple layers of first graph attention networks, each layer of which is followed by an integrated first fully connected layer; for each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, the node features of the current text heterogeneous node are updated according to whether the current text heterogeneous node is connected to each of the remaining text heterogeneous nodes and according to the association relationships between the text heterogeneous nodes; based on the node features of every text heterogeneous node of the updated text heterogeneous graph network, the text features of the text to be searched are generated.
As an optional implementation of the above embodiment, the graph operation unit may further be configured to: determine the target text heterogeneous nodes that are connected to the current text heterogeneous node and are not of the same node type; based on the association relationships between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node, compute the initial weight values between the current text heterogeneous node and every target text heterogeneous node, and determine the weight values of the current text heterogeneous node according to the initial weight values; and, based on the weight values and the target text heterogeneous nodes, update the node feature of the current text heterogeneous node, taking the sum of the updated node feature and the pre-update node feature of the current text heterogeneous node as its node feature.
As another optional implementation of the above embodiment, the graph operation unit may further be configured to call a weight calculation relation to compute the initial weight value between the current text heterogeneous node and each target text heterogeneous node; the weight calculation relation is:

$z_{qp}=\mathrm{LeakyReLU}\big(W_a\otimes\big[\,W_b\otimes h_q \,\Vert\, W_c\otimes h_p\,\big]\big)$

where $z_{qp}$ is the initial weight value between the q-th and the p-th text heterogeneous nodes, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, $W_a$, $W_b$ and $W_c$ are known $d\times d$ matrices, $h_q$ is the node feature of the q-th text heterogeneous node, and $h_p$ is the node feature of the p-th text heterogeneous node.
As another optional implementation of the above embodiment, the graph operation unit may further be configured to call an initial update relation to update the node feature of the current text heterogeneous node; the initial update relation is:

$\hat h_q=\sigma\sum_{p=1}^{N_P} a_{qp}\,W_v\otimes h_p$

where $\hat h_q$ is the updated node feature of the q-th text heterogeneous node, $\sigma$ is a hyperparameter, $a_{qp}$ is the normalized weight between the q-th step node and the p-th ingredient-node feature, $W_v$ is a known $d\times d$ matrix, $h_p$ is the node feature of the p-th text heterogeneous node, and $N_P$ is the total number of target text heterogeneous nodes.
Optionally, in still other implementations of this embodiment, the text feature extraction module 502 may further include a temporal feature extraction unit, configured such that the second-class node features have an execution order; each second-class node feature, together with the order information, is input into a pre-trained temporal feature extraction model to obtain the temporal information features, and the temporal information features are mapped into the text features through a fully connected layer.
As an optional implementation of the above embodiment, the temporal feature extraction unit may further be configured to: based on the order of the second-class node features, input them in order and in reverse order into a bidirectional long short-term memory network to obtain the temporal encoding features of each second-class node feature; and determine the temporal information features according to the temporal encoding feature of every second-class node feature.
As another optional implementation of the above embodiment, the temporal feature extraction unit may further be configured to: for every second-class node feature, call the forward-order encoding relation to encode the current second-class node feature in forward order and obtain the forward-order encoding feature; the forward-order encoding relation is:

$\overrightarrow{h}_q=\overrightarrow{\mathrm{LSTM}}\big(h^{(T)}_q,\ \overrightarrow{h}_{q-1}\big)$

call the reverse-order encoding relation to encode the current second-class node feature in reverse order and obtain the reverse-order encoding feature; the reverse-order encoding relation is:

$\overleftarrow{h}_q=\overleftarrow{\mathrm{LSTM}}\big(h^{(T)}_q,\ \overleftarrow{h}_{q+1}\big)$

and take the forward-order encoding feature and the reverse-order encoding feature as the temporal encoding features of the current second-class node feature;

where $q\in[1,Q]$, $\overrightarrow{h}_q$ is the output of the q-th unit of the bidirectional long short-term memory network in the forward encoding direction, $h^{(T)}_q$ is the q-th second-class node feature of the T-th graph attention layer in the text heterogeneous graph network, $\overrightarrow{h}_{q-1}$ is the output of the (q-1)-th unit in the forward encoding direction, Q is the total number of second-class node features, $\overleftarrow{h}_q$ is the output of the q-th unit in the backward encoding direction, $\overleftarrow{h}_{q+1}$ is the output of the (q+1)-th unit in the backward encoding direction, $\overleftarrow{\mathrm{LSTM}}$ is the backward encoding function of the bidirectional long short-term memory network, and $\overrightarrow{\mathrm{LSTM}}$ is its forward encoding function.
可选的,在本实施例的一些实施方式中,上述图像特征提取模块503还可用于:图像异质图网络包括多层第二图注意网络,每一层第二图注意网络之后还集成第二全连接层;将待搜索图像输入至预先训练好的图像特征提取模型,得到待搜索图像的原始图像特征;对图像异质图网络的各第二图注意力网络的每个图像异质节点,根据当前图像异质节点与其余各图像异质节点之间是否具有连接关系以及各图像异质节点之间的关联关系,更新当前图像异质节点的节点特征;基于更新后的图像异质图网络的每个图像异质节点的节点特征,生成待搜索文本的图像编码特征;将图像编码特征输入至预先训练好的图像特征生成模型,得到待搜索图像的图像特征。Optionally, in some implementations of the present embodiment, the above-mentioned image feature extraction module 503 can also be used for: the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is also integrated with a second fully connected layer; the image to be searched is input into a pre-trained image feature extraction model to obtain the original image features of the image to be searched; for each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, the node features of the current image heterogeneous node are updated; based on the node features of each image heterogeneous node of the updated image heterogeneous graph network, the image coding features of the text to be searched are generated; the image coding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
其次,请参见图6,图6为本申请实施例提供的图像文本匹配模型的训练装置在一种实施方式下的结构图,该装置可包括:Next, please refer to FIG. 6 , which is a structural diagram of a training device for an image-text matching model provided in an embodiment of the present application in one implementation manner, and the device may include:
特征提取模块601,用于对训练样本集的每组训练样本,分别获取当前组训练样本中的图像样本的原始图像特征、目标识别特征、图像特征和文本样本的目标文本特征、文本特征;目标文本特征包括目标识别特征;图像样本包括一组子图像;The feature extraction module 601 is used to obtain the original image features, target recognition features, image features of the image samples in the current group of training samples and the target text features and text features of the text samples for each group of training samples in the training sample set; the target text features include the target recognition features; the image samples include a group of sub-images;
模型搭建模块602,用于预先搭建图文双向搜索模型;基于将目标识别特征和目标文本特征分别作为文本异质节点特征,并根据目标识别特征与目标文本特征间的包含关系确定连接边,构建图文双向搜索模型的文本异质图网络;基于将原始图像特征和目标识别特征分别作为图像异质节点特征,并根据各目标识别特征与原始图像特征间的关联关系确定连接边,构建图文双向搜索模型的图像异质图网络; Model building module 602, used to pre-build a bidirectional image-text search model; based on using target recognition features and target text features as text heterogeneous node features respectively, and determining connecting edges according to the inclusion relationship between the target recognition features and the target text features, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on using original image features and target recognition features as image heterogeneous node features respectively, and determining connecting edges according to the correlation relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
模型训练模块603,用于将每组训练样本的图像特征输入图像异质图网络、文本特征输入至文本异质图网络中,训练图文双向搜索模型。The model training module 603 is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
本申请实施例图文双向搜索装置及图像文本匹配模型的训练装置的各功能模块的功能可根据上述方法实施例中的方法实现,其实现过程可以参照上述方法实施例的相关描述,此处不再赘述。The functions of the various functional modules of the image-text bidirectional search device and the image-text matching model training device in the embodiment of the present application can be implemented according to the method in the above-mentioned method embodiment. The implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.
由上可知,本申请实施例可有效提升图像数据和文本数据之间的双向搜索精度。It can be seen from the above that the embodiments of the present application can effectively improve the accuracy of two-way search between image data and text data.
上文中提到的图文双向搜索装置及图像文本匹配模型的训练装置是从功能模块的角度描述,进一步的,本申请还提供一种图文双向搜索设备,是从硬件角度描述。图7为本申请实施例提供的图文双向搜索设备在一种实施方式下的结构示意图。如图7所示,该图文双向搜索设备可包括存储器70,用于存储计算机程序;处理器71,用于执行计算机程序时实现如上述任一实施例提到的图文双向搜索方法及图像文本匹配模型的训练方法的步骤。人机交互组件72用于通过信息输入/信息输出接口,接收用户输入的训练样本集选择请求、模型训练请求、搜索请求以及向用户展示图文搜索结果;通信组件73用于传输图像文本匹配模型的训练过程中以及图文双向搜索任务执行过程中的数据及指令。The image-text bidirectional search device and the image-text matching model training device mentioned above are described from the perspective of functional modules. Furthermore, the present application also provides an image-text bidirectional search device, which is described from the perspective of hardware. Figure 7 is a structural schematic diagram of the image-text bidirectional search device provided in an embodiment of the present application under one implementation. As shown in Figure 7, the image-text bidirectional search device may include a memory 70 for storing computer programs; a processor 71 for implementing the steps of the image-text bidirectional search method and the image-text matching model training method mentioned in any of the above embodiments when executing the computer program. The human-computer interaction component 72 is used to receive the training sample set selection request, model training request, search request input by the user through the information input/information output interface, and to display the image-text search results to the user; the communication component 73 is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task.
其中,处理器71可以包括一个或多个处理核心,比如4核心处理器、8核心处理器,处理器71还可为控制器、微控制器、微处理器或其他数据处理芯片等。处理器71可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器71也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器71可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器71还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。The processor 71 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 71 may also be a controller, a microcontroller, a microprocessor or other data processing chip. The processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 71 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also known as CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 71 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
存储器70可以包括一个或多个计算机可读存储介质,该计算机非易失性可读存储介质可以是非暂态的。存储器70还可包括高速随机存取存储器以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。存储器70在一些实施例中可以是图文双向搜索设备的内部存储单元,例如服务器的硬盘。存储器70在另一些实施例中也可以是图文双向搜索设备的外部存储设备,例如服务器上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器70还可以既包括图文双向搜索设备的内部存储单元也包括外部存储设备。存储器70不仅可以用于存储安装于图文双向搜索设备的应用软件及各类数据,例如:执行图文双向搜索过程中以及图像文本匹配模型的训练过程中使用以及产生的程序的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。本实施例中,存储器70至少用于存储以下计算机程序701,其中,该计算机程序被处理器71加载并执行之后,能够实现前述任一实施例公开的图文双向搜索方法中以及图像文本匹配模型的训练方法的相关步骤。另外,存储器70所存储的资源还可以包括操作系统702和数据703等,存储方式可以是短暂存储或者永久存储。其中,操作系统702可以包括Windows、Unix、Linux等。数据703可以包括但不限于图文双向搜索过程中以及图像文本匹配模型的训练过程所生成的数据以及双向搜索结果等对应的数据等。The memory 70 may include one or more computer-readable storage media, and the computer non-volatile readable storage media may be non-transitory. The memory 70 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices. In some embodiments, the memory 70 may be an internal storage unit of the image-text bidirectional search device, such as a hard disk of a server. In other embodiments, the memory 70 may also be an external storage device of the image-text bidirectional search device, such as a plug-in hard disk equipped on a server, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. Further, the memory 70 may also include both an internal storage unit and an external storage device of the image-text bidirectional search device. The memory 70 may not only be used to store application software and various types of data installed in the image-text bidirectional search device, such as: the code of the program used and generated in the process of executing the image-text bidirectional search and the training process of the image-text matching model, but also be used to temporarily store data that has been output or is to be output. In this embodiment, the memory 70 is at least used to store the following computer program 701, wherein, after the computer program is loaded and executed by the processor 71, it can implement the relevant steps of the image-text bidirectional search method and the image-text matching model training method disclosed in any of the aforementioned embodiments. In addition, the resources stored in the memory 70 may also include an operating system 702 and data 703, etc., and the storage method may be temporary storage or permanent storage. Among them, the operating system 702 may include Windows, Unix, Linux, etc. The data 703 may include but is not limited to the data generated during the image-text bidirectional search process and the image-text matching model training process, as well as the data corresponding to the bidirectional search results, etc.
人机交互组件72可包括有显示屏、信息输入/信息输出接口如键盘或鼠标,显示屏、信息输入/信息输出接口属于用户接口,可选的用户接口还可以包括标准的有线接口、无线接口等。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。显示器也可以适当地称为显示屏或显示单元,用于显示在图文双向搜索设备中处理的信息以及用于显示可视化的用户界面。通信组件73可包括通信接口或者称为网络接口、通信总线等,通信接口可选的可以包括有线接口和/或无线接口,如WI-FI接口、蓝牙接口等,通常用于在图文双向搜索设备与其他设备之间建立通信连接。通信总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。在一些实施例中,上述图文双向搜索设备还可包括电源74以及实现各类功能的传感器75。本领域技术人员可以理解,图7中示出的结构并不构成对该图文双向搜索设备的限定,可以包括比图示更多或更少的组件。The human-computer interaction component 72 may include a display screen and an information input/output interface such as a keyboard or a mouse; the display screen and the information input/output interface belong to the user interface, and the optional user interface may also include a standard wired interface, a wireless interface, etc. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, etc. The display may also be appropriately referred to as a display screen or a display unit, and is used to display the information processed in the image-text bidirectional search device and to present a visual user interface. The communication component 73 may include a communication interface (also called a network interface), a communication bus, etc. The communication interface may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is usually used to establish a communication connection between the image-text bidirectional search device and other devices. The communication bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is drawn in FIG. 7, but this does not mean that there is only one bus or one type of bus. In some embodiments, the image-text bidirectional search device may further include a power supply 74 and sensors 75 implementing various functions. Those skilled in the art will appreciate that the structure shown in FIG. 7 does not limit the image-text bidirectional search device, which may include more or fewer components than shown.
进一步的,本实施例中并不对图文双向搜索设备的数量进行限定,图文双向搜索方法和/或图像文本匹配模型的训练方法可以由多个图文双向搜索设备共同协作完成。在一种可能的实施方式中,请参考图8,图8为本申请实施例提供的另一种图文双向搜索方法和/或图像文本匹配模型的训练方法所适用的硬件组成框架示意图。由图8可知,该硬件组成框架可以包括:第一图文双向搜索设备81和第二图文双向搜索设备82,二者之间通过网络连接。Furthermore, this embodiment does not limit the number of image-text bidirectional search devices: the image-text bidirectional search method and/or the image-text matching model training method may be completed cooperatively by multiple image-text bidirectional search devices. In a possible implementation, please refer to Figure 8, which is a schematic diagram of another hardware composition framework applicable to the image-text bidirectional search method and/or the image-text matching model training method provided in an embodiment of the present application. As can be seen from Figure 8, the hardware composition framework may include: a first image-text bidirectional search device 81 and a second image-text bidirectional search device 82, which are connected via a network.
在本申请实施例中,第一图文双向搜索设备81和第二图文双向搜索设备82的硬件结构可以参考图7中电子设备。即可以理解为本实施例中具有两个电子设备,两者进行数据交互。可将如图9所示的训练好的图文双向搜索模型预部署在任何一台设备中,进一步,本申请实施例中并不对网络的形式进行限定,即,网络可以是无线网络(如WIFI、蓝牙等),也可以是有线网络。In the embodiment of the present application, the hardware structure of the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can refer to the electronic device in FIG7. That is, it can be understood that there are two electronic devices in this embodiment, and the two exchange data. The trained image-text bidirectional search model shown in FIG9 can be pre-deployed in any device. Further, the embodiment of the present application does not limit the form of the network, that is, the network can be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.
其中,第一图文双向搜索设备81和第二图文双向搜索设备82可以是同一种电子设备,如第一图文双向搜索设备81和第二图文双向搜索设备82均为服务器;也可以是不同类型的电子设备,例如,第一图文双向搜索设备81可以是智能手机或其它智能终端,第二图文双向搜索设备82可以是服务器。在该种实施方式中,为了提高整体性能,可将模型训练过程以及训练好的图文双向搜索模型预部署计算性能高的那端。也即可以利用计算能力强的服务器作为第二图文双向搜索设备82来提高数据处理效率及可靠性,进而提高模型训练和/或图文双向检索的处理效率。同时利用成本低,应用范围广的智能手机作为第一图文双向搜索设备81,用于实现第二图文双向搜索设备82与用户之间的交互。可以理解的是,该交互过程例如可以为:智能手机从服务器处获取训练样本集,并获取训练样本集的标签,将这些标签发送至服务器,由服务器利用获取到的标签进行后续的模型训练步骤。服务器在生成图文双向搜索模型后,获取智能手机发送的搜索请求,搜索请求为用户下发的,且携带待搜索数据,服务器在获取到该搜索请求后,通过解析搜索请求确定待搜索数据,并调用图文双向搜索模型对待搜索数据进行相应处理,得到相应的搜索结果,同时将搜索结果反馈至第一图文双向搜索设备81。Among them, the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 can be the same electronic device, such as the first image-text bidirectional search device 81 and the second image-text bidirectional search device 82 are both servers; they can also be different types of electronic devices, for example, the first image-text bidirectional search device 81 can be a smart phone or other smart terminal, and the second image-text bidirectional search device 82 can be a server. In this embodiment, in order to improve the overall performance, the model training process and the trained image-text bidirectional search model can be pre-deployed on the end with high computing performance. That is, a server with strong computing power can be used as the second image-text bidirectional search device 82 to improve data processing efficiency and reliability, thereby improving the processing efficiency of model training and/or image-text bidirectional retrieval. At the same time, a low-cost and widely used smart phone is used as the first image-text bidirectional search device 81 to realize the interaction between the second image-text bidirectional search device 82 and the user. It can be understood that the interaction process can be, for example, that the smart phone obtains a training sample set from the server, obtains the labels of the training sample set, sends these labels to the server, and the server uses the obtained labels to perform subsequent model training steps. After generating the image-text bidirectional search model, the server obtains the search request sent by the smart phone. The search request is issued by the user and carries the data to be searched. After obtaining the search request, the server determines the data to be searched by parsing the search request, and calls the image-text bidirectional search model to perform corresponding processing on the data to be searched, obtains the corresponding search results, and feeds back the search results to the first image-text bidirectional search device 81.
本申请实施例图文双向搜索设备的各功能模块的功能可根据上述方法实施例中的方法实现,其实现过程可以参照上述方法实施例的相关描述,此处不再赘述。The functions of the various functional modules of the image-text bidirectional search device in the embodiment of the present application can be implemented according to the method in the above method embodiment, and the implementation process can refer to the relevant description of the above method embodiment, which will not be repeated here.
由上可知,本申请实施例可有效提升图像数据和文本数据之间的双向搜索精度。It can be seen from the above that the embodiments of the present application can effectively improve the accuracy of two-way search between image data and text data.
可以理解的是,如果上述实施例中的图文双向搜索方法以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机非易失性可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个非易失性存储介质中,执行本申请各个实施例方法的全部或部分步骤。而前述的非易失性存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、电可擦除可编程ROM、寄存器、硬盘、多媒体卡、卡型存储器(例如SD或DX存储器等)、磁性存储器、可移动磁盘、CD-ROM、磁碟或者光盘等各种可以存储程序代码的介质。It is understandable that if the image-text bidirectional search method in the above embodiment is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a non-volatile storage medium to execute all or part of the steps of the various embodiments of the present application. The aforementioned non-volatile storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, register, hard disk, multimedia card, card-type memory (such as SD or DX memory, etc.), magnetic memory, removable disk, CD-ROM, disk or optical disk, etc. Various media that can store program codes.
基于此,本申请实施例还提供了一种非易失性可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时如上任意一实施例所述图文双向搜索方法的步骤。Based on this, an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method described in any of the above embodiments are performed.
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的硬件包括装置及设备而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。In this specification, each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the embodiments can be referred to each other. As for the hardware disclosed in the embodiments, including devices and equipment, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part description.
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Professionals may further appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in the above description according to function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals and technicians may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
以上对本申请所提供的一种图文双向搜索方法、装置、设备及非易失性可读存储介质进行了详细介绍。本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。The above is a detailed introduction to a method, device, equipment and non-volatile readable storage medium for bidirectional search of images and texts provided by the present application. Specific examples are used herein to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method and core idea of the present application. It should be pointed out that for ordinary technicians in this technical field, without departing from the principles of the present application, several improvements and modifications can be made to the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.

Claims (21)

  1. 一种图文双向搜索方法,其特征在于,包括:A method for bidirectional search of images and texts, characterized by comprising:
    预先训练图文双向搜索模型;所述图文双向搜索模型包括文本异质图网络、图像异质图网络和图像识别网络;Pre-training a bidirectional image-text search model; the bidirectional image-text search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network;
    调用所述图像识别网络,获取待搜索图像的每张子图像所包含的目标图像块的目标识别特征;Calling the image recognition network to obtain target recognition features of the target image blocks contained in each sub-image of the image to be searched;
    基于所述文本异质图网络,获取仅包含一类目标文本数据的待搜索文本的文本特征;所述目标文本数据对应的目标文本特征包括所述目标识别特征;所述目标识别特征和所述目标文本特征为所述文本异质图网络的节点特征,所述文本异质图网络的连接边由所述目标识别特征与所述目标文本特征间的包含关系确定;Based on the text heterogeneous graph network, text features of the text to be searched that only contain one type of target text data are obtained; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
    基于所述图像异质图网络,获取包括一组子图像的待搜索图像的图像特征;所述待搜索图像的原始图像特征和所述目标识别特征作为所述图像异质图网络的节点特征,所述图像异质图网络的连接边由所述目标识别特征和所述原始图像特征之间的关联关系确定;Based on the image heterogeneous graph network, image features of an image to be searched including a group of sub-images are obtained; the original image features of the image to be searched and the target recognition features are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;
    将所述图像特征和所述文本特征输入至所述图文双向搜索模型,得到图文搜索结果。The image features and the text features are input into the image-text bidirectional search model to obtain image-text search results.
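By way of a reading aid only, the end-to-end flow recited in claim 1 can be sketched as follows. Every name in this sketch (recog_net, text_graph_net, image_graph_net, match_model, and the toy stand-ins) is a hypothetical placeholder, not the application's actual implementation:

```python
import numpy as np

def bidirectional_search(image, text, recog_net, text_graph_net,
                         image_graph_net, match_model):
    """Hypothetical sketch of the claim-1 flow; the four callables stand in
    for the trained networks of the bidirectional search model."""
    target_feats = recog_net(image)                    # target recognition features per sub-image
    text_feat = text_graph_net(text, target_feats)     # text heterogeneous graph -> text feature
    image_feat = image_graph_net(image, target_feats)  # image heterogeneous graph -> image feature
    return match_model(image_feat, text_feat)          # similarity score for the pair

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
recog = lambda img: rng.normal(size=(3, 8))            # 3 sub-images, 8-d features each
tgn = lambda txt, tf: rng.normal(size=8)
ign = lambda img, tf: rng.normal(size=8)
cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(bidirectional_search("img.png", "braise the pork", recog, tgn, ign, cosine))
```

In practice the score would be computed for every candidate in the gallery and used to rank images against a query text, or texts against a query image.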
  2. 根据权利要求1所述的图文双向搜索方法,其特征在于,所述预先训练图文双向搜索模型之后,还包括:The image-text bidirectional search method according to claim 1 is characterized in that after the image-text bidirectional search model is pre-trained, it also includes:
    响应文本拆分指令,将所述目标识别特征拆分为多个文本词组和/或文本单词,将所述目标文本数据拆分为多个文本语句;In response to the text splitting instruction, the target recognition feature is split into a plurality of text phrases and/or text words, and the target text data is split into a plurality of text sentences;
    将各文本词组和/或文本单词输入至预先训练好的文本特征提取模型中,得到多个第一类节点特征;Inputting each text phrase and/or text word into a pre-trained text feature extraction model to obtain a plurality of first-category node features;
    将各文本语句输入至所述文本特征提取模型中,得到多个第二类节点特征。Each text sentence is input into the text feature extraction model to obtain a plurality of second-category node features.
  3. 根据权利要求2所述的图文双向搜索方法,其特征在于,所述获取仅包含一类目标文本数据的待搜索文本的文本特征之前,还包括:The image-text bidirectional search method according to claim 2 is characterized in that, before obtaining the text features of the text to be searched that only contains one type of target text data, it also includes:
    搭建语言表征模型;所述语言表征模型包括文本信息输入层、特征提取层和文本特征输出层;所述特征提取层为基于转换器的双向编码器;Build a language representation model; the language representation model includes a text information input layer, a feature extraction layer and a text feature output layer; the feature extraction layer is a bidirectional encoder based on a converter;
    利用自然语言文本样本数据集训练所述语言表征模型,并将训练好的语言表征模型作为文本特征提取模型。The language representation model is trained using a natural language text sample data set, and the trained language representation model is used as a text feature extraction model.
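Claim 3's language representation model (a transformer-based bidirectional encoder, i.e. a BERT-style model) is commonly realized with an off-the-shelf encoder. A sketch using the Hugging Face transformers library, assuming a generic pretrained checkpoint ("bert-base-uncased") and mean pooling; the application instead trains its own model on a natural language text sample data set:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public checkpoint; the claimed method trains its own representation model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def text_feature(text: str) -> torch.Tensor:
    """Return one fixed-size feature vector for a phrase, word or sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # mean pooling is an assumption

print(text_feature("braise the pork").shape)           # torch.Size([768])
```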
  4. 根据权利要求2所述的图文双向搜索方法,其特征在于,所述将各文本语句输入至所述文本特征提取模型中,包括:The image-text bidirectional search method according to claim 2, characterized in that the inputting of each text sentence into the text feature extraction model comprises:
    将各文本语句以及每个文本语句中包含的各词组、各单词所在当前文本语句中的位置信息,输入至所述文本特征提取模型。The position information of each text sentence and each phrase and each word contained in each text sentence in the current text sentence is input into the text feature extraction model.
  5. 根据权利要求2所述的图文双向搜索方法,其特征在于,所述将各文本词组和/或文本单词输入至预先构建的文本特征提取模型中,得到多个第一类节点特征之前,以及所述将各文本语句输入至所述文本特征提取模型中,得到多个第二类节点特征之前,还包括:The image-text bidirectional search method according to claim 2 is characterized in that before inputting each text phrase and/or text word into a pre-built text feature extraction model to obtain a plurality of first-class node features, and before inputting each text sentence into the text feature extraction model to obtain a plurality of second-class node features, it further comprises:
    获取下一时刻输入至文本特征提取模型中的数据的数据类型,以将所述数据类型连同相应的数据一起输入至所述文本特征提取模型中;Acquire the data type of data to be input into the text feature extraction model at the next moment, so as to input the data type together with the corresponding data into the text feature extraction model;
    所述数据类型包括用于标识所述目标识别特征的第一标识,和用于标识所述目标文本数据的第二标识。The data type includes a first identifier for identifying the target identification feature and a second identifier for identifying the target text data.
  6. 根据权利要求2所述的图文双向搜索方法,其特征在于,所述文本异质图网络的连接边由所述目标识别特征与所述目标文本特征间的包含关系确定,包括:The image-text bidirectional search method according to claim 2 is characterized in that the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition feature and the target text feature, including:
    对所述目标识别特征中的每个文本词组或文本单词,依次遍历所述目标文本数据的每个文本语句;For each text phrase or text word in the target recognition feature, traverse each text sentence of the target text data in sequence;
    若当前文本语句所包含的目标词组与当前文本词组相同,则所述当前文本语句对应的第二类节点特征与所述当前文本词组对应的第一类节点特征具有连接关系;If the target phrase included in the current text sentence is the same as the current text phrase, then the second type of node feature corresponding to the current text sentence has a connection relationship with the first type of node feature corresponding to the current text phrase;
    若所述当前文本语句所包含的目标单词与当前文本单词相同,则所述当前文本语句对应的第二类节点特征与所述当前文本单词对应的第一类节点特征具有连接关系。If the target word included in the current text sentence is the same as the current text word, the second type of node feature corresponding to the current text sentence and the first type of node feature corresponding to the current text word have a connection relationship.
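The inclusion rule of claim 6 amounts to a simple bipartite edge construction. A minimal sketch, in which plain substring matching stands in for the claimed "same phrase/word" test and node indices identify the first-class (word/phrase) and second-class (sentence) nodes:

```python
def build_text_edges(words_and_phrases, sentences):
    """Connect a sentence node to a word/phrase node iff the sentence
    contains that token (the inclusion relation of claim 6)."""
    edges = []
    for w_idx, token in enumerate(words_and_phrases):   # first-class nodes
        for s_idx, sentence in enumerate(sentences):    # second-class nodes
            if token in sentence:                       # inclusion relation
                edges.append((w_idx, s_idx))
    return edges

# Toy example
tokens = ["pork", "soy sauce", "braise"]
sents = ["cut the pork into cubes", "braise the pork with soy sauce"]
print(build_text_edges(tokens, sents))
# [(0, 0), (0, 1), (1, 1), (2, 1)]
```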
  7. 根据权利要求1所述的图文双向搜索方法,其特征在于,所述调用所述图像识别网络,获取待搜索图像的每张子图像所包含的目标图像块的目标识别特征,包括:The image-text bidirectional search method according to claim 1 is characterized in that the calling of the image recognition network to obtain the target recognition features of the target image blocks contained in each sub-image of the image to be searched comprises:
预先利用在包含多张子图像的图像样本中标注相应目标识别特征的目标训练样本集,训练得到图像识别网络;training an image recognition network in advance by using a target training sample set in which corresponding target recognition features are annotated in image samples containing a plurality of sub-images;
    将所述待搜索图像输入至所述图像识别网络中,得到所述待搜索图像的每张子图像所包含的目标识别特征。The image to be searched is input into the image recognition network to obtain target recognition features contained in each sub-image of the image to be searched.
  8. 根据权利要求7所述的图文双向搜索方法,其特征在于,所述利用在包含多张子图像的图像样本中标注相应目标识别特征的目标训练样本集,训练得到图像识别网络之前,还包括:The image-text bidirectional search method according to claim 7 is characterized in that, before training the image recognition network using the target training sample set in which the corresponding target recognition features are annotated in the image sample containing the plurality of sub-images, the method further comprises:
    预先构建目标识别网络结构,所述目标识别网络结构包括输入层、卷积结构、池化层及分类器;Pre-constructing a target recognition network structure, wherein the target recognition network structure includes an input layer, a convolution structure, a pooling layer, and a classifier;
    所述卷积结构包括基础运算组件和残差运算组件;所述基础运算组件用于对输入图像依次进行卷积处理、正则化处理、激活函数处理及最大池化处理;所述残差运算组件包括多个相连的残差块,每个残差块均包括多层卷积层,用于对所述基础运算组件的输出特征进行卷积计算;The convolution structure includes a basic operation component and a residual operation component; the basic operation component is used to perform convolution processing, regularization processing, activation function processing and maximum pooling processing on the input image in sequence; the residual operation component includes a plurality of connected residual blocks, each residual block includes multiple convolution layers, which are used to perform convolution calculation on the output features of the basic operation component;
    所述池化层,用于将所述卷积结构的输出特征转化为目标特征向量,并输送至所述分类器;The pooling layer is used to convert the output features of the convolution structure into a target feature vector and transmit it to the classifier;
所述分类器,用于对所述目标特征向量进行计算,并输出所属类别标签的概率。The classifier is used to perform computation on the target feature vector and output the probability of the class label to which it belongs.
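A compact sketch of the network structure recited in claim 8, using PyTorch; the channel counts, kernel sizes and number of residual blocks are illustrative assumptions rather than values taken from the application:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two convolution layers with an identity skip connection, as in claim 8.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class TargetRecognitionNet(nn.Module):
    # Hypothetical layout: basic component -> residual component -> pooling -> classifier.
    def __init__(self, num_classes=10, ch=32):
        super().__init__()
        self.basic = nn.Sequential(              # convolution, normalization, activation, max pooling
            nn.Conv2d(3, ch, 7, stride=2, padding=3),
            nn.BatchNorm2d(ch), nn.ReLU(), nn.MaxPool2d(2))
        self.residual = nn.Sequential(ResidualBlock(ch), ResidualBlock(ch))
        self.pool = nn.AdaptiveAvgPool2d(1)      # output features -> target feature vector
        self.classifier = nn.Linear(ch, num_classes)
    def forward(self, x):
        x = self.residual(self.basic(x))
        v = self.pool(x).flatten(1)
        return torch.softmax(self.classifier(v), dim=1)  # class-label probabilities

print(TargetRecognitionNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```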
  9. 根据权利要求1所述的图文双向搜索方法,其特征在于,所述文本异质图网络包括多层第一图注意力网络,每一层第一图注意力网络之后还集成第一全连接层;所述获取仅包含一类目标文本数据的待搜索文本的文本特征,包括:The image-text bidirectional search method according to claim 1, characterized in that the text heterogeneous graph network includes multiple layers of first graph attention networks, and each layer of the first graph attention network is further integrated with a first fully connected layer; the step of obtaining text features of the text to be searched that contains only one type of target text data includes:
    对所述文本异质图网络的各第一图注意力网络的每个文本异质节点,根据当前文本异质节点与其余各文本异质节点之间是否具有连接关系以及各文本异质节点之间的关联关系,更新所述当前文本异质节点的节点特征;For each text heterogeneous node of each first graph attention network of the text heterogeneous graph network, according to whether there is a connection relationship between the current text heterogeneous node and the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes, updating the node feature of the current text heterogeneous node;
    基于更新后的文本异质图网络的每个文本异质节点的节点特征,生成所述待搜索文本的文本特征。Based on the node features of each text heterogeneous node in the updated text heterogeneous graph network, the text features of the text to be searched are generated.
  10. 根据权利要求9所述的图文双向搜索方法,其特征在于,所述根据当前文本异质节点与其余各文本异质节点之间是否具有连接关系以及各文本异质节点之间的关联关系,更新所述当前文本异质节点的节点特征,包括:The image-text bidirectional search method according to claim 9 is characterized in that the updating of the node feature of the current text heterogeneous node according to whether the current text heterogeneous node has a connection relationship with the remaining text heterogeneous nodes and the association relationship between the text heterogeneous nodes comprises:
    确定与所述当前文本异质节点具有相连关系、且不为同一节点类型的目标文本异质节点;Determine a target text heterogeneous node that is connected to the current text heterogeneous node and is not of the same node type;
    基于所述当前文本异质节点的节点特征与各目标文本异质节点的节点特征之间的关联关系,计算所述当前文本异质节点与每个目标文本异质节点的初始权重值,并根据各初始权重值确定所述当前文本异质节点的权重值;Based on the correlation between the node feature of the current text heterogeneous node and the node features of each target text heterogeneous node, the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated, and the weight value of the current text heterogeneous node is determined according to each initial weight value;
    基于所述权重值和各目标文本异质节点,对所述当前文本异质节点进行节点特征更新,并将所述当前文本异质节点更新后的节点特征和更新前的节点特征之和作为所述当前文本异质节点的节点特征。Based on the weight value and each target text heterogeneous node, the node feature of the current text heterogeneous node is updated, and the sum of the node feature after the update and the node feature before the update of the current text heterogeneous node is used as the node feature of the current text heterogeneous node.
  11. 根据权利要求10所述的图文双向搜索方法,其特征在于,所述基于所述当前文本异质节点的节点特征与各目标文本异质节点的节点特征之间的关联关系,计算所述当前文本异质节点与每个目标文本异质节点的初始权重值,包括:The image-text bidirectional search method according to claim 10 is characterized in that the initial weight value of the current text heterogeneous node and each target text heterogeneous node is calculated based on the association relationship between the node feature of the current text heterogeneous node and the node feature of each target text heterogeneous node, including:
    调用权重计算关系式分别计算所述当前文本异质节点与每个目标文本异质节点的初始权重值;所述权重计算关系式为:The weight calculation formula is called to calculate the initial weight values of the current text heterogeneous node and each target text heterogeneous node respectively; the weight calculation formula is:
    $z_{qp} = \mathrm{LeakyReLU}\left(W_a\left[W_b h_q \,\Vert\, W_c h_p\right]\right)$
    其中,$z_{qp}$为第q个文本异质节点与第p个文本异质节点的初始权重值,LeakyReLU(·)为激活函数,$W_a$、$W_b$、$W_c$为已知的权重矩阵,$h_q$为第q个文本异质节点的节点特征,$h_p$为第p个文本异质节点的节点特征。Where $z_{qp}$ is the initial weight value between the q-th text heterogeneous node and the p-th text heterogeneous node, LeakyReLU(·) is the activation function, $W_a$, $W_b$ and $W_c$ are known weight matrices, $h_q$ is the node feature of the q-th text heterogeneous node, and $h_p$ is the node feature of the p-th text heterogeneous node.
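Under one plausible reading of the claim-11 relation (projection of the two node features, concatenation, then a LeakyReLU-activated linear map; the matrix shapes are assumptions, since the published formula images give no dimensions), the initial weight can be computed as:

```python
import numpy as np

def initial_weight(h_q, h_p, W_a, W_b, W_c, slope=0.2):
    """Hypothetical reading of claim 11:
    z_qp = LeakyReLU(W_a [W_b h_q ; W_c h_p]); square d x d projections and a
    1 x 2d output map are assumed here."""
    concat = np.concatenate([W_b @ h_q, W_c @ h_p])   # project, then concatenate
    z = W_a @ concat                                  # scalar attention logit
    return np.where(z > 0, z, slope * z)              # LeakyReLU activation

d = 4
rng = np.random.default_rng(1)
h_q, h_p = rng.normal(size=d), rng.normal(size=d)
W_b, W_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_a = rng.normal(size=(2 * d,))
print(float(initial_weight(h_q, h_p, W_a, W_b, W_c)))
```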
  12. 根据权利要求10所述的图文双向搜索方法,其特征在于,所述基于所述权重值和各目标文本异质节点,对所述当前文本异质节点进行节点特征更新,包括:The image-text bidirectional search method according to claim 10 is characterized in that the updating of node features of the current text heterogeneous nodes based on the weight value and each target text heterogeneous node comprises:
    调用初次更新关系式,对所述当前文本异质节点的节点特征进行更新;所述初次更新关系式为:The initial update relational expression is called to update the node features of the heterogeneous nodes of the current text; the initial update relational expression is:
    $\tilde{h}_q = \sigma\left(\sum_{p=1}^{N_P} a_{qp} W_v h_p\right)$
    式中,$\tilde{h}_q$为第q个文本异质节点更新后的节点特征,σ为超参数,$a_{qp}$为第q个文本异质节点与第p个目标文本异质节点之间归一化后的权重,$W_v$为已知的权重矩阵,$h_p$为第p个文本异质节点的节点特征,$N_P$为目标文本异质节点总数。In the formula, $\tilde{h}_q$ is the updated node feature of the q-th text heterogeneous node, σ is a hyperparameter, $a_{qp}$ is the normalized weight between the q-th text heterogeneous node and the p-th target text heterogeneous node, $W_v$ is a known weight matrix, $h_p$ is the node feature of the p-th text heterogeneous node, and $N_P$ is the total number of target text heterogeneous nodes.
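A minimal sketch of the claim-10/claim-12 update, assuming softmax normalization of the initial weights and tanh as a stand-in for the unspecified σ; the residual addition mirrors the claimed sum of the updated and pre-update node features:

```python
import numpy as np

def update_node(h_q, neighbor_feats, weights, W_v, act=np.tanh):
    """Aggregate projected neighbor features with normalized attention
    weights a_qp, then add the old feature back (residual update).
    Softmax normalization and tanh are assumptions, not claimed choices."""
    a = np.exp(weights) / np.exp(weights).sum()          # normalized a_qp
    agg = sum(a_p * (W_v @ h_p) for a_p, h_p in zip(a, neighbor_feats))
    return h_q + act(agg)                                # residual node-feature update

d = 4
rng = np.random.default_rng(2)
h_q = rng.normal(size=d)
neighbors = [rng.normal(size=d) for _ in range(3)]       # N_P = 3 target nodes
z = rng.normal(size=3)                                   # initial weights z_qp from claim 11
W_v = rng.normal(size=(d, d))
print(update_node(h_q, neighbors, z, W_v))
```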
  13. 根据权利要求1至12任意一项所述的图文双向搜索方法,其特征在于,所述目标文本数据对应的各第二类节点特征之间具有先后执行顺序,所述基于所述文本异质图网络,获取仅包含一类目标文本数据的待搜索文本的文本特征之后,还包括:The image-text bidirectional search method according to any one of claims 1 to 12 is characterized in that the second-class node features corresponding to the target text data have a sequential execution order, and after obtaining the text features of the text to be searched containing only one class of target text data based on the text heterogeneous graph network, the method further comprises:
    将各第二类节点特征以及顺序信息,输入至预先训练好的时序特征提取模型中,得到时序信息特征;Input each second-category node feature and sequence information into a pre-trained time series feature extraction model to obtain time series information features;
    将所述时序信息特征,通过全连接层映射至所述文本特征中。The temporal information features are mapped to the text features through a fully connected layer.
  14. 根据权利要求13所述的图文双向搜索方法,其特征在于,所述将各第二类节点特征以及顺序信息,输入至预先训练好的时序特征提取模型,得到时序信息特征,包括:The image-text bidirectional search method according to claim 13 is characterized in that the step of inputting each second-type node feature and sequence information into a pre-trained temporal feature extraction model to obtain temporal information features comprises:
    基于各第二类节点特征之间的先后顺序,依次将各第二类节点特征按照顺序和逆序输入至双向长短期记忆神经网络,得到各第二类节点特征的时序编码特征;Based on the sequence between the features of each second-category node, the features of each second-category node are input into the bidirectional long short-term memory neural network in sequence and reverse order to obtain the temporal coding features of each second-category node feature;
    根据每个第二类节点特征的时序编码特征确定时序信息特征。The timing information feature is determined according to the temporal coding feature of each second-category node feature.
  15. 根据权利要求14所述的图文双向搜索方法,其特征在于,所述基于各第二类节点特征之间的先后顺序,依次将各第二类节点特征按照顺序和逆序输入至双向长短期记忆神经网络,得到各第二类节点特征的时序编码特征,包括:The image-text bidirectional search method according to claim 14, characterized in that, based on the sequence between the second-type node features, the second-type node features are sequentially and reversely input into the bidirectional long short-term memory neural network to obtain the temporal coding features of the second-type node features, including:
    对每一个第二类节点特征,调用正序编码关系式,对当前第二类节点特征进行正序编码,得到正序编码特征;所述正序编码关系式为:For each second-category node feature, the positive sequence coding relational expression is called to perform positive sequence coding on the current second-category node feature to obtain a positive sequence coding feature; the positive sequence coding relational expression is:
    $\overrightarrow{h}_q = \overrightarrow{\mathrm{LSTM}}\left(s_q^T, \overrightarrow{h}_{q-1}\right)$
    调用倒序编码关系式,对所述当前第二类节点特征进行倒序编码,得到倒序编码特征;所述倒序编码关系式为:Calling the reverse-order coding relational expression to perform reverse-order coding on the current second-category node feature to obtain a reverse-order coding feature; the reverse-order coding relational expression is:
    $\overleftarrow{h}_q = \overleftarrow{\mathrm{LSTM}}\left(s_q^T, \overleftarrow{h}_{q+1}\right)$
    将所述正序编码特征和所述倒序编码特征作为所述当前第二类节点特征的时序编码特征;Using the forward-order coding feature and the reverse-order coding feature as the temporal coding features of the current second-category node feature;
    式中,q∈[1,Q],$\overrightarrow{h}_q$为所述双向长短期记忆神经网络的正向编码方向的第q个单元的输出,$s_q^T$为所述文本异质图网络中第T层图注意力网络的第q个第二类节点特征,$\overrightarrow{h}_{q-1}$为正向编码方向的第q-1个单元的输出,Q为第二类节点特征总数,$\overleftarrow{h}_q$为倒向编码方向的第q个单元的输出,$\overleftarrow{h}_{q+1}$为倒向编码方向的第q+1个单元的输出,$\overleftarrow{\mathrm{LSTM}}$为所述双向长短期记忆神经网络的倒向编码函数,$\overrightarrow{\mathrm{LSTM}}$为所述双向长短期记忆神经网络的正向编码函数。In the formulas, q∈[1,Q], $\overrightarrow{h}_q$ is the output of the q-th unit in the forward encoding direction of the bidirectional long short-term memory neural network, $s_q^T$ is the q-th second-category node feature of the T-th layer graph attention network in the text heterogeneous graph network, $\overrightarrow{h}_{q-1}$ is the output of the (q-1)-th unit in the forward encoding direction, Q is the total number of second-category node features, $\overleftarrow{h}_q$ is the output of the q-th unit in the backward encoding direction, $\overleftarrow{h}_{q+1}$ is the output of the (q+1)-th unit in the backward encoding direction, $\overleftarrow{\mathrm{LSTM}}$ is the backward encoding function of the bidirectional long short-term memory neural network, and $\overrightarrow{\mathrm{LSTM}}$ is its forward encoding function.
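The forward/reverse encoding of claims 14 and 15 corresponds to a standard bidirectional LSTM pass. A sketch, assuming PyTorch's nn.LSTM and mean pooling as one possible way to reduce the per-step codes to a single timing information feature (the claims do not specify the reduction):

```python
import torch
import torch.nn as nn

# Feed the second-class node features through a bidirectional LSTM in forward
# and reverse order; each step's forward and backward hidden states together
# form its temporal coding feature.
d, Q = 8, 5                                   # feature size, number of sentence nodes
feats = torch.randn(1, Q, d)                  # s_1 ... s_Q in their original order
bilstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
out, _ = bilstm(feats)                        # out[:, q] = [h_fwd_q ; h_bwd_q]
fwd, bwd = out[..., :d], out[..., d:]         # forward-order and reverse-order encodings
temporal_feature = out.mean(dim=1)            # pooled timing feature (pooling is an assumption)
print(fwd.shape, bwd.shape, temporal_feature.shape)
# torch.Size([1, 5, 8]) torch.Size([1, 5, 8]) torch.Size([1, 16])
```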
  16. 根据权利要求1所述的图文双向搜索方法,其特征在于,所述图像异质图网络包括多层第二图注意力网络,每一层第二图注意力网络之后还集成第二全连接层;所述获取包括一组子图像的待搜索图像的图像特征,包括:The image-text bidirectional search method according to claim 1, characterized in that the image heterogeneous graph network includes multiple layers of second graph attention networks, and each layer of the second graph attention network is followed by a second fully connected layer; the step of obtaining image features of an image to be searched including a group of sub-images comprises:
    将所述待搜索图像输入至预先训练好的图像特征提取模型,得到所述待搜索图像的原始图像特征;Inputting the image to be searched into a pre-trained image feature extraction model to obtain original image features of the image to be searched;
    对所述图像异质图网络的各第二图注意力网络的每个图像异质节点,根据当前图像异质节点与其余各图像异质节点之间是否具有连接关系以及各图像异质节点之间的关联关系,更新所述当前图像异质节点的节点特征;For each image heterogeneous node of each second graph attention network of the image heterogeneous graph network, according to whether there is a connection relationship between the current image heterogeneous node and the remaining image heterogeneous nodes and the association relationship between the image heterogeneous nodes, updating the node feature of the current image heterogeneous node;
    基于更新后的图像异质图网络的每个图像异质节点的节点特征,生成所述待搜索图像的图像编码特征;Generate image encoding features of the image to be searched based on the node features of each image heterogeneous node of the updated image heterogeneous graph network;
    将所述图像编码特征输入至预先训练好的图像特征生成模型,得到所述待搜索图像的图像特征。The image coding features are input into a pre-trained image feature generation model to obtain the image features of the image to be searched.
  17. 一种图像文本匹配模型的训练方法,其特征在于,包括:A training method for an image-text matching model, comprising:
    预先搭建图文双向搜索模型;Pre-build a bidirectional image and text search model;
    对训练样本集的每组训练样本,分别获取当前组训练样本中的图像样本的原始图像特征、目标识别特征、图像特征和文本样本的目标文本特征、文本特征;所述目标文本特征包括所述目标识别特征;所述图像样本包括一组子图像;For each group of training samples in the training sample set, original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples are obtained respectively; the target text features include the target recognition features; the image samples include a group of sub-images;
    基于将所述目标识别特征和所述目标文本特征分别作为文本异质节点特征,并根据所述目标识别特征与所述目标文本特征间的包含关系确定连接边,构建所述图文双向搜索模型的文本异质图网络;Based on taking the target recognition feature and the target text feature as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition feature and the target text feature, a text heterogeneous graph network of the image-text bidirectional search model is constructed;
    基于将所述原始图像特征和所述目标识别特征分别作为图像异质节点特征,并根据所述目标识别特征与所述原始图像特征间的关联关系确定连接边,构建所述图文双向搜索模型的图像异质图网络;Based on taking the original image features and the target recognition features as image heterogeneous node features respectively, and determining the connection edge according to the correlation relationship between the target recognition features and the original image features, an image heterogeneous graph network of the image-text bidirectional search model is constructed;
    将每组训练样本的图像特征输入所述图像异质图网络、文本特征输入至所述文本异质图网络中,训练所述图文双向搜索模型。The image features of each group of training samples are input into the image heterogeneous graph network, and the text features are input into the text heterogeneous graph network to train the image-text bidirectional search model.
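A hedged sketch of the final training step of claim 17, assuming the model exposes the two heterogeneous graph networks behind a single callable returning a pairwise similarity matrix, and assuming a symmetric cross-entropy (contrastive) loss, which the claim itself does not specify:

```python
import torch

def train_matching_model(model, loader, epochs=1, lr=1e-4):
    """Optimize a matching loss over batches of (image features, text features).
    `model` is an assumed interface wrapping the image and text heterogeneous
    graph networks; matched pairs are assumed to share a batch index."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for img_feats, txt_feats in loader:
            sim = model(img_feats, txt_feats)            # pairwise similarity scores
            target = torch.arange(sim.size(0))           # matched pairs on the diagonal
            # Symmetric cross-entropy over both search directions (an assumption)
            loss = (torch.nn.functional.cross_entropy(sim, target) +
                    torch.nn.functional.cross_entropy(sim.t(), target)) / 2
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```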
  18. 一种图文双向搜索装置,其特征在于,包括:A bidirectional image and text search device, characterized by comprising:
    图像识别模块,用于调用预先训练好的图文双向搜索模型的图像识别网络,获取待搜索图像的每张子图像所包含的目标图像块的目标识别特征;An image recognition module is used to call a pre-trained image recognition network of a bidirectional image-text search model to obtain target recognition features of a target image block contained in each sub-image of the image to be searched;
    文本特征提取模块,用于基于所述图文双向搜索模型的文本异质图网络,获取仅包含一类目标文本数据的待搜索文本的文本特征;所述目标文本数据对应的目标文本特征包括所述目标识别特征;所述目标识别特征和所述目标文本特征为所述文本异质图网络的节点特征,所述文本异质图网络的连接边由所述目标识别特征与所述目标文本特征间的包含关系确定;A text feature extraction module is used to obtain text features of a text to be searched that only contains one type of target text data based on the text heterogeneous graph network of the text-image bidirectional search model; the target text features corresponding to the target text data include the target recognition features; the target recognition features and the target text features are node features of the text heterogeneous graph network, and the connection edges of the text heterogeneous graph network are determined by the inclusion relationship between the target recognition features and the target text features;
    图像特征提取模块,用于基于所述图文双向搜索模型的图像异质图网络,获取包括一组子图像的待搜索图像的图像特征;所述待搜索图像的原始图像特征和所述目标识别特征作为所述图像异质图网络的节点特征,所述图像异质图网络的连接边由所述目标识别特征和所述原始图像特征之间的关联关系确定;An image feature extraction module is used to obtain image features of an image to be searched including a group of sub-images based on the image heterogeneous graph network of the image-text bidirectional search model; the original image features of the image to be searched and the target recognition features are used as node features of the image heterogeneous graph network, and the connection edges of the image heterogeneous graph network are determined by the association relationship between the target recognition features and the original image features;
    双向搜索模块,用于将所述图像特征和所述文本特征输入至预先训练好的图文双向搜索模型,得到图文搜索结果;所述图文双向搜索模型包括文本异质图网络、图像异质图网络和图像识别网络。The bidirectional search module is used to input the image features and the text features into a pre-trained image-text bidirectional search model to obtain image-text search results; the image-text bidirectional search model includes a text heterogeneous graph network, an image heterogeneous graph network and an image recognition network.
  19. 一种图像文本匹配模型的训练装置,其特征在于,包括:A training device for an image-text matching model, comprising:
    特征提取模块,用于对训练样本集的每组训练样本,分别获取当前组训练样本中的图像样本的原始图像特征、目标识别特征、图像特征和文本样本的目标文本特征、文本特征;所述目标文本特征包括所述目标识别特征;所述图像样本包括一组子图像;A feature extraction module is used to obtain, for each group of training samples in the training sample set, original image features, target recognition features, image features of image samples in the current group of training samples and target text features and text features of text samples; the target text features include the target recognition features; the image samples include a group of sub-images;
    模型搭建模块,用于预先搭建图文双向搜索模型;基于将所述目标识别特征和所述目标文本特征分别作为文本异质节点特征,并根据所述目标识别特征与所述目标文本特征间的包含关系确定连接边,构建所述图文双向搜索模型的文本异质图网络;基于将所述原始图像特征和所述目标识别特征分别作为图像异质节点特征,并根据各目标识别特征与所述原始图像特征间的关联关系确定连接边,构建所述图文双向搜索模型的图像异质图网络;A model building module is used to pre-build a bidirectional image-text search model; based on taking the target recognition feature and the target text feature as text heterogeneous node features respectively, and determining the connection edge according to the inclusion relationship between the target recognition feature and the target text feature, a text heterogeneous graph network of the bidirectional image-text search model is constructed; based on taking the original image feature and the target recognition feature as image heterogeneous node features respectively, and determining the connection edge according to the association relationship between each target recognition feature and the original image feature, an image heterogeneous graph network of the bidirectional image-text search model is constructed;
    模型训练模块,用于将每组训练样本的图像特征输入所述图像异质图网络、文本特征输入至所述文本异质图网络中,训练所述图文双向搜索模型。The model training module is used to input the image features of each group of training samples into the image heterogeneous graph network and the text features into the text heterogeneous graph network to train the image-text bidirectional search model.
  20. 一种图文双向搜索设备,其特征在于,包括处理器、存储器、人机交互组件以及通信组件;A bidirectional image and text search device, characterized by comprising a processor, a memory, a human-computer interaction component and a communication component;
    所述人机交互组件用于通过信息输入/信息输出接口,接收用户输入的训练样本集选择请求、模型训练请求、搜索请求以及向用户展示图文搜索结果;The human-computer interaction component is used to receive a training sample set selection request, a model training request, a search request input by a user through an information input/information output interface, and to display graphic and text search results to the user;
    所述通信组件用于传输图像文本匹配模型的训练过程中以及图文双向搜索任务执行过程中的数据及指令;The communication component is used to transmit data and instructions during the training process of the image-text matching model and the execution process of the image-text bidirectional search task;
    所述处理器用于执行所述存储器中存储的计算机程序时实现如权利要求1至16任一项所述图文双向搜索方法和/或如权利要求17所述图像文本匹配模型的训练方法的步骤。The processor is used to implement the steps of the image-text bidirectional search method as described in any one of claims 1 to 16 and/or the image-text matching model training method as described in claim 17 when executing the computer program stored in the memory.
  21. 一种非易失性可读存储介质,其特征在于,所述非易失性可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至16任一项所述图文双向搜索方法和/或如权利要求17所述图像文本匹配模型的训练方法的步骤。A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and when the computer program is executed by a processor, the steps of the image-text bidirectional search method as described in any one of claims 1 to 16 and/or the image-text matching model training method as described in claim 17 are implemented.
PCT/CN2022/142513 2022-11-08 2022-12-27 Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium WO2024098533A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211388778.5 2022-11-08
CN202211388778.5A CN115438215B (en) 2022-11-08 2022-11-08 Image-text bidirectional search and matching model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2024098533A1 true WO2024098533A1 (en) 2024-05-16

Family

ID=84252309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142513 WO2024098533A1 (en) 2022-11-08 2022-12-27 Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium

Country Status (2)

Country Link
CN (1) CN115438215B (en)
WO (1) WO2024098533A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438215B (en) * 2022-11-08 2023-04-18 苏州浪潮智能科技有限公司 Image-text bidirectional search and matching model training method, device, equipment and medium
CN115858848B (en) * 2023-02-27 2023-08-15 浪潮电子信息产业股份有限公司 Image-text mutual inspection method and device, training method and device, server and medium
CN116049459B (en) * 2023-03-30 2023-07-14 浪潮电子信息产业股份有限公司 Cross-modal mutual retrieval method, device, server and storage medium
CN116226434B (en) * 2023-05-04 2023-07-21 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116992167B (en) * 2023-09-22 2024-01-23 深圳市智慧城市科技发展集团有限公司 Address searching method, system and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127669A (en) * 2020-01-15 2021-07-16 百度在线网络技术(北京)有限公司 Advertisement matching method, device, equipment and storage medium
CN113902764A (en) * 2021-11-19 2022-01-07 东北大学 Semantic-based image-text cross-modal retrieval method
CN114821605A (en) * 2022-06-30 2022-07-29 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium
CN114896429A (en) * 2022-07-12 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual detection method, system, equipment and computer readable storage medium
CN114969405A (en) * 2022-04-30 2022-08-30 苏州浪潮智能科技有限公司 Cross-modal image-text mutual inspection method
CN115062208A (en) * 2022-05-30 2022-09-16 苏州浪潮智能科技有限公司 Data processing method and system and computer equipment
US20220343626A1 (en) * 2019-08-15 2022-10-27 Vision Semantics Limited Text Based Image Search
CN115438215A (en) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Image-text bidirectional search and matching model training method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7461073B2 (en) * 2006-02-14 2008-12-02 Microsoft Corporation Co-clustering objects of heterogeneous types
CN111985184A (en) * 2020-06-30 2020-11-24 上海翎腾智能科技有限公司 Auxiliary writing font copying method, system and device based on AI vision
CN113111154B (en) * 2021-06-11 2021-10-29 北京世纪好未来教育科技有限公司 Similarity evaluation method, answer search method, device, equipment and medium

Also Published As

Publication number Publication date
CN115438215B (en) 2023-04-18
CN115438215A (en) 2022-12-06
