CN114840705A - Combined commodity retrieval method and system based on multi-mode pre-training model - Google Patents

Combined commodity retrieval method and system based on multi-mode pre-training model

Info

Publication number
CN114840705A
CN114840705A (application CN202210453799.4A)
Authority
CN
China
Prior art keywords
entity
image
text
information
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210453799.4A
Other languages
Chinese (zh)
Other versions
CN114840705B (en)
Inventor
詹巽霖
吴洋鑫
董晓
梁小丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210453799.4A priority Critical patent/CN114840705B/en
Publication of CN114840705A publication Critical patent/CN114840705A/en
Application granted granted Critical
Publication of CN114840705B publication Critical patent/CN114840705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a combined commodity retrieval method and system based on a multi-modal pre-training model. The method uses information from both the picture and text modalities and also introduces entity information into the network training process, so that image features and text features can be effectively and deeply fused, retrieval features with higher discrimination are extracted, and the problem of incomplete single-modality information is solved. The method realizes instance-level retrieval of combined commodities, i.e. all the single commodities contained in a combined commodity are retrieved and returned, which improves commodity search precision and helps online users find more accurate and specific commodities. The method can also be used to construct an e-commerce knowledge graph and to mine commodity relations, and the commodity relations obtained by combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms.

Description

Combined commodity retrieval method and system based on multi-mode pre-training model
Technical Field
The invention relates to the field of example-level commodity retrieval, in particular to a combined commodity retrieval method and system based on a multi-mode pre-training model.
Background
With the rapid development of electronic commerce, the variety of commodities keeps growing and the demands of online customers are becoming larger and more diverse. In e-commerce, many commodities are presented as a package, i.e. multiple instances of different commodities appear in one image. However, a customer or merchant may wish to find a single product within such a product portfolio, for similar-item retrieval or online item recommendation, or to match the same product for price comparison.
In the field of commodity retrieval, existing methods take data of a single modality as input, such as a text or a picture, and then perform matching search in a retrieval library. However, in the e-commerce field, pictures and texts coexist widely in retrieval libraries, and because multi-modal data are not fully utilized, the current retrieval mode greatly limits real use scenarios.
In addition, existing methods mainly focus on relatively simple settings such as picture-level retrieval, which neither determines whether multiple objects exist in a commodity picture nor distinguishes between them; instance-level commodity retrieval, in contrast, retrieves every single commodity contained in a combined commodity, and there is little research on such retrieval methods. Existing methods also rely on annotated information for training and therefore lack generalization on large-scale real-scene data sets; as the heterogeneous data generated by shopping websites keep accumulating, it is difficult for multi-modal retrieval algorithms to handle such large-scale, weakly annotated data.
Combined commodity retrieval therefore has high practical value and application prospects in the e-commerce field. First, it helps improve commodity search precision and helps online users find more accurate and specific commodities; second, it can be used to construct an e-commerce knowledge graph and to mine commodity relations; third, the commodity relations obtained by combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms. In a real scenario with a large data volume and a lack of labels, how to perform multi-modal instance-level combined commodity retrieval is a problem of practical value that remains unsolved.
One prior patent first obtains interaction information of commodities to be recommended, where the interaction information includes an ID information set of the commodities to be recommended, a user ID information set, popularity information of the commodities to be recommended, and user-commodity interaction records, the popularity information representing the amount of interaction between users and commodities; the interaction information of the commodities to be recommended is then input into a trained commodity recommendation model to obtain recommendation results, where the trained recommendation model is built on a graph attention network and is trained with sample commodity ID information and sample user ID information labeled as negative samples, the negative samples being determined from sample commodity popularity information and preset hyper-parameters. However, that patent does not address the problem of low accuracy caused by commodity retrieval that relies on single-modal data and picture-level retrieval.
Disclosure of Invention
The invention provides a combined commodity retrieval method based on a multi-mode pre-training model, which solves the problem of low accuracy caused by commodity retrieval depending on single-mode data and picture level retrieval.
The invention further aims to provide a combined commodity retrieval system based on the multi-mode pre-training model.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a combined commodity retrieval method based on a multi-mode pre-training model comprises the following steps:
S1: for a combined commodity image and its title text, inputting the image into a pre-trained top-down attention model, and extracting the bounding-box positions and feature information of all single commodities in the image; obtaining entity information from the title text through an analysis tool;
S2: for the image features, the title text information and the entity information, extracting feature codes, position codes and segment codes respectively, and learning embedded representations as the input of the multi-modal pre-training model;
S3: inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three kinds of Transformers, progressively extracting their mutually fused retrieval features, and training with four self-supervision tasks;
S4: during training, constructing an entity graph enhancement module using the title text and the corresponding entity information, learning entity knowledge with real semantic information through the node-based arrangement loss and the subgraph-based arrangement loss, and enhancing the feature representation;
S5: inputting the image information and title text information of each single-commodity sample into the multi-modal model to extract retrieval features, and storing them in a retrieval library;
S6: for each combined commodity, inputting the image features, title text information and entity information into the pre-trained multi-modal model, and extracting the image-text fused retrieval features; calculating the cosine similarity between the fused features and each single commodity in the sample library, and returning the single commodity with the highest similarity as the retrieval result.
Further, in step S1, the process of extracting the bounding-box positions and feature information of all single commodities in the image with the top-down attention model comprises:
using a bottom-up attention model pre-trained on the Visual Genome (VG) data set as the target detector, inputting the combined commodity image, and extracting the bounding-box position information and bounding-box features of each single commodity as the image feature input of the combined commodity;
the process of obtaining entity information from the title text through the analysis tool comprises:
extracting a set of noun entities from the title text with an NLP analysis tool as the entity information input.
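For illustration only, the following sketch shows one way the inputs of step S1 could be gathered; the `detector` callable and the use of spaCy noun chunks are assumptions, since the invention does not prescribe a particular detector implementation or NLP analysis tool.

```python
# Illustrative sketch of step S1 (assumed interfaces, not the patented implementation).
# `detector` is assumed to wrap a bottom-up-attention style Faster R-CNN that returns,
# for one image, K bounding boxes and K region feature vectors.
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the unspecified NLP analysis tool

def extract_s1_inputs(image, title, detector):
    boxes, region_feats = detector(image)   # assumed shapes: (K, 4) and (K, D)
    # noun phrases of the product title serve as the entity set
    entities = [chunk.text for chunk in nlp(title).noun_chunks]
    return boxes, region_feats, entities
```

For a title such as "wooden dining table with four chairs", the entity set would contain noun phrases like "wooden dining table" and "four chairs".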
Further, the specific process of step S2 is:
S21: for the bounding-box positions and feature information, a 5-dimensional vector is used to encode the position information of each box, comprising the coordinates of the upper-left and lower-right corners of the box and the proportion of the box area to the whole image; the 5-dimensional vector is passed through a linear fully-connected layer to obtain the position code; the integer 0 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the box features are passed through a linear fully-connected layer to obtain the box feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the image modality;
S22: for the title text information, an increasing sequence of natural numbers is used to represent its position information and passed through a linear fully-connected layer to obtain the position code; the integer 1 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the title text is passed through a linear fully-connected layer to obtain the text feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the title text;
S23: for the entity information, an increasing sequence of natural numbers is used to represent its position information and passed through a linear fully-connected layer to obtain the position code; the integer 1 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the entity is passed through a linear fully-connected layer to obtain the entity feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the entity.
Further, in step S3, the process of inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three kinds of Transformers and progressively extracting their mutually fused retrieval features comprises:
1): an image/title-text/entity Transformer network extracts the shallow features of the image, the title text and the entity respectively, calculating attention weights with Q and K and multiplying by V to obtain the feature representation of each modality, where the entity and text Transformers share network parameters; this layer is repeated L times and the result is passed into the next network;
2): a cross Transformer network extracts the mutual-attention features between modalities; this layer contains three independent cross multi-head self-attention networks, realized by exchanging K and V between different modalities: the image cross Transformer computes attention weights over the text to obtain cross-attended image features; the title-text cross Transformer computes attention weights over the image and the entity to obtain cross-attended text features; the entity cross Transformer computes attention weights over the text to obtain cross-attended entity features; this layer is repeated K times and the result is passed into the next network;
3): a common Transformer network extracts the comprehensively fused features of the image, the title text and the entity: in the text-vision common Transformer, the title text features and the image features are concatenated, the weight with which each vector attends to all features is calculated with Q and K and multiplied by V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the title text-entity common Transformer, the title text features and the entity features are concatenated, the weight with which each vector attends to all features is calculated with Q and K and multiplied by V to obtain the feature representations of the title text and of the entity, where Q, K and V are likewise obtained from the concatenated features of the two modalities; for the image, the title text and the entity, this layer uses a multi-head attention mechanism to compute the attention weights of each over all features, obtaining fully fused features, and the layer is iterated H times.
Further, in step S3, the process of training with four self-supervision tasks comprises:
1) by masking words in the title text, the title text sequence with masked words is input into the multi-modal pre-training model, and the model learns to recover the masked words during training, so as to extract a feature representation carrying title text information;
2) by masking words in the entity information, the entity sequence with masked words is input into the multi-modal pre-training model, and the model learns to recover the masked words during training, so as to extract a feature representation carrying entity information;
3) by masking bounding-box features in the image, the image box feature sequence with masks is input into the pre-training model, and the model learns to recover the masked bounding-box features during training, so as to extract a feature representation carrying visual information;
4) the network is trained with a contrastive-learning loss function: for paired picture-title-text pairs, their distance is reduced during training; for unpaired picture-title-text pairs, their distance is enlarged during training, so as to learn discriminative image-text features.
Further, in step S4, the process of constructing the entity graph enhancement module with the title text and the corresponding entity information during training comprises:
S41: initializing an entity queue for the title text and the entity information extracted from it, encoding the entity queue through the common Transformer to obtain a joint embedded representation of the entity features, and inputting it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph; training with the node-based arrangement loss and the subgraph-based arrangement loss, so that each entity attends more to its semantic neighbours and a good node representation is obtained;
S42: in the node-based arrangement loss, for a mini-batch of data, negative samples are selected based on cosine similarity: for each entity of the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the arrangement loss is calculated with the selected negative sample;
S43: in the subgraph-based arrangement loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbour nodes, and the arrangement loss is calculated over the nearest-neighbour positive and negative subgraphs of any node.
A combined commodity retrieval system based on a multi-modal pre-trained model, comprising:
the multi-modal pre-training model module is used for extracting image-text fused retrieval features; the model is trained with three different self-attention network layers in a self-supervised manner so that it can extract multi-modally fused features;
the entity graph reinforcement learning module is used for constructing the relations between entity nodes to support more accurate instance-level retrieval; entity information is encoded with a self-attention network layer, subgraphs are divided with a graph clustering network, and the network is trained with the node-based arrangement loss and the subgraph-based arrangement loss to obtain a good semantic relation representation;
the multi-modal commodity retrieval module is used for extracting single-commodity features and storing them in the retrieval library, and for extracting the features of each target commodity of a combined commodity and retrieving them in the retrieval library: the detector module extracts the detection features of each single commodity, which are input into the multi-modal pre-training model module to extract retrieval features; the detector module also extracts the bounding box and bounding-box features of each target commodity in the combined product, the bounding-box regions and the title are input into the multi-modal pre-training model module to extract query features, the similarity between the query features and the retrieval features is computed, and the most similar commodity is returned.
Further, the multi-modal pre-training model module comprises:
the image, title text and entity embedded representation submodule is used for encoding the information of each modality as model input: the position coordinates of each image box and its area proportion are used as the position code and, together with the segment code and the detection features of the box, passed through linear layers to obtain the embedded representation of the image; the sequence number of each word in the title text and in the entities is used as the position code and, together with the segment code, passed through a linear layer to obtain the embedded representations of the title text and of the entities; the embedded representations of the image, the title text and the entities are then passed into the multi-head self-attention network;
the multi-head self-attention network submodule is used for extracting highly fused retrieval features of the image, the title text and the entities; three different multi-head self-attention networks are used to let the information of the modalities interact and to extract fully fused features;
the image-text-entity multi-modal pre-training submodule is used for training the multi-modal model to learn discriminative features; the learning of multi-modal fusion features is completed with four self-supervision tasks, comprising an image region masking task, a title text masking task, an entity masking task and image-text cross-modal contrastive learning.
Further, the entity graph reinforcement learning module comprises:
the entity graph construction submodule is used for constructing the node graph and dividing subgraphs for the given entity information; for an initialized entity queue, the queue is encoded through the common Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph;
the entity graph semantic information learning submodule is used for learning the semantic relation representation between nodes to support more accurate instance-level retrieval; training uses the node-based arrangement loss and the subgraph-based arrangement loss, so that each entity attends more to its semantic neighbours and a good node representation is obtained.
Further, the multi-modal merchandise retrieval module includes:
the single-commodity feature extraction submodule is used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract the image bounding boxes and bounding-box features, and all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, from which the image-text fused features are extracted and stored in the retrieval library;
the combined-product feature extraction submodule is used for extracting the features of each target commodity in a combined product; for each query combined commodity, the image is input into the trained detector to extract the image bounding boxes and bounding-box features, and each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity in the combined product;
the feature retrieval submodule is used for retrieving the single-commodity results of the combined product; the cosine similarity between each target commodity of the combined product and each single commodity is computed, and the closest single commodity is selected and returned as the result according to the cosine similarity.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention uses a self-supervision learning mode for training, and only depends on the naturally existing picture and title information and does not depend on any manually labeled category information; therefore, the method is easy to expand to large-scale data, learns a more judged feature representation, ensures the realization of high-quality instance-level commodity retrieval tasks and has stronger generalization; by using the information of two modes of the picture and the text and adding the entity information into the network training process, the image characteristic and the text characteristic can be effectively highly fused, the retrieval characteristic with higher discrimination is extracted, and the problem of incomplete information of a single mode is solved; the method and the device realize instance-level retrieval of the combined commodities, namely all the single commodities in the returned combined commodities are retrieved, can improve the commodity searching precision and help online users to search more accurate and specific commodities; the method can also be used for constructing an E-commerce knowledge map and mining commodity relations; the commodity relation obtained by combined commodity retrieval can be used for commodity recommendation, and the shopping platform recommendation effect is improved.
Drawings
FIG. 1 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 3 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 4 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 5 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating steps of a combined merchandise retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a network framework of a combined commodity retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 8 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 9 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 10 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention;
fig. 11 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1-5, a combined commodity retrieval method based on a multi-modal pre-training model includes:
s10, inputting the combined commodity image and text into a pre-trained top-down attention model, and extracting the position and characteristic information of a bounding box of all single commodities in the image; and obtaining entity information from the text through an analysis tool.
And S20, respectively extracting feature codes, position codes and segmentation codes of each mode for the image features, the text information and the entity information, and learning and embedding the representation to be used as the input of the multi-mode pre-training model.
In a specific embodiment, S20, for the image feature, the text information, and the entity information, respectively extracting feature codes, position codes, and segment codes of each modality, and learning the embedded representation as an input of the multi-modality pre-training model, includes:
S21, for the bounding boxes and features output by the target detector, calculating the position information of each box with a 5-dimensional vector, wherein the position information comprises the coordinates of the upper-left and lower-right corners of the box and the proportion of the box area to the whole image, and passing the 5-dimensional vector through a linear fully-connected layer to obtain the position code; using 0 as the segmentation information and passing it through a linear fully-connected layer to obtain the segment code; and passing the box features through a linear fully-connected layer to obtain the box feature code. Finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the image modality;
s22, for the text sequence, using the increasing natural number sequence to represent the position information, and transmitting the position information into the linear full-connection layer to obtain the position code; using 1 as segmentation information to be transmitted into a linear full-connection layer to obtain segmentation codes; and transmitting the text into a linear full-connection layer to obtain the feature code of the text. And finally, adding the position codes, the segmented codes and the feature codes to obtain the embedded representation of the text.
S23, for the entity information, using the increasing natural number sequence to represent the position information, and transmitting the position information into the linear full-connection layer to obtain the position code; using 1 as segmentation information to be transmitted into a linear full-connection layer to obtain segmentation codes; and transmitting the entity into a linear full-connection layer to obtain the feature code of the entity. And finally, adding the position code, the segmented code and the feature code to obtain the embedded representation of the entity.
And S30, inputting the image embedding representation, the text embedding representation and the entity embedding representation into three transformers by the multi-mode pre-training model, and gradually extracting the mutually fused retrieval characteristics of the three.
In a specific embodiment, the S30, inputting the image embedding representation, the text embedding representation, and the entity embedding representation into three transforms by the multi-modal pre-training model, and gradually extracting the search features fused with each other, includes:
S31, the visual/text/entity Transformer networks respectively extract the shallow features of the image, the text and the entity, calculating attention weights with Q and K and multiplying by V to obtain each modality's feature representation, wherein the entity and the text Transformers share network parameters. This layer is repeated L times and the result is passed into the next network;
and S32, extracting the characteristics of mutual attention among the modalities by a cross Transformer network, wherein the layer comprises three independent cross multi-head self-attention networks and is realized by exchanging K and V in different modalities. The visual cross Transformer calculates attention weight of the text so as to obtain image characteristics after cross attention; the text cross Transformer calculates attention weights of the images and the entities so as to obtain text features after cross attention; and the entity cross transformer calculates attention weight of the text so as to obtain the cross attention entity characteristics. The layer is repeatedly carried out for K times and then is transmitted into the next network;
and S33, extracting the characteristics of the comprehensive fusion of the image, the text and the entity through a public Transformer network, splicing the characteristics of the text and the characteristics of the image in a text-vision public Transformer, calculating the weight of all characteristics of each vector attention by using Q and K, and multiplying the weight by V to obtain the characteristic representation of the text and the characteristic representation of the image, wherein Q, K, V is obtained by the characteristics after the two modalities are spliced. In the text-entity common Transformer, text features and entity features are spliced, the weight of all the features of interest of each vector is calculated by using Q and K, and then V is multiplied to obtain feature representation of the text and feature representation of the entity, wherein Q, K, V is obtained by the features after splicing of two modalities. For images, texts and entities, the layer respectively calculates attention weights of all the characteristics of the images, the texts and the entities by using a multi-head attention mechanism so as to obtain fully fused characteristics. This layer is iterated H times.
And S40, performing network training by adopting four self-supervision tasks based on the loss function of image-text contrast learning through the multi-mode pre-training model.
In a specific embodiment, in S40 the multi-modal pre-training model performs network training with four self-supervision tasks based on a loss function of image-text contrastive learning, including:
s41, by covering the words in the title text, inputting the text sequence with the covered words into a multi-mode pre-training model, and learning and recovering the covered words in the training process of the model so as to extract a feature representation with text information;
s42, inputting the entity sequence with the masked words into a multi-mode pre-training model by masking the words in the entity information, and learning and recovering the masked words in the training process of the model so as to extract a feature representation with the entity information;
s43, inputting the image frame feature sequence with the mask into a pre-training model by masking the boundary frame features in the image, and learning and recovering the masked boundary frame features in the training process of the model so as to extract a feature representation with visual information;
s44, training the network by using the loss function of contrast learning: for the paired picture and text pairs, shortening the distance of the paired picture and text pairs in the training process; and for the unpaired image text pair, the distance is enlarged in the training process, so that the image text characteristic with discrimination is learned.
S50, constructing an entity graph enhancement module, learning entity knowledge with real semantic information through arrangement loss based on nodes and arrangement loss based on subgraphs, and enhancing feature representation.
In a specific embodiment, S50, constructing an entity graph enhancing module, and learning entity knowledge with true semantic information and enhancing feature representation through node-based arrangement loss and subgraph-based arrangement loss, includes:
and S51, initializing an entity queue for the text titles and the entity information extracted from the text titles, coding the entity queue through a public Transformer to obtain a joint embedded representation of the entity characteristics, and inputting the entity queue into a pre-trained AdaGAE network for graph clustering to obtain the semantic relation of the entity graph. Training is carried out by using the arrangement loss based on the nodes and the arrangement loss based on the subgraph, so that the entity can pay more attention to the adjacent semanteme, and good node representation is obtained.
S52, in the node-based arrangement loss, for a mini-batch of data, negative samples are selected based on cosine similarity. For each data entity of the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample. The selected negative samples are used to calculate the arrangement loss.
S53, in the arrangement loss based on the subgraph, the embedded representation of any subgraph feature is obtained by the global pooling calculation of k nearest neighbor nodes. Permutation penalties are computed for the nearest neighbor positive and negative subgraphs of any node.
And S60, inputting the image information and the text information of each single sample into the multi-modal model to extract retrieval characteristics, and storing the retrieval characteristics in a retrieval library.
S70, inputting image features, text information and entity information into the trained multi-mode model for each combined commodity, and extracting retrieval features of image-text fusion; and calculating the cosine similarity of the fusion feature and each single product in the sample library, and selecting the single product with the highest similarity as a retrieval result to return.
Example 2
As shown in fig. 6, a combined commodity retrieval method based on a multi-modal pre-training model includes the following steps and details:
step 1, inputting the combined commodity image and text into a pre-trained top-down attention model, and extracting the bounding box positions B of all single commodities in the image (B) 0 ,b 1 ,b 2 ,…,b K ) And the characteristic information F ═ F 0 ,f 1 ,f 2 ,…,f K ) (ii) a Title text T ═ T 0 ,t 1 ,t 2 ,…,t L ) Obtaining the entity information E ═ (E) by the analysis tool 0 ,e 1 ,e 2 ,…,e H )。
Step 2, for the image features, the text information and the entity information, the feature codes, position codes and segment codes of each modality are extracted respectively, and the embedded representations are learned as the input of the multi-modal pre-training model.
Using the bounding-box positions and features extracted by the detector network as the image feature input I = ((b_0, f_0), (b_1, f_1), (b_2, f_2), ..., (b_K, f_K)), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the image embedded representation E_img.

Using the product title T = (t_0, t_1, t_2, ..., t_L), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the text embedded representation E_txt.

Using the entity information E = (e_0, e_1, e_2, ..., e_H), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the entity embedded representation E_ent.
Specifically, as shown in fig. 7, the bounding-box features F are passed through a fully-connected layer to obtain the feature coding vector F_enc:

F_enc = σ(w_1·F + b_1)

where w_1 and b_1 are parameters of the fully-connected layer and σ is the activation function.

For the bounding boxes extracted by the commodity detector, the area ratio of each box to the whole picture is calculated to construct a 5-dimensional vector P (the upper-left and lower-right corner coordinates of the box plus its area ratio), which is passed through a fully-connected layer to output the position coding vector P_enc:

P_enc = σ(w_2·P + b_2)

where w_2 and b_2 are parameters of the fully-connected layer and σ is the activation function.

Using the integer 0 as the segmentation information S_img of the image modality, the segment coding vector S_enc is obtained through a fully-connected layer:

S_enc = σ(w_3·S_img + b_3)

where w_3 and b_3 are parameters of the fully-connected layer and σ is the activation function.

The feature coding vector, the position coding vector and the segment coding vector are added to obtain the embedded representation E_img of the image modality:

E_img = F_enc + P_enc + S_enc
Specifically, as shown in fig. 7, the title text T is passed through the embedding layer to obtain the feature coding vector T_enc:

T_enc = σ(w_4·T + b_4)

where w_4 and b_4 are parameters of the fully-connected layer and σ is the activation function.

The position information P of the words in the title (an increasing sequence of natural numbers) is passed through a fully-connected layer to obtain the position coding vector P_enc:

P_enc = σ(w_5·P + b_5)

where w_5 and b_5 are parameters of the fully-connected layer and σ is the activation function.

Using the integer 1 as the segmentation information S_txt of the text modality, the segment coding vector S_enc is obtained through a fully-connected layer:

S_enc = σ(w_6·S_txt + b_6)

where w_6 and b_6 are parameters of the fully-connected layer and σ is the activation function.

The feature coding vector, the position coding vector and the segment coding vector are added to obtain the embedded representation E_txt of the text modality:

E_txt = T_enc + P_enc + S_enc
The embedded representation of the entity is the same as the embedded representation of the text.
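A minimal sketch of the embedding computation above, assuming PyTorch and illustrative layer sizes (the hidden dimension, activation choice and class name are not taken from the invention):

```python
import torch
import torch.nn as nn

class ImageEmbedding(nn.Module):
    """Feature code + position code + segment code for the image modality (illustrative)."""
    def __init__(self, feat_dim=2048, hidden=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)  # w_1, b_1
        self.pos_fc = nn.Linear(5, hidden)          # w_2, b_2 (5-dim box vector)
        self.seg_fc = nn.Linear(1, hidden)          # w_3, b_3
        self.act = nn.ReLU()                        # sigma: activation (ReLU assumed)

    def forward(self, box_feats, boxes, img_wh):
        # boxes: (K, 4) as (x1, y1, x2, y2); img_wh: (width, height) of the whole image
        x1, y1, x2, y2 = boxes.unbind(-1)
        area_ratio = (x2 - x1) * (y2 - y1) / (img_wh[0] * img_wh[1])
        pos = torch.stack([x1, y1, x2, y2, area_ratio], dim=-1)   # 5-dimensional vector P
        seg = torch.zeros(boxes.size(0), 1)                       # segment id 0 for images
        f_enc = self.act(self.feat_fc(box_feats))
        p_enc = self.act(self.pos_fc(pos))
        s_enc = self.act(self.seg_fc(seg))
        return f_enc + p_enc + s_enc                              # E_img
```

The title text and entity embeddings are built the same way, with word indices as position information and the integer 1 as the segment id.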
Step 3, the image embedded representation E_img, the text embedded representation E_txt and the entity embedded representation E_ent are input into the three kinds of Transformers, the mutually fused image-text retrieval features H are extracted, and model training is performed with four self-supervision tasks.
Specifically, as shown in fig. 7, the picture embedding feature E_img, the text embedding feature E_txt and the entity embedding feature E_ent are first encoded by a picture Transformer, a text Transformer and an entity Transformer respectively, giving the individual picture, text and entity features H_img, H_txt and H_ent. The picture, text and entity Transformers each have four layers, and each layer is computed as:

H'^(t) = LN(H^(t-1) + MSA(H^(t-1)))
H^(t) = LN(H'^(t) + MLP(H'^(t)))

where H stands for H_img, H_txt or H_ent, t-1 and t are Transformer layer indices, LN is a LayerNorm layer performing feature normalization, MLP is a fully-connected layer and MSA is the multi-head attention layer, computed as:

Head_i = Attention(H·W_i^Q, H·W_i^K, H·W_i^V) = softmax(Q·K^T/√d)·V
MSA(H) = Concat(Head_1, ..., Head_h)·W^O
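For concreteness, a compact sketch of one such Transformer layer is given below; the hidden size, head count and norm placement are assumptions, and the code only illustrates the LN + MSA + MLP structure of the formulas above:

```python
import torch.nn as nn

class ModalityTransformerLayer(nn.Module):
    """One layer of the picture/text/entity Transformer: H' = LN(H + MSA(H)), H = LN(H' + MLP(H'))."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.msa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, h):
        attn_out, _ = self.msa(h, h, h)     # Q, K and V all come from the same modality
        h = self.ln1(h + attn_out)          # H'^(t) = LN(H^(t-1) + MSA(H^(t-1)))
        return self.ln2(h + self.mlp(h))    # H^(t)  = LN(H'^(t) + MLP(H'^(t)))
```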
After the individual picture, text and entity features H_img, H_txt and H_ent are obtained, they are passed into the cross Transformer so that the picture information and the text information attend to each other and the entity information and the text information attend to each other, giving the cross-attended features of the modalities. Each cross Transformer layer has the same residual structure as above, with MSA replaced by the cross-modal attention CMSA:

H'_img = LN(H_img + CMSA(H_img, H_txt)),  H_img = LN(H'_img + MLP(H'_img))
H'_txt = LN(H_txt + CMSA(H_txt, H_img, H_ent)),  H_txt = LN(H'_txt + MLP(H'_txt))
H'_ent = LN(H_ent + CMSA(H_ent, H_txt)),  H_ent = LN(H'_ent + MLP(H'_ent))

where CMSA is the cross-modal cross multi-head attention network, computed as:

CMSA(H_img, H_txt) = Concat(Head_1(H_img, H_txt), ..., Head_n(H_img, H_txt))
CMSA(H_txt, H_img, H_ent) = Concat(Head_1(H_txt, H_img), ..., Head_n(H_txt, H_img), Head_1(H_txt, H_ent), ..., Head_n(H_txt, H_ent))
CMSA(H_ent, H_txt) = Concat(Head_1(H_ent, H_txt), ..., Head_n(H_ent, H_txt))

and each cross-attention head takes its queries from the first modality and its keys and values from the second, i.e. Head_i(H_a, H_b) = Attention(H_a·W_i^Q, H_b·W_i^K, H_b·W_i^V).
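The distinctive point of the cross Transformer is that queries come from one modality while keys and values come from another. A minimal sketch under the same assumptions as above (only the two-modality case is shown):

```python
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Cross multi-head attention: queries from modality a, keys/values from modality b."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.cmsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, h_a, h_b):
        # e.g. h_a = image features, h_b = text features: the image attends to the text
        attn_out, _ = self.cmsa(query=h_a, key=h_b, value=h_b)
        h = self.ln1(h_a + attn_out)
        return self.ln2(h + self.mlp(h))
```

For the title text, whose CMSA attends to both the image and the entities, the corresponding cross-attention heads would be concatenated as in CMSA(H_txt, H_img, H_ent) above.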
The picture/text/entity features output by the cross Transformer are then passed into the common Transformer for more comprehensive feature fusion.

The text-vision common Transformer lets each region of the image modality attend to the features of the other regions and of all words, and lets each word of the text modality attend to the features of the other words and of all image regions. It applies the multi-head self-attention layer described above to the concatenated features, with Q, K and V all computed from the concatenation:

H_tv = LN([H_txt; H_img] + MSA([H_txt; H_img])),  [H_txt; H_img] = LN(H_tv + MLP(H_tv))

after which the result is split back into the text feature representation and the image feature representation.

The text-entity common Transformer lets each word in the entity attend to the features of the other entity words and of all text words, and lets each word of the text modality attend to the features of the other text words and of all entities; it is computed in the same way on the concatenation [H_txt; H_ent].
four pre-training tasks are used to train the model structure described above, including a text masking task, an entity masking task, an image area masking task, and a cross-modal contrast learning task.
Specifically, for each picture-text pair I = {I_1, I_2, I_3, ..., I_K}, T = {T_1, T_2, T_3, ..., T_L} and the extracted entities E = (E_0, E_1, E_2, ..., E_H):

The text masking task replaces each input text word with "[MASK]" with a probability of 15%, and the model predicts the masked words from the remaining words and the image; the corresponding loss is the negative log-likelihood of the masked text words given the remaining words and image regions.

The entity masking task replaces each input entity word with "[MASK]" with a probability of 15%, and the model predicts the masked words from the remaining words and the image; the corresponding loss is the negative log-likelihood of the masked entity words.

The image region masking task replaces each input image box feature with a zero vector with a probability of 15%, and the model predicts the masked image region features from the remaining image regions and the sentence words; the corresponding loss measures how well the masked region features are recovered.
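A small sketch of the 15% masking used by the text and entity masking tasks; the tokenizer, the plain-string tokens and the target convention are simplifications, not the invention's actual data pipeline:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Replace each token with [MASK] with probability 15%; return masked sequence and targets."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)     # the model must recover this word from context
        else:
            masked.append(tok)
            targets.append(None)    # no supervision at this position
    return masked, targets

# Example: mask_tokens(["wooden", "dining", "table"]) might return
# (["wooden", "[MASK]", "table"], [None, "dining", None]).
# The image region masking task is analogous, zeroing out 15% of the box features.
```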
In addition to learning the features of each modality, the model needs to guarantee consistency between the different input modalities in order to learn the correspondence between pictures and text, so a cross-modal contrastive learning task is used to align the picture modality and the text modality. For the N picture-text pairs of a training batch there are 2N data items; for each sample, the corresponding data of the other modality is treated as its positive pair and the remaining samples are treated as negative pairs. For an input picture-text pair (I_i, T_i), let (z_i^img, z_i^txt) be the image-text feature pair output by the image Transformer and the text Transformer; the loss function is:

L_cl = -log( exp(sim(z_i^img, z_i^txt)/τ) / Σ_j 1_[i≠j] exp(sim(z_i^img, z_j^txt)/τ) )

where sim(u, v) computes the similarity between the image-text pair u and v, τ is a temperature parameter, and 1_[i≠j] is a binary indicator that returns 1 if and only if i ≠ j. The contrastive loss pulls paired image-text vectors closer together and pushes unpaired image-text vectors apart.
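A sketch of this cross-modal contrastive loss in PyTorch; the cross-entropy form below keeps the positive pair in the denominator and adds the symmetric image-to-text direction, which are common simplifications rather than the exact formulation above:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_img, z_txt, tau=0.07):
    """InfoNCE-style loss: matched image/text rows are positives, all other rows negatives."""
    z_img = F.normalize(z_img, dim=-1)      # (N, D) image features
    z_txt = F.normalize(z_txt, dim=-1)      # (N, D) text features
    sim = z_img @ z_txt.t() / tau           # sim(u, v) / tau for every image-text pair
    labels = torch.arange(z_img.size(0))    # the i-th text matches the i-th image
    # pull paired vectors together, push unpaired ones apart (both retrieval directions)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
```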
Step 4, during training, an entity graph enhancement module is constructed using the title text and the corresponding entity information; entity knowledge with real semantic information is learned through the node-based arrangement loss and the subgraph-based arrangement loss, enhancing the feature representation.
Specifically, an entity queue l is initialized for the text title and the entity information extracted from it; the queue is encoded through the common Transformer to obtain the joint embedded representation l_f of the entity features, which is then input into a pre-trained AdaGAE network for graph clustering to obtain the embedded representation l_g of the entity graph. Training uses the node-based arrangement loss and the subgraph-based arrangement loss:

In the node-based arrangement loss, negative samples for a mini-batch of data are selected based on cosine similarity: for any entity h_ei of the batch, one sample h_ek is randomly chosen from the k entity samples with the lowest similarity as its negative sample, and the arrangement loss is computed with the selected negative sample, drawing h_ei towards its semantic neighbours and away from h_ek.

In the subgraph-based arrangement loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbour nodes; for any entity h_ei, the arrangement loss is computed over its nearest-neighbour positive subgraph h_gi and negative subgraph h_gk.
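The exact form of the arrangement losses is not reproduced here, so the following node-based sketch is only an assumption: the hardest-negative selection by cosine similarity follows the description above, while the margin (triplet-style) formulation and the way positive neighbours are supplied are illustrative choices.

```python
import torch
import torch.nn.functional as F

def node_arrangement_loss(entity_feats, neighbor_feats, margin=0.2, k=5):
    """Assumed margin loss: pull each entity towards a semantic neighbour,
    push it away from one of its k least-similar entities in the mini-batch."""
    e = F.normalize(entity_feats, dim=-1)    # (B, D) entity features h_e
    n = F.normalize(neighbor_feats, dim=-1)  # (B, D) positive neighbours (e.g. from l_g)
    sim = e @ e.t()                          # cosine similarity within the batch
    # for each entity, randomly pick a negative among its k lowest-similarity entities
    lowest = sim.topk(k, dim=-1, largest=False).indices       # (B, k)
    pick = torch.randint(0, k, (e.size(0), 1))
    neg = e[lowest.gather(1, pick).squeeze(1)]                 # (B, D) negatives h_ek
    pos_sim = (e * n).sum(-1)
    neg_sim = (e * neg).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```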
and 5, extracting a detection frame and a text word for each single sample in the same processing mode as training, inputting the detection frame and the text word into the multi-mode model, extracting image-text fusion characteristics, and storing the characteristics of each single sample in a search library.
And 6, for each combined commodity, extracting a boundary frame and boundary frame characteristics of each target commodity in the picture by using a trained detector, inputting the boundary frame and the boundary frame characteristics into a trained multi-mode model by combining text information, extracting retrieval characteristics of image-text fusion, calculating a Cosine similarity distance by using the characteristics and each single-commodity characteristic in a retrieval library, and sequencing according to the similarity, thereby obtaining a retrieval single-commodity sample which is the most matched as a final result.
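Finally, a sketch of the cosine-similarity retrieval of steps 5 and 6; the in-memory gallery is an assumption, and a production system would typically replace it with an approximate nearest-neighbour index:

```python
import torch
import torch.nn.functional as F

def retrieve_single_items(query_feats, gallery_feats, gallery_ids, top_k=1):
    """query_feats: (M, D) fused features, one per target commodity in the combined image.
    gallery_feats: (G, D) fused features of all single-commodity samples in the library."""
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sims = q @ g.t()                            # cosine similarity, shape (M, G)
    best = sims.topk(top_k, dim=-1).indices     # most similar single commodities per target
    return [[gallery_ids[j] for j in row.tolist()] for row in best]
```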
Example 3
As shown in fig. 8-11, a combined commodity retrieval system based on multi-modal pre-training models comprises:
the multi-mode pre-training model module 10 is used for extracting retrieval characteristics of image-text fusion, and training a model by using three different self-attention network layers and adopting a self-supervision learning mode so that the model can extract the characteristics of multi-mode fusion;
and the entity graph reinforcement learning module 20 is used for constructing the relationship between the entity nodes and helping more accurate instance-level retrieval. Coding entity information by using a self-attention network layer, and dividing subgraphs by using a graph clustering network; and carrying out network training based on the arrangement loss of the nodes and the arrangement loss of the subgraph to obtain good semantic relation representation.
The multi-mode commodity retrieval module 30 is used for extracting the features of the single commodities and storing the features in a retrieval library, extracting the features of each target commodity of the combined commodity and retrieving the features in the retrieval library, extracting the detection features of the single commodities by using the detector module, and inputting the detection features into the multi-mode pre-training model module to extract the retrieval features; and extracting the boundary box and the boundary box characteristics of each target commodity in the combined product by using a detector module, inputting the boundary box area and the title into a multi-mode pre-training model module to extract query characteristics, calculating the similarity between the query characteristics and the retrieval characteristics, and returning the most similar commodity.
In a specific embodiment, the multi-modal pre-training model module 10 further includes:
the image, text and entity embedded representation submodule 11 is used for coding information of an image modality to input a model, taking position coordinates of an image frame and area proportion of the image frame as position codes, and respectively transmitting the position coordinates and the area proportion of the image frame into a linear layer to obtain embedded representation of the image by combining detection characteristics of the segmented codes and the frame; the sequence number of each word in the text and the entity is used as position coding, the sequence number is transmitted into a linear layer by combining with sectional coding to obtain the embedded representation of the text and the entity, and the embedded representation of the image, the text and the entity is transmitted into a multi-head self-attention network;
the multi-head self-attention network sub-module 12 is used for extracting retrieval characteristics of high fusion of images, texts and entities, carrying out interaction among a plurality of modal information by using three different multi-head self-attention networks and extracting fully fused characteristics;
and the image-text-entity multi-mode pre-training sub-module 13 is used for training a multi-mode model to learn the characteristics with the discrimination, and completing the learning of multi-mode fusion characteristics by using four self-supervision tasks, including an image area covering task, a text covering task, an entity covering task and image-text cross-mode comparison learning.
In a specific embodiment, the entity graph enhancement learning module 20 further comprises:
the entity graph construction submodule 21, used for constructing a node graph and partitioning subgraphs for given entity information; the initialized entity queue is encoded by a shared Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph.
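Conceptually, this submodule encodes the entity queue with a Transformer encoder and then groups the resulting embeddings into subgraphs. In the sketch below, ordinary k-means clustering stands in for the pre-trained AdaGAE graph-clustering network purely for illustration; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

hidden, n_subgraphs = 768, 8
encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
entity_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)   # stands in for the shared Transformer

entity_queue = torch.randn(1, 200, hidden)   # (1, num_entities, hidden): toy entity embeddings
with torch.no_grad():
    entity_repr = entity_encoder(entity_queue).squeeze(0)   # joint embedded representation of the entities

# group entities into semantic subgraphs (k-means used here instead of AdaGAE)
subgraph_labels = KMeans(n_clusters=n_subgraphs, n_init=10).fit_predict(entity_repr.numpy())
```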
the entity graph semantic information learning submodule 22, used for learning semantic relationship representations among nodes to support more accurate instance-level retrieval; training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors, so that good node representations are obtained.
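A margin-based ranking loss of the kind referred to here can be sketched as follows (a hedged illustration: the margin value and the exact positive/negative sampling are assumptions; the same form applies at the subgraph level by replacing node embeddings with pooled subgraph embeddings).

```python
import torch
import torch.nn.functional as F

def node_ranking_loss(anchor: torch.Tensor,
                      positive: torch.Tensor,
                      negative: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """anchor/positive/negative: (B, D) entity-node embeddings.
    Positives are semantic neighbours in the entity graph; negatives are sampled
    from low-similarity entities in the mini-batch."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()   # penalise negatives that rank too close
```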
In a specific embodiment, the multi-modal commodity retrieval module 30 further comprises:
the single-commodity feature extraction submodule 31, used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, and the extracted image-text fused features are stored in the retrieval library;
the combined-commodity feature extraction submodule 32, used for extracting the features of each target commodity in a combined commodity; for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity;
the feature retrieval submodule 33, used for retrieving single-commodity results for the combined commodity; the cosine similarity between each target commodity of the combined commodity and each single commodity is computed, and the closest single commodity is selected and returned as the result.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and should not be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A combined commodity retrieval method based on a multi-modal pre-training model, characterized by comprising the following steps:
S1: for a combined commodity image and its title text, inputting the image into a pre-trained top-down attention model and extracting the bounding box positions and feature information of all single commodities in the image; obtaining entity information from the title text through a parsing tool;
S2: respectively extracting feature encodings, position encodings and segment encodings from the image features, the title text information and the entity information, and learning embedded representations as the input of the multi-modal pre-training model;
S3: inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three Transformers, gradually extracting mutually fused retrieval features of the three, and training with four self-supervision tasks;
S4: during training, constructing an entity graph enhancement module with the title text and the corresponding entity information, learning entity knowledge carrying real semantic information through a node-based ranking loss and a subgraph-based ranking loss, and enhancing the feature representation;
S5: inputting the image information and title text information of each single-commodity sample into the multi-modal model to extract retrieval features, and storing the retrieval features in a retrieval library;
S6: for each combined commodity, inputting the image features, the title text information and the entity information into the pre-trained multi-modal model to extract image-text fused retrieval features; computing the cosine similarity between the fused features and each single commodity in the sample library, and returning the single commodity with the highest similarity as the retrieval result.
2. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 1, wherein in step S1:
the process of extracting the bounding box positions and feature information of all single commodities in the image with the top-down attention model comprises:
using a bottom-up attention model pre-trained on the VG (Visual Genome) data set as the target detector, inputting the combined commodity image, and extracting the bounding box position information and bounding box features of each single commodity as the image-feature input of the combined commodity;
the process of obtaining entity information from the title text through the parsing tool comprises:
extracting a set of noun entities from the title text with an NLP parsing tool, and using the set as the entity information input.
3. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 2, wherein the specific process of step S2 is:
S21: for the bounding box positions and feature information, a 5-dimensional vector encodes the position information of each box, namely the coordinates of its upper-left and lower-right corners and the proportion of the whole image occupied by the box; the 5-dimensional vector is passed through a linear fully connected layer to obtain the position encoding; 0 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the box features are passed through a linear fully connected layer to obtain the feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the image modality;
S22: for the title text information, an incrementing sequence of natural numbers represents its position information and is passed through a linear fully connected layer to obtain the position encoding; 1 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the title text is passed through a linear fully connected layer to obtain the text feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the title text;
S23: for the entity information, an incrementing sequence of natural numbers represents its position information and is passed through a linear fully connected layer to obtain the position encoding; 1 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the entities are passed through a linear fully connected layer to obtain the entity feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the entities.
4. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 3, wherein in step S3 the process of inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three Transformers and gradually extracting their mutually fused retrieval features comprises:
1): the image/title-text/entity Transformer networks respectively extract shallow features of the image, the title text and the entities; the attention weights are computed from Q and K and multiplied by V to obtain the feature representation of each modality, the Transformers of the entities and the text share network parameters, and this layer is repeated L times before being passed to the next network;
2): the cross-Transformer network extracts mutual attention features among the modalities; this layer comprises three independent cross multi-head self-attention networks and is realized by exchanging K and V between different modalities; the image cross-Transformer computes attention weights over the text to obtain the cross-attended image features; the title-text cross-Transformer computes attention weights over the image and the entities to obtain the cross-attended text features; the entity cross-Transformer computes attention weights over the text to obtain the cross-attended entity features; this layer is repeated K times before being passed to the next network;
3): the shared Transformer network extracts comprehensively fused features of the image, the title text and the entities; in the text-vision shared Transformer, the title text features and the image features are concatenated, Q and K are used to compute the weight with which each vector attends to all features, and the result is multiplied by V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the title-text-entity shared Transformer, the title text features and the entity features are concatenated, Q and K are used to compute the weight with which each vector attends to all features, and the result is multiplied by V to obtain the feature representations of the title text and of the entities, where Q, K and V are obtained from the concatenated features of the two modalities; for the image, the title text and the entities, this layer uses a multi-head attention mechanism to compute attention weights over all of their features, thereby obtaining fully fused features, and the layer is iterated H times.
5. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 4, wherein in step S3 the training process with four self-supervision tasks comprises:
1) masking words in the title text and inputting the masked title text sequence into the multi-modal pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying title text information;
2) masking words in the entity information and inputting the masked entity sequence into the multi-modal pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying entity information;
3) masking bounding-box features in the image and inputting the masked image box-feature sequence into the pre-training model; during training the model learns to recover the masked bounding-box features, thereby extracting a feature representation carrying visual information;
4) training the network with a contrastive-learning loss function: for paired pictures and title texts, their distance is shortened during training; for unpaired picture-title-text pairs, their distance is enlarged during training, so that discriminative image-text features are learned.
6. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 5, wherein in step S4 the process of constructing the entity graph enhancement module with the title text and the corresponding entity information during training comprises:
S41: initializing an entity queue for the title text and the entity information extracted from it, encoding the entity queue with a shared Transformer to obtain a joint embedded representation of the entity features, and inputting it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph; training with the node-based ranking loss and the subgraph-based ranking loss so that each entity attends more to its semantic neighbors and good node representations are obtained;
S42: in the node-based ranking loss, for a mini-batch of data, negative samples are selected based on cosine similarity: for each entity in the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the ranking loss is computed with the selected negative sample;
S43: in the subgraph-based ranking loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbor nodes, and the ranking loss is computed over the nearest positive and negative subgraphs of each node.
7. A combined commodity retrieval system based on a multi-modal pre-training model, characterized by comprising:
a multi-modal pre-training model module, used for extracting image-text fused retrieval features; the model is trained with three different self-attention network layers in a self-supervised learning manner, so that it can extract multi-modal fused features;
an entity graph enhancement learning module, used for constructing relationships among entity nodes to support more accurate instance-level retrieval; entity information is encoded with a self-attention network layer, and subgraphs are partitioned with a graph clustering network; the network is trained with a node-based ranking loss and a subgraph-based ranking loss to obtain a good semantic relationship representation;
a multi-modal commodity retrieval module, used for extracting single-commodity features and storing them in a retrieval library, and for extracting the features of each target commodity of a combined commodity and retrieving them against the library; the detector module extracts detection features of each single commodity, which are input into the multi-modal pre-training model module to extract retrieval features; the detector module also extracts the bounding box and bounding-box features of each target commodity in the combined commodity, the bounding-box regions and the title are input into the multi-modal pre-training model module to extract query features, the similarity between the query features and the retrieval features is computed, and the most similar commodity is returned.
8. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 7, wherein the multi-modal pre-training model module comprises:
an image, title-text and entity embedding representation submodule, used for encoding image-modality information as model input: the position coordinates of each image box and its area ratio serve as the position encoding and, together with the segment encoding and the detection features of the box, are each passed through a linear layer and combined to obtain the image embedding; the index of each word in the title text and in the entities serves as the position encoding and, combined with the segment encoding, is passed through a linear layer to obtain the title-text and entity embeddings; the image, title-text and entity embeddings are fed into the multi-head self-attention network;
a multi-head self-attention network submodule, used for extracting highly fused retrieval features of the image, the title text and the entities; three different multi-head self-attention networks carry out interaction among the modalities and extract fully fused features;
an image-text-entity multi-modal pre-training submodule, used for training the multi-modal model to learn discriminative features; learning of the multi-modal fused features is completed with four self-supervision tasks, namely an image region masking task, a title-text masking task, an entity masking task and image-text cross-modal contrastive learning.
9. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 8, wherein the entity graph enhancement learning module comprises:
an entity graph construction submodule, used for constructing a node graph and partitioning subgraphs for given entity information; the initialized entity queue is encoded by a shared Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph;
an entity graph semantic information learning submodule, used for learning semantic relationship representations among nodes to support more accurate instance-level retrieval; training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors, yielding good node representations.
10. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 9, wherein the multi-modal commodity retrieval module comprises:
a single-commodity feature extraction submodule, used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, and the extracted image-text fused features are stored in the retrieval library;
a combined-commodity feature extraction submodule, used for extracting the features of each target commodity in a combined commodity; for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity;
a feature retrieval submodule, used for retrieving single-commodity results for the combined commodity; the cosine similarity between each target commodity of the combined commodity and each single commodity is computed, and the closest single commodity is selected and returned as the result.
CN202210453799.4A 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model Active CN114840705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210453799.4A CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210453799.4A CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Publications (2)

Publication Number Publication Date
CN114840705A true CN114840705A (en) 2022-08-02
CN114840705B CN114840705B (en) 2024-04-19

Family

ID=82568012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210453799.4A Active CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Country Status (1)

Country Link
CN (1) CN114840705B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CA3167569A1 (en) * 2020-04-15 2021-10-21 Maksymilian Clark Polaczuk Systems and methods for determining entity attribute representations
CN112860930A (en) * 2021-02-10 2021-05-28 浙江大学 Text-to-commodity image retrieval method based on hierarchical similarity learning
CN114036246A (en) * 2021-12-06 2022-02-11 国能(北京)商务网络有限公司 Commodity map vectorization method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAO JINGTIAN; WANG YONGLI; SHI QIUYAN; DONG ZHENJIANG: "A Recommendation Algorithm Framework Based on Joint Item Collocation Degree", Journal of University of Shanghai for Science and Technology, no. 01, 15 February 2017 (2017-02-15) *
DENG YIJIAO; ZHANG FENGLI; CHEN XUEQIN; AI QING; YU SU?: "A Collaborative Attention Network Model for Cross-Modal Retrieval", Computer Science, no. 04, 31 December 2020 (2020-12-31) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080766A (en) * 2022-08-16 2022-09-20 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115080766B (en) * 2022-08-16 2022-12-06 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115964560A (en) * 2022-12-07 2023-04-14 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN116383491A (en) * 2023-03-21 2023-07-04 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product
CN116383491B (en) * 2023-03-21 2024-05-24 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product
CN116051132A (en) * 2023-04-03 2023-05-02 之江实验室 Illegal commodity identification method and device, computer equipment and storage medium
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117035074B (en) * 2023-10-08 2024-02-13 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117708354A (en) * 2024-02-06 2024-03-15 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium
CN117708354B (en) * 2024-02-06 2024-04-30 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114840705B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN114840705B (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN114445201A (en) Combined commodity retrieval method and system based on multi-mode pre-training model
Guo et al. Small object sensitive segmentation of urban street scene with spatial adjacency between object classes
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN111651974A (en) Implicit discourse relation analysis method and system
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN115994990A (en) Three-dimensional model automatic modeling method based on text information guidance
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN112036189A (en) Method and system for recognizing gold semantic
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
Du et al. From plane to hierarchy: Deformable transformer for remote sensing image captioning
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN116823321B (en) Method and system for analyzing economic management data of electric business
CN116522942A (en) Chinese nested named entity recognition method based on character pairs
Wang et al. Inductive zero-shot image annotation via embedding graph
CN114898192A (en) Model training method, prediction method, device, storage medium, and program product
CN113763084A (en) Product recommendation processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Jingyi

Inventor after: Dong Xiao

Inventor after: Zhan Xunlin

Inventor after: Liang Xiaodan

Inventor before: Zhan Xunlin

Inventor before: Wu Yangxin

Inventor before: Dong Xiao

Inventor before: Liang Xiaodan

GR01 Patent grant
GR01 Patent grant