CN114840705B - Combined commodity retrieval method and system based on multi-mode pre-training model - Google Patents

Combined commodity retrieval method and system based on multi-mode pre-training model

Info

Publication number
CN114840705B
Authority
CN
China
Prior art keywords
entity
image
information
text
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210453799.4A
Other languages
Chinese (zh)
Other versions
CN114840705A (en)
Inventor
张靖宜
董晓
詹巽霖
梁小丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210453799.4A priority Critical patent/CN114840705B/en
Publication of CN114840705A publication Critical patent/CN114840705A/en
Application granted granted Critical
Publication of CN114840705B publication Critical patent/CN114840705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Development Economics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a combined commodity retrieval method and system based on a multi-mode pre-training model. The method uses information from two modalities, images and text, and adds entity information to the network training process, so that image features and text features can be fused effectively, more discriminative retrieval features are extracted, and the problem of incomplete single-modality information is alleviated. The method realizes instance-level retrieval of combined commodities, that is, all individual products in a combined commodity are retrieved and returned, which improves commodity search precision and helps online users find more accurate and specific products. The method can also be used to construct an e-commerce knowledge graph and to mine commodity relationships, and the commodity relationships obtained through combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms.

Description

Combined commodity retrieval method and system based on multi-mode pre-training model
Technical Field
The invention relates to the field of instance-level commodity retrieval, and in particular to a combined commodity retrieval method and system based on a multi-mode pre-training model.
Background
With the rapid development of electronic commerce, commodity types have become increasingly rich and the demands of online customers have grown and diversified. In e-commerce, many goods are presented as sets, i.e., multiple instances of different goods appear in one image. However, a customer or merchant may wish to find an individual product within a product combination for similar-product retrieval, online product recommendation, or matching the same product for price comparison.
In the field of commodity retrieval, existing methods take single-modality data as input, such as a text or an image, and then perform matching search in a retrieval library. In the e-commerce field, however, retrieval libraries commonly contain both images and text, and the current retrieval paradigm, which does not make full use of multi-modal data, greatly limits practical usage scenarios.
In addition, existing methods mainly focus on relatively simple settings such as image-level retrieval, which neither determines whether a commodity image contains multiple objects nor distinguishes between them, whereas instance-level commodity retrieval requires retrieving and returning every individual product in a combined commodity; there is little related research on this retrieval setting. Existing methods also depend on annotation information for training and therefore generalize poorly to large-scale real-world datasets; because the heterogeneous data generated by shopping websites accumulate continuously, such large-scale, weakly annotated data are difficult to handle with existing multi-modal retrieval algorithms.
Combined commodity retrieval has high practical value and application prospects in e-commerce. First, it improves commodity search precision and helps online users find more accurate and specific products; second, it can be used to construct an e-commerce knowledge graph and mine commodity relationships; third, the commodity relationships obtained through combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms. How to perform multi-modal, instance-level combined commodity retrieval in real scenarios with large data volumes and a lack of labels is therefore a practical but unsolved problem.
The patent "Long-tail commodity recommendation method based on graph attention network" first acquires the interaction information of commodities to be recommended, including a set of commodity ID information, a set of user ID information, popularity information of the commodities to be recommended, and the users' commodity interaction records, where the popularity information represents the amount of interaction between users and commodities; the interaction information is then input into a trained commodity recommendation model to obtain recommendation results. That trained recommendation model is built on a graph attention network and trained with sample commodity ID information and sample user ID information labeled as negative samples, which are determined from sample commodity popularity information and preset hyper-parameters. However, that patent gives little attention to the low accuracy caused by commodity retrieval that relies on single-modality data and image-level retrieval.
Disclosure of Invention
The invention provides a combined commodity retrieval method based on a multi-mode pre-training model, which solves the problem of low accuracy caused by commodity retrieval that relies on single-modality data and image-level retrieval.
It is yet another object of the present invention to provide a combined commodity retrieval system based on a multimodal pre-training model.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a combined commodity retrieval method based on a multi-mode pre-training model comprises the following steps:
S1: for a combined commodity image and its title text, inputting the image into a pre-trained top-down attention model and extracting the bounding-box positions and feature information of all individual products in the image; passing the title text through a parsing tool to obtain entity information;
S2: extracting the feature code, position code and segment code of the image features, the title text and the entity information respectively, and learning embedded representations as the input of the multi-mode pre-training model;
S3: inputting the image embedded representation, the title-text embedded representation and the entity embedded representation into three Transformers, progressively extracting mutually fused retrieval features of the three, and training with four self-supervised tasks;
S4: during training, constructing an entity graph enhancement module from the title text and the corresponding entity information, and learning entity knowledge with real semantic information through a node-based ranking loss and a subgraph-based ranking loss to enhance the feature representation;
S5: for each individual-product sample, inputting its image information and title text into the multi-mode model to extract retrieval features, and storing the retrieval features in a retrieval library;
S6: for each combined commodity, inputting the image features, the title text and the entity information into the pre-trained multi-mode model and extracting image-text fused retrieval features; computing the cosine similarity between the fused features and each individual product in the retrieval library, and returning the individual product with the highest similarity as the retrieval result.
Further, in step S1, the process of extracting the bounding-box positions and feature information of all individual products in the image with the top-down attention model includes:
using a bottom-up attention model pre-trained on the VG dataset as the object detector, inputting the combined commodity image, extracting the bounding-box position information and bounding-box features of each individual product, and using them as the image-feature input of the combined commodity;
the process of obtaining entity information from the title text through the parsing tool includes:
extracting a set of noun entities from the title text with an NLP parsing tool and using it as the entity-information input.
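The patent does not name the parsing tool; purely as an illustration, the sketch below uses spaCy noun-chunk parsing (an assumed stand-in, not the tool of the invention) to obtain a noun entity set from a commodity title.

```python
import spacy

# assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entities(title):
    """Return a set of noun entities from a commodity title (hypothetical stand-in for the parsing tool)."""
    doc = nlp(title)
    # noun chunks roughly correspond to the noun entity set described in the patent
    return sorted({chunk.root.lemma_.lower() for chunk in doc.noun_chunks})

print(extract_entities("Blue cotton T-shirt and denim jeans two-piece set"))
```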
Further, the specific process of step S2 is:
S21: for the bounding-box positions and features, representing the position information of each box with a 5-dimensional vector consisting of the box's upper-left coordinates, lower-right coordinates and the ratio of the box size to the whole image, and passing this 5-dimensional vector through a linear fully connected layer to obtain the position code; passing 0 as the segment information through a linear fully connected layer to obtain the segment code; passing the box features through a linear fully connected layer to obtain the feature code; finally, summing the position code, the segment code and the feature code to obtain the embedded representation of the image modality;
S22: for the title text, representing the position information with an increasing sequence of natural numbers and passing it through a linear fully connected layer to obtain the position code; passing 1 as the segment information through a linear fully connected layer to obtain the segment code; passing the title text through a linear fully connected layer to obtain the text feature code; finally, summing the position code, the segment code and the feature code to obtain the embedded representation of the title text;
S23: for the entity information, representing the position information with an increasing sequence of natural numbers and passing it through a linear fully connected layer to obtain the position code; passing 1 as the segment information through a linear fully connected layer to obtain the segment code; passing the entities through a linear fully connected layer to obtain the entity feature code; finally, summing the position code, the segment code and the feature code to obtain the embedded representation of the entities.
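A minimal PyTorch sketch of the S21-S23 embedding scheme, under stated assumptions (the hidden size, the use of plain linear layers without an explicit activation, and the random stand-ins for detector and word features are illustrative choices, not the patent's implementation): each modality's embedded representation is the sum of its feature code, position code and segment code.

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Sum of feature, position and segment codes, as in steps S21-S23 (dimensions assumed)."""
    def __init__(self, feat_dim, pos_dim, hidden=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)   # feature code
        self.pos_fc = nn.Linear(pos_dim, hidden)     # position code (5-d box vector or word index)
        self.seg_fc = nn.Linear(1, hidden)           # segment code (0 for image, 1 for text/entity)

    def forward(self, feats, pos, segment_id):
        seg = torch.full((feats.size(0), 1), float(segment_id))
        return self.feat_fc(feats) + self.pos_fc(pos) + self.seg_fc(seg)

# image modality: K boxes with 2048-d detector features and 5-d position vectors, segment id 0
K = 4
box_pos = torch.rand(K, 5)                           # (x1, y1, x2, y2, area ratio), normalized
box_feats = torch.rand(K, 2048)                      # stand-in for detector bounding-box features
img_embed = ModalityEmbedding(2048, 5)(box_feats, box_pos, segment_id=0)

# title text: L word vectors (random stand-ins) with positions 0..L-1, segment id 1
L = 6
word_feats = torch.rand(L, 300)                      # stand-in for word representations
word_pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)
txt_embed = ModalityEmbedding(300, 1)(word_feats, word_pos, segment_id=1)
print(img_embed.shape, txt_embed.shape)              # torch.Size([4, 768]) torch.Size([6, 768])
```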
Further, in step S3, the process of inputting the image embedded representation, the title-text embedded representation and the entity embedded representation into three Transformers and progressively extracting mutually fused retrieval features of the three comprises:
1) the image/title-text/entity Transformer networks extract shallow features of the image, the title text and the entities respectively, computing attention weights from Q and K and multiplying them with V to obtain each modality's feature representation, where the entity Transformer and the text Transformer share network parameters; this layer is repeated L times and its output is passed to the next network;
2) the cross-Transformer network extracts mutual-attention features between modalities; this layer contains three independent cross multi-head self-attention networks, implemented by exchanging K and V across modalities: the image cross-Transformer computes attention weights over the text to obtain cross-attended image features; the title-text cross-Transformer computes attention weights over the image and the entities to obtain cross-attended text features; the entity cross-Transformer computes attention weights over the text to obtain cross-attended entity features; this layer is repeated K times and its output is passed to the next network;
3) the common Transformer network extracts fully fused features of the image, the title text and the entities: in the text-visual common Transformer, the title-text features and the image features are concatenated, the weight of each vector attending to all features is computed from Q and K and multiplied with V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the title text-entity common Transformer, the title-text features and the entity features are concatenated, the weight of each vector attending to all features is computed from Q and K and multiplied with V to obtain the feature representations of the title text and of the entities, where Q, K and V are obtained from the concatenated features of the two modalities; for the image, the title text and the entities, this layer uses a multi-head attention mechanism to compute attention weights over all features of the image, title text and entities respectively, obtaining fully fused features; this layer is iterated H times.
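As an illustration of the cross-attention stage in 2), the sketch below implements a cross multi-head attention block in which queries come from one modality and keys/values from another; the head count, hidden size and the use of nn.MultiheadAttention with a residual LayerNorm are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Cross multi-head attention: Q from modality A, K and V from modality B (dimensions assumed)."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, a, b):
        # modality A attends over modality B; residual connection and LayerNorm close the block
        out, _ = self.attn(query=a, key=b, value=b)
        return self.norm(a + out)

img = torch.rand(1, 4, 768)                 # 4 bounding-box embeddings
txt = torch.rand(1, 6, 768)                 # 6 title-word embeddings
cross_img = CrossModalAttention()(img, txt) # image cross-attends to text
cross_txt = CrossModalAttention()(txt, img) # text cross-attends to image
print(cross_img.shape, cross_txt.shape)
```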
Further, in step S3, the training process using four self-supervision tasks includes:
1) masking words in the title text and inputting the masked title-text sequence into the multi-mode pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying title-text information;
2) masking words in the entity information and inputting the masked entity sequence into the multi-mode pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying entity information;
3) masking bounding-box features in the image and inputting the masked image box-feature sequence into the pre-training model; during training the model learns to recover the masked bounding-box features, thereby extracting a feature representation carrying visual information;
4) training the network with a contrastive loss function: for paired images and title texts, the distance is reduced during training; for unpaired image-title pairs, the distance is increased, so that discriminative image-text features are learned.
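As a hedged illustration of the masking tasks in 1)-3), the following sketch builds a masked title sequence and the positions whose original words the model must recover; the 15% masking ratio follows the embodiment described later, while the token handling and the [MASK] symbol are assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK]; return the corrupted sequence and the recovery targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok            # the model is trained to recover this word
        else:
            corrupted.append(tok)
    return corrupted, targets

title = ["blue", "cotton", "t-shirt", "and", "denim", "jeans", "set"]
masked_title, recover = mask_tokens(title, seed=0)
print(masked_title)   # corrupted input sequence fed to the pre-training model
print(recover)        # positions on which the title-text masking loss is computed
```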
Further, in step S4, the process of constructing the entity graph enhancement module from the title text and the corresponding entity information during training includes:
S41: initializing an entity queue from the title text and the entity information extracted from it, encoding the queue with the common Transformer to obtain a joint embedded representation of the entity features, and then feeding it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph; training with the node-based ranking loss and the subgraph-based ranking loss so that each entity attends more to its semantic neighbors and a good node representation is obtained;
S42: in the node-based ranking loss, negative samples are selected within a mini-batch using cosine similarity: for each entity in the batch, one of the k entity samples with the lowest similarity is randomly chosen as the negative sample, and the ranking loss is computed with the selected negative sample;
S43: in the subgraph-based ranking loss, the embedded representation of any subgraph is computed by global pooling over its k nearest-neighbor nodes, and the ranking loss is computed between the nearest-neighbor positive and negative subgraphs of each node.
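A sketch of the node-level objective in S42 under stated assumptions: the excerpt specifies hard-negative selection by cosine similarity but not the exact loss formula, so a margin-based ranking loss is assumed here, and the positive-neighbor indices are taken as given inputs.

```python
import torch
import torch.nn.functional as F

def node_ranking_loss(entity_emb, pos_index, k=5, margin=0.2):
    """entity_emb: (N, d) mini-batch of entity embeddings; pos_index: (N,) index of each entity's
    positive semantic neighbor (e.g. from the clustered entity graph). The negative is drawn at
    random from the k least-similar entities; the margin form of the loss is an assumption."""
    z = F.normalize(entity_emb, dim=1)
    sims = z @ z.t()                                             # pairwise cosine similarities
    neg_candidates = sims.topk(k, dim=1, largest=False).indices  # k lowest-similarity entities per row
    choice = torch.randint(0, k, (z.size(0), 1))
    neg_index = neg_candidates.gather(1, choice).squeeze(1)

    rows = torch.arange(z.size(0))
    pos_sim = sims[rows, pos_index]
    neg_sim = sims[rows, neg_index]
    return F.relu(margin - pos_sim + neg_sim).mean()             # positives pulled closer, negatives pushed away

emb = torch.rand(8, 32)
pos = torch.randint(0, 8, (8,))
print(node_ranking_loss(emb, pos).item())
```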
A combined commodity retrieval system based on a multi-modal pre-training model, comprising:
the multi-mode pre-training model module is used for extracting image-text fused retrieval features; three different self-attention network layers are used, and the model is trained in a self-supervised manner so that it can extract multi-modally fused features;
the entity graph enhancement learning module is used for constructing relationships among entity nodes to help more accurate instance-level retrieval; it encodes the entity information with a self-attention network layer and divides subgraphs with a graph clustering network; network training is performed with the node-based ranking loss and the subgraph-based ranking loss to obtain a good semantic relationship representation;
the multi-mode commodity retrieval module is used for extracting the features of individual products and storing them in a retrieval library, and for extracting the features of each target commodity of a combined product and searching the retrieval library; the detector module extracts detection features of each individual product, which are input into the multi-mode pre-training model module to extract retrieval features; the detector module also extracts the bounding box and bounding-box features of each target commodity in the combination, the bounding-box regions and the title are input into the multi-mode pre-training model module to extract query features, the similarity between the query features and the retrieval features is computed, and the most similar commodity is returned.
Further, the multi-modal pre-training model module includes:
The image, title text and entity embedded-representation sub-module is used for encoding the image-modality information as model input: the position coordinates of each image box and the ratio of the box area to the image are used as the position code and, together with the segment code and the box's detection features, are passed through linear layers to obtain the embedded representation of the image; the index of each word in the title text and in the entities is used as the position code and, combined with the segment code, is passed through a linear layer to obtain the embedded representations of the title text and the entities; the embedded representations of the image, the title text and the entities are fed into the multi-head self-attention network;
The multi-head self-attention network sub-module is used for extracting highly fused retrieval features of the image, the title text and the entities; three different multi-head self-attention networks perform interaction among the modalities to extract fully fused features;
The image-text-entity multi-modal pre-training sub-module is used for training the multi-modal model to learn discriminative features; four self-supervised tasks complete the multi-modal fusion feature learning, namely an image-region masking task, a title-text masking task, an entity masking task, and image-text cross-modal contrastive learning.
Further, the entity graph enhancement learning module includes:
The entity graph construction sub-module, which is used for constructing a node graph and dividing subgraphs for the given entity information; the initialized entity queue is encoded with the common Transformer to obtain a joint embedded representation of entity features, which is then fed into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph;
The entity graph semantic-information learning sub-module, which is used for learning the semantic relationship representation among nodes to help more accurate instance-level retrieval; training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors and yields a good node representation.
Further, the multi-modal merchandise retrieval module includes:
The individual-product feature extraction sub-module is used for extracting individual-product features and storing them in the retrieval library: the image of each individual-product sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes, their features and the text are then used as inputs to the multi-mode pre-training model to extract image-text fused features, which are stored in the retrieval library;
The combined-product feature extraction sub-module is used for extracting the features of each target commodity in a combined product: for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, together with the title, are input into the multi-mode model module to extract the image-text fused features of each target commodity in the combined product;
The feature retrieval sub-module is used for retrieving individual-product results for the combined product: the cosine similarity between each target commodity of the combined product and each individual product is computed, and the closest individual product is selected according to the cosine similarity and returned as the result.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The invention is trained in a self-supervised manner, relying only on naturally occurring images and titles and not on any manually annotated category information; it is therefore easy to scale to large data, learns a more discriminative feature representation, guarantees a high-quality instance-level commodity retrieval task, and generalizes better. By using information from the two modalities of images and text and adding entity information to the network training process, image features and text features can be fused effectively, more discriminative retrieval features are extracted, and the problem of incomplete single-modality information is alleviated. The method realizes instance-level retrieval of combined commodities, i.e., all individual products in a combined commodity are retrieved and returned, which improves commodity search precision and helps online users find more accurate and specific products. The method can also be used to construct an e-commerce knowledge graph and mine commodity relationships, and the commodity relationships obtained through combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms.
Drawings
FIG. 1 is a flow chart of a method for combined commodity retrieval based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for combined commodity retrieval based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 3 is a flowchart of a method for combined commodity retrieval based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 4 is a flowchart of a method for combined commodity retrieval based on a multi-modal pre-training model according to a further embodiment of the present invention;
FIG. 5 is a flowchart of a method for combined commodity retrieval based on a multi-modal pre-training model according to a further embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for searching combined merchandise based on a multi-mode pre-training model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a network framework of a method for searching combined merchandise based on a multi-mode pre-training model according to an embodiment of the present invention;
FIG. 8 is a device diagram of a combined commodity retrieval system according to an embodiment of the present invention based on a multi-modal pre-training model;
FIG. 9 is a device diagram of a combined commodity retrieval system according to another embodiment of the present invention based on a multi-modal pre-training model;
FIG. 10 is a device diagram of a combined commodity retrieval system according to another embodiment of the present invention based on a multi-modal pre-training model;
FIG. 11 is a device diagram of a combined commodity retrieval system according to a further embodiment of the present invention based on a multi-modal pre-training model.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
It will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1-5, a combined commodity retrieval method based on a multi-mode pre-training model includes:
S10, inputting the images into a pre-trained top-down attention model for the combined commodity images and texts, and extracting the boundary box positions and characteristic information of all single commodities in the image; and obtaining entity information from the text through an analysis tool.
S20, extracting feature codes, position codes and segment codes of all modes respectively for the image features, text information and entity information, and learning embedded representation as input of a multi-mode pre-training model.
In a specific embodiment, S20, extracting feature codes, position codes and segment codes of each mode respectively for image features, text information and entity information, learning an embedded representation as input of a multi-mode pre-training model, includes:
S21, for the bounding boxes and features output by the object detector, the position information of each box is represented with a 5-dimensional vector consisting of the box's upper-left coordinates, lower-right coordinates and the ratio of the box size to the whole image, and this 5-dimensional vector is passed through a linear fully connected layer to obtain the position code; 0 is passed as the segment information through a linear fully connected layer to obtain the segment code; the box features are passed through a linear fully connected layer to obtain the box feature code; finally, the position code, the segment code and the feature code are summed to obtain the embedded representation of the image modality;
S22, for the text sequence, the position information is represented with an increasing sequence of natural numbers and passed through a linear fully connected layer to obtain the position code; 1 is passed as the segment information through a linear fully connected layer to obtain the segment code; the text is passed through a linear fully connected layer to obtain the text feature code; finally, the position code, the segment code and the feature code are summed to obtain the embedded representation of the text.
S23, for the entity information, the position information is represented with an increasing sequence of natural numbers and passed through a linear fully connected layer to obtain the position code; 1 is passed as the segment information through a linear fully connected layer to obtain the segment code; the entities are passed through a linear fully connected layer to obtain the entity feature code; finally, the position code, the segment code and the feature code are summed to obtain the embedded representation of the entities.
S30, the multi-mode pre-training model inputs the image embedded representation, the text embedded representation and the entity embedded representation into three Transformers and progressively extracts mutually fused retrieval features of the three.
In a specific embodiment, step S30, in which the multi-mode pre-training model inputs the image embedded representation, the text embedded representation and the entity embedded representation into three Transformers and progressively extracts mutually fused retrieval features, includes:
S31, the visual/text/entity Transformer networks extract shallow features of the image, the text and the entities respectively, computing attention weights from Q and K and multiplying them with V to obtain each modality's feature representation, where the entity Transformer and the text Transformer share network parameters; this layer is repeated L times and its output is passed to the next network;
S32, the cross-Transformer network extracts mutual-attention features between modalities; this layer contains three independent cross multi-head self-attention networks, implemented by exchanging K and V across modalities: the visual cross-Transformer computes attention weights over the text to obtain cross-attended image features; the text cross-Transformer computes attention weights over the image and the entities to obtain cross-attended text features; the entity cross-Transformer computes attention weights over the text to obtain cross-attended entity features; this layer is repeated K times and its output is passed to the next network;
S33, the common Transformer network extracts fully fused features of the image, the text and the entities: in the text-visual common Transformer, the text features and the image features are concatenated, the weight of each vector attending to all features is computed from Q and K and multiplied with V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the text-entity common Transformer, the text features and the entity features are concatenated, the weight of each vector attending to all features is computed from Q and K and multiplied with V to obtain the feature representations of the text and of the entities, where Q, K and V are obtained from the concatenated features of the two modalities; for the image, the text and the entities, this layer uses a multi-head attention mechanism to compute attention weights over all features of the image, text and entities respectively, obtaining fully fused features; this layer is iterated H times.
S40, the multi-mode pre-training model performs network training with four self-supervised tasks, including a loss function for image-text contrastive learning.
In a specific embodiment, step S40, in which the multi-mode pre-training model performs network training with four self-supervised tasks including image-text contrastive learning, includes:
S41, inputting a text sequence with the hidden words into a multi-mode pre-training model by hiding the words in the title text, and learning and recovering the hidden words in the training process by the model so as to extract a characteristic representation with text information;
s42, inputting an entity sequence with the hidden words into a multi-mode pre-training model by hiding the words in the entity information, and learning and recovering the hidden words in the training process by the model so as to extract a characteristic representation with the entity information;
S43, inputting the image frame feature sequence with the mask into a pre-training model by masking the boundary frame features in the image, and learning and recovering the masked boundary frame features in the training process of the model so as to extract a feature representation with visual information;
S44, training the network with a contrastive loss function: for paired images and texts, the distance is reduced during training; for unpaired image-text pairs, the distance is increased, so that discriminative image-text features are learned.
S50, constructing the entity graph enhancement module, and learning entity knowledge with real semantic information through the node-based ranking loss and the subgraph-based ranking loss to enhance the feature representation.
In a specific embodiment, S50, constructing the entity graph enhancement module and learning entity knowledge with real semantic information through the node-based ranking loss and the subgraph-based ranking loss to enhance the feature representation, includes:
S51, for the text title and the entity information extracted from it, initializing an entity queue, encoding the queue with the common Transformer to obtain a joint embedded representation of the entity features, and feeding it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph; training with the node-based ranking loss and the subgraph-based ranking loss so that each entity attends more to its semantic neighbors and a good node representation is obtained.
S52, in the node-based ranking loss, negative samples are selected within a mini-batch using cosine similarity: for each entity in the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the ranking loss is computed with the selected negative sample.
S53, in the subgraph-based ranking loss, the embedded representation of any subgraph is computed by global pooling over its k nearest-neighbor nodes, and the ranking loss is computed between the nearest-neighbor positive and negative subgraphs of each node.
S60, inputting the image information and the text information of each single sample into the multi-mode model to extract the retrieval characteristics, and storing the retrieval characteristics in a retrieval library.
S70, inputting the image features, text information and entity information into a trained multi-mode model for each combined commodity, and extracting retrieval features of image-text fusion; and calculating cosine similarity of the fusion characteristics and each single product in the sample library, and selecting the single product with the highest similarity as a retrieval result to return.
Example 2
As shown in fig. 6, a combined commodity retrieval method based on a multi-mode pre-training model comprises the following steps and details:
Step 1: for a combined commodity image and its text, input the image into a pre-trained top-down attention model and extract the bounding-box positions B = (b_0, b_1, b_2, ..., b_K) and feature information F = (f_0, f_1, f_2, ..., f_K) of all individual products in the image; pass the title text T = (t_0, t_1, t_2, ..., t_L) through the parsing tool to obtain the entity information E = (e_0, e_1, e_2, ..., e_H).
Step 2: extract the feature code, position code and segment code of each modality for the image features, the text information and the entity information respectively, and learn embedded representations as the input of the multi-mode pre-training model.
The bounding-box positions and features extracted by the detector network are used as the image-feature input I = ((b_0, f_0), (b_1, f_1), (b_2, f_2), ..., (b_K, f_K)); their feature code, position code and segment code are extracted through an embedding layer and summed to obtain the image embedded representation E_img. The commodity title T = (t_0, t_1, t_2, ..., t_L) is passed through an embedding layer to extract its feature code, position code and segment code, which are summed to obtain the text embedded representation E_txt. The entity information E = (e_0, e_1, e_2, ..., e_H) is passed through an embedding layer to extract its feature code, position code and segment code, which are summed to obtain the entity embedded representation E_ent.
Specifically, as shown in FIG. 7, the feature code vector is obtained by passing the bounding-box features F through a fully connected layer as σ(w_1 F + b_1), where w_1 and b_1 are the parameters of the fully connected layer and σ is the activation function.
For the bounding boxes extracted by the commodity detector, the ratio of each box's area to the whole image is computed and a 5-dimensional position vector P_box (upper-left coordinates, lower-right coordinates and area ratio) is constructed; the position code vector is output by a fully connected layer as σ(w_2 P_box + b_2), where w_2 and b_2 are the parameters of the fully connected layer and σ is the activation function.
The integer 0 is used as the segment information S_img of the image modality, and the segment code vector is obtained through a fully connected layer as σ(w_3 S_img + b_3), where w_3 and b_3 are the parameters of the fully connected layer and σ is the activation function.
The feature code vector, the position code vector and the segment code vector are summed to obtain the embedded representation of the image modality:
E_img = σ(w_1 F + b_1) + σ(w_2 P_box + b_2) + σ(w_3 S_img + b_3)
Similarly, as shown in FIG. 7, the title text T is passed through the embedding layer to obtain its feature code vector σ(w_4 T + b_4), where w_4 and b_4 are the parameters of the fully connected layer and σ is the activation function.
The position information of the words in the title (a natural-number sequence) P is processed by a fully connected layer to obtain the position code vector σ(w_5 P + b_5), where w_5 and b_5 are the parameters of the fully connected layer and σ is the activation function.
The integer 1 is used as the segment information S_txt of the text modality, and the segment code vector is obtained through a fully connected layer as σ(w_6 S_txt + b_6), where w_6 and b_6 are the parameters of the fully connected layer and σ is the activation function.
The feature code vector, the position code vector and the segment code vector are summed to obtain the embedded representation of the text modality:
E_txt = σ(w_4 T + b_4) + σ(w_5 P + b_5) + σ(w_6 S_txt + b_6)
The embedded representation of the entity is the same as the embedded representation of the text.
Step 3: input the image embedded representation E_img, the text embedded representation E_txt and the entity embedded representation E_ent into the three Transformers, extract the mutually fused image-text retrieval features H, and train the model with four self-supervised tasks.
Specifically, as shown in FIG. 7, the image embedding feature E_img, the text embedding feature E_txt and the entity embedding feature E_ent are encoded by an image Transformer, a text Transformer and an entity Transformer respectively to obtain the individual features H_img, H_txt and H_ent of the image, the text and the entities. The image, text and entity Transformers each have four layers, and each layer is computed as
H^t = H^(t-1) + MSA(LN(H^(t-1))),  H^t = H^t + MLP(LN(H^t))
where t-1 and t are the Transformer layer indices, LN is the LayerNorm layer that performs feature normalization, MLP is the fully connected layer and MSA is the multi-head self-attention layer, computed as
MSA(H) = Concat(Head_1, ..., Head_h) W^O
After the individual features H_img, H_txt and H_ent of the image, the text and the entities are obtained, they are passed into the cross-Transformers so that the image information and the text information, and the entity information and the text information, attend to each other, yielding the cross-attended features CMSA(H_img, H_txt), CMSA(H_txt, H_img, H_ent) and CMSA(H_ent, H_txt), where CMSA is the cross-modal cross multi-head attention network, computed as
CMSA(H_img, H_txt) = Concat(Head_1(H_img, H_txt), ..., Head_n(H_img, H_txt))
CMSA(H_txt, H_img, H_ent) = Concat(Head_1(H_txt, H_img), ..., Head_n(H_txt, H_img), Head_1(H_txt, H_ent), ..., Head_n(H_txt, H_ent))
CMSA(H_ent, H_txt) = Concat(Head_1(H_ent, H_txt), ..., Head_n(H_ent, H_txt))
The image/text/entity features output by the cross-Transformers are then passed into the common Transformers for more comprehensive feature fusion.
The text-visual common Transformer lets each region in the image modality attend to the features of the other regions and of all words, and each word in the text modality attend to the features of the other words and of all image regions; it applies multi-head self-attention to the concatenated features, MSA(Concat(H_img, H_txt)).
The text-entity common Transformer lets each word in the entities attend to the features of the other entity words and of all the text, and each word in the text modality attend to the features of the other words and of all the entities; it applies multi-head self-attention to the concatenated features, MSA(Concat(H_txt, H_ent)).
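The three stages described above (single-modality Transformers repeated L times, cross-Transformers repeated K times, common Transformers iterated H times) can be pictured with the following skeleton; the layer counts, module classes and dimensions are placeholders, and the use of nn.TransformerEncoderLayer and nn.MultiheadAttention is an assumption for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

hidden, heads = 768, 12                       # dimensions are placeholders
L_layers, H_layers = 4, 2                     # single-modality and common stack depths (assumed)

def encoder_stack(num_layers):
    layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

img_enc = encoder_stack(L_layers)             # image Transformer
txt_enc = encoder_stack(L_layers)             # text Transformer; the entity branch reuses it (shared parameters)
common = encoder_stack(H_layers)              # common Transformer over concatenated sequences
cross_img = nn.MultiheadAttention(hidden, heads, batch_first=True)
cross_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)
cross_ent = nn.MultiheadAttention(hidden, heads, batch_first=True)

E_img, E_txt, E_ent = torch.rand(1, 4, hidden), torch.rand(1, 6, hidden), torch.rand(1, 3, hidden)

# stage 1: single-modality encoding (repeated L times inside each stack)
H_img, H_txt, H_ent = img_enc(E_img), txt_enc(E_txt), txt_enc(E_ent)

# stage 2: one cross-attention round (of K): image<->text and entity<->text exchange keys/values
kv_for_txt = torch.cat([H_img, H_ent], dim=1)
C_img, _ = cross_img(H_img, H_txt, H_txt)             # image attends over text
C_txt, _ = cross_txt(H_txt, kv_for_txt, kv_for_txt)   # text attends over image and entities
C_ent, _ = cross_ent(H_ent, H_txt, H_txt)             # entities attend over text

# stage 3: common Transformer fuses the concatenated text-visual (and, analogously, text-entity) sequences
fused = common(torch.cat([C_img, C_txt], dim=1))
print(fused.shape)                             # torch.Size([1, 10, 768])
```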
Four pre-training tasks are used to train the model structure described above: a text masking task, an entity masking task, an image-region masking task, and a cross-modal contrastive learning task.
Specifically, for each image-text pair I = {I_1, I_2, I_3, ..., I_K}, T = {T_1, T_2, T_3, ..., T_L} and the extracted entities E = {E_0, E_1, E_2, ..., E_H}:
The text masking task replaces each input text word with "[MASK]" with 15% probability; the model predicts the masked words from the remaining words and the image, with loss function
L_MLM = -E log P_θ(T_m | T\m, I)
where T_m denotes the masked words and T\m the remaining unmasked words.
The entity masking task replaces each input entity word with "[MASK]" with 15% probability; the model predicts the masked words from the remaining words and the image, with loss function
L_MEM = -E log P_θ(E_m | E\m, I)
The image-region masking task replaces the input image box features with the zero vector with 15% probability; the model predicts the masked image-region features from the remaining image regions and the sentence words, and the corresponding loss is computed over the masked regions.
Besides learning the features of each modality, the model needs to keep the different input modalities consistent so as to learn the image-text correspondence, so a cross-modal contrastive learning task is used to align the image modality and the text modality. For N image-text pairs in a training batch there are 2N samples in total; for each sample, the data of the corresponding other modality is treated as its positive pair and the remaining samples are treated as negative pairs. For each input image-text pair (I_i, T_i), the image and text Transformers of the model output the image-text features h_i^img and h_i^txt. Writing z_1, ..., z_2N for the 2N image and text features in the batch and p(i) for the index of z_i's paired feature from the other modality, the loss function is
L_cl = -(1/2N) Σ_{i=1}^{2N} log [ exp(s(z_i, z_p(i))/τ) / Σ_{j=1}^{2N} 1_{[j≠i]} exp(s(z_i, z_j)/τ) ]
where s(u, v) computes the similarity between the image-text features u and v, τ is a temperature parameter, and 1_{[i≠j]} is an indicator that returns 1 if and only if i ≠ j. The contrastive loss pulls paired image-text vectors closer together and pushes unpaired image-text vectors apart.
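For illustration, the following sketch implements a temperature-scaled contrastive objective over paired image-text features; whether the patent's loss is symmetrized over both directions and how the features are normalized are assumptions here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, tau=0.07):
    """img_feats, txt_feats: (N, d) features of N paired samples; the i-th row of each is a positive pair."""
    img = F.normalize(img_feats, dim=1)
    txt = F.normalize(txt_feats, dim=1)
    logits = img @ txt.t() / tau                    # s(u, v) / tau for every image-text combination
    targets = torch.arange(img.size(0))
    # cross-entropy over rows (image -> text) and over columns (text -> image), averaged
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

h_img = torch.rand(8, 768)
h_txt = torch.rand(8, 768)
print(contrastive_loss(h_img, h_txt).item())
```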
Step 4: during training, construct the entity graph enhancement module from the title text and the corresponding entity information, and enhance the feature representation by learning entity knowledge with real semantic information through the node-based ranking loss and the subgraph-based ranking loss.
Specifically, for the text title and the entity information extracted from it, an entity queue l is initialized; the queue is encoded by the common Transformer to obtain a joint embedded representation l_f of the entity features, which is then fed into a pre-trained AdaGAE network for graph clustering to obtain an embedded representation l_g of the entity graph. Training uses the node-based ranking loss and the subgraph-based ranking loss:
In the node-based ranking loss, negative samples are selected within a mini-batch using cosine similarity: for any entity h_ei in the batch, one sample h_ek is randomly selected from the k entity samples with the lowest similarity as the negative sample, and the ranking loss is computed with the selected negative sample.
In the subgraph-based ranking loss, the embedded representation of any subgraph is computed by global pooling over its k nearest-neighbor nodes; for the nearest-neighbor positive and negative subgraphs h_gi and h_gk of any entity h_ei, the ranking loss is computed.
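Under stated assumptions (mean pooling as the global pooling, cosine-similarity kNN, and a margin ranking loss, none of which are fixed by this excerpt), the sketch below forms a node's subgraph embedding from its k nearest neighbors and compares positive and negative subgraphs.

```python
import torch
import torch.nn.functional as F

def subgraph_embedding(entity_emb, node, k=3):
    """Global (mean) pooling over the k nearest-neighbor nodes of `node`, by cosine similarity."""
    z = F.normalize(entity_emb, dim=1)
    sims = z @ z[node]
    sims[node] = float("-inf")                 # exclude the node itself from its neighborhood
    knn = sims.topk(k).indices
    return entity_emb[knn].mean(dim=0)

def subgraph_ranking_loss(entity_emb, node, pos_node, neg_node, margin=0.2, k=3):
    """Margin ranking loss between the positive and negative subgraph embeddings (assumed loss form)."""
    h = F.normalize(entity_emb[node], dim=0)
    h_pos = F.normalize(subgraph_embedding(entity_emb, pos_node, k), dim=0)
    h_neg = F.normalize(subgraph_embedding(entity_emb, neg_node, k), dim=0)
    return F.relu(margin - h @ h_pos + h @ h_neg)

emb = torch.rand(10, 32)
print(subgraph_ranking_loss(emb, node=0, pos_node=1, neg_node=7).item())
```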
Step 5: for each individual-product sample, extract the detection boxes and text words with the same processing as in training, input them into the multi-modal model, extract the image-text fused features, and store the features of each individual-product sample in the retrieval library.
Step 6: for each combined commodity, extract the bounding box and bounding-box features of each target commodity in the image with the trained detector, input them together with the text information into the trained multi-mode model, extract the image-text fused retrieval features, compute the similarity between these features and each individual-product feature in the retrieval library, and rank by similarity to obtain the best-matching individual-product sample as the final result.
Example 3
As shown in fig. 8-11, a combined commodity retrieval system based on a multi-modal pre-training model, comprising:
The multi-mode pre-training model module 10 is used for extracting the retrieval characteristics of image-text fusion, three different self-attention network layers are used, and a self-supervision learning mode is adopted to train the model, so that the model can extract the characteristics of multi-mode fusion;
The entity graph enhancement learning module 20 is configured to construct relationships between entity nodes to facilitate more accurate instance-level retrieval; it encodes the entity information with a self-attention network layer, divides subgraphs with a graph clustering network, and performs network training with the node-based ranking loss and the subgraph-based ranking loss to obtain a good semantic relationship representation.
The multi-mode commodity retrieval module 30 is used for extracting the characteristics of the single commodity and storing the characteristics in a retrieval library, extracting the characteristics of each target commodity of the combination product and retrieving the characteristics in the retrieval library, extracting the detection characteristics of the single commodity by using the detector module, and inputting the detection characteristics into the multi-mode pre-training model module to extract the retrieval characteristics; the detector module is used for extracting the boundary box and boundary box characteristics of each target commodity in the combination, the boundary box area and the title are input into the multi-mode pre-training model module for extracting query characteristics, the similarity between the query characteristics and the retrieval characteristics is calculated, and the most similar commodity is returned.
In a specific embodiment, the multi-modal pre-training model module 10 further includes:
An image, text and entity embedded representation sub-module 11, which is used for encoding the information of the image mode to input a model, taking the position coordinates of an image frame and the area proportion of the frame to the image as the position codes, and respectively transmitting the position codes and the area proportion of the frame to the linear layer to obtain embedded representation of the image by combining the detection characteristics of the segmented codes and the frame; taking the serial number of each word in the text and the entity as a position code, combining the segment codes and transmitting the serial number into a linear layer to obtain embedded representation of the text and the entity, and transmitting the embedded representation of the image, the text and the entity into a multi-head self-attention network;
A multi-head self-attention network sub-module 12, configured to extract highly fused retrieval features of images, texts and entities, perform interaction between a plurality of modal information by using three different multi-head self-attention networks, and extract fully fused features;
The image-text-entity multi-mode pre-training sub-module 13 is used for training the multi-modal model to learn discriminative features; four self-supervised tasks complete the multi-modal fusion feature learning, including an image-region masking task, a text masking task, an entity masking task and image-text cross-modal contrastive learning.
In a specific embodiment, the entity graph enhancement learning module 20 further includes:
The entity graph construction sub-module 21, which is configured to construct a node graph and divide subgraphs for the given entity information: the initialized entity queue is encoded with the common Transformer to obtain a joint embedded representation of entity features, which is fed into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph.
The entity graph semantic-information learning sub-module 22, which is configured to learn the semantic relationship representation among nodes to facilitate more accurate instance-level retrieval: training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors and yields a good node representation.
In a specific embodiment, the multi-modal merchandise retrieval module 30 further includes:
the individual-product feature extraction sub-module 31 is used for extracting individual-product features and storing them in the retrieval library: the image of each individual-product sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes, their features and the text are then used as inputs to the multi-mode pre-training model to extract image-text fused features, which are stored in the retrieval library;
A combined-product feature extraction sub-module 32, which is used for extracting the features of each target commodity in a combined product: for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, together with the title, are input into the multi-mode model module to extract the image-text fused features of each target commodity in the combined product.
The feature retrieval sub-module 33 is configured to retrieve the single product result from the combination, calculate cosine similarity between each target commodity and each single product of the combination, and select the nearest single product as the result according to the cosine similarity for returning.
The same or similar reference numerals correspond to the same or similar components;
the positional relationship depicted in the drawings is for illustrative purposes only and is not to be construed as limiting the present patent;
It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (10)

1. A combined commodity retrieval method based on a multi-modal pre-training model, characterized by comprising the following steps:
S1: for the combined commodity image and its title text, inputting the image into a pre-trained bottom-up attention model, and extracting the bounding-box positions and feature information of all single commodities in the image; parsing the title text to obtain entity information;
S2: extracting the feature codes, position codes and segment codes of the image features, the title text information and the entity information respectively, and learning their embedded representations as the input of the multi-modal pre-training model;
S3: inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three Transformers, progressively extracting their mutually fused retrieval features, and training with four self-supervised tasks;
S4: during training, constructing an entity graph enhancement module with the title text and the corresponding entity information, and learning entity knowledge with real semantic information through a node-based ranking loss and a subgraph-based ranking loss to enhance the feature representation;
S5: for each single-commodity sample, inputting its image information and title text information into the multi-modal model to extract retrieval features, and storing them in a retrieval library;
S6: for each combined commodity, inputting its image features, title text information and entity information into the pre-trained multi-modal model, and extracting the image-text fused retrieval features; calculating the cosine similarity between the fused features and each single product in the retrieval library, and returning the single product with the highest similarity as the retrieval result.
2. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 1, wherein in step S1:
the process of extracting the bounding-box positions and feature information of all single commodities in the image with the bottom-up attention model comprises:
inputting the combined commodity image into a bottom-up attention model pre-trained on the VG (Visual Genome) dataset, used as the target detector, extracting the bounding-box position information and bounding-box features of each single commodity, and taking them as the image feature input of the combined commodity;
the process of obtaining the entity information from the title text through a parsing tool comprises:
extracting a noun entity set from the title text with an NLP parsing tool, and taking it as the entity information input.
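Purely as an illustration of such a parsing step, the sketch below extracts a noun set from a title with spaCy; spaCy and the en_core_web_sm model are one possible choice of NLP parsing tool, not the tool specified by the patent.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(title: str):
    doc = nlp(title)
    # keep the head noun of each noun chunk as the entity set
    return sorted({chunk.root.text.lower() for chunk in doc.noun_chunks})

# extract_entities("wooden dining table with four upholstered chairs")
# -> e.g. ['chairs', 'table']
```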
3. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 2, wherein the specific process of step S2 is:
S21: for the bounding-box positions and feature information, describing the position information of each box with a 5-dimensional vector consisting of the upper-left corner coordinates of the box, the lower-right corner coordinates, and the ratio of the box size to the whole image, and passing the 5-dimensional vector through a linear fully-connected layer to obtain the position code; passing 0, as the segment information, through a linear fully-connected layer to obtain the segment code; passing the box features through a linear fully-connected layer to obtain the feature code of the boxes; finally, adding the position code, the segment code and the feature code to obtain the embedded representation of the image modality;
S22: for the title text information, expressing its position information with an increasing natural-number sequence and passing it through a linear fully-connected layer to obtain the position code; passing 1, as the segment information, through a linear fully-connected layer to obtain the segment code; passing the title text through a linear fully-connected layer to obtain the feature code of the text; finally, adding the position code, the segment code and the feature code to obtain the embedded representation of the title text;
S23: for the entity information, expressing its position information with an increasing natural-number sequence and passing it through a linear fully-connected layer to obtain the position code; passing 1, as the segment information, through a linear fully-connected layer to obtain the segment code; passing the entity through a linear fully-connected layer to obtain the feature code of the entity; finally, adding the position code, the segment code and the feature code to obtain the embedded representation of the entity.
4. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 3, wherein in step S3, the image embedded representation, the title text embedded representation and the entity embedded representation are input into three Transformers, and the step of progressively extracting their mutually fused retrieval features comprises:
1): the image / title text / entity Transformer networks extract the shallow features of the image, the title text and the entity respectively; the attention weights are computed from Q and K and multiplied by V to obtain the feature representation of each modality, wherein the entity Transformer and the text Transformer share network parameters; this layer is repeated L times and the result is then passed to the next network;
2): the cross-Transformer network extracts the mutual attention features between the modalities; this layer comprises three independent cross multi-head self-attention networks, implemented by exchanging K and V between different modalities (a minimal sketch of this exchange is given after this claim); the image cross-Transformer computes attention weights over the text to obtain the image features after cross attention; the title text cross-Transformer computes attention weights over the image and the entity to obtain the text features after cross attention; the entity cross-Transformer computes attention weights over the text to obtain the entity features after cross attention; this layer is repeated K times and the result is then passed to the next network;
3): the common Transformer network extracts the fully fused features of the image, the title text and the entity; in the text-visual common Transformer, the title text features and the image features are concatenated, the weight of each vector over all features is computed from Q and K and multiplied by V to obtain the feature representation of the text and the feature representation of the image, wherein Q, K and V are obtained from the concatenated features of the two modalities; in the title text-entity common Transformer, the title text features and the entity features are concatenated, the weight of each vector over all features is computed from Q and K and multiplied by V to obtain the feature representation of the title text and the feature representation of the entity, wherein Q, K and V are obtained from the concatenated features of the two modalities; for the image, the title text and the entity, this layer uses a multi-head attention mechanism to compute attention weights over all the features of the image, the title text and the entity respectively, so as to obtain the fully fused features; this layer iterates H times.
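A minimal sketch of the cross multi-head attention used in step 2) is shown below: queries come from one modality while keys and values are taken from another, here built on torch.nn.MultiheadAttention with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class CrossAttentionLayer(nn.Module):
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, query_modality, other_modality):
        # e.g. query_modality = image features, other_modality = title-text features:
        # the image attends over the text, i.e. K and V are taken from the text.
        out, _ = self.attn(query_modality, other_modality, other_modality)
        return self.norm(query_modality + out)
```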
5. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 4, wherein in step S3, the training process with the four self-supervised tasks comprises:
1) masking words in the title text, inputting the masked title text sequence into the multi-modal pre-training model, and letting the model learn to recover the masked words during training, so as to extract a feature representation carrying the title text information;
2) masking words in the entity information, inputting the masked entity sequence into the multi-modal pre-training model, and letting the model learn to recover the masked words during training, so as to extract a feature representation carrying the entity information;
3) masking bounding-box features in the image, inputting the masked image-box feature sequence into the pre-training model, and letting the model learn to recover the masked bounding-box features during training, so as to extract a feature representation carrying the visual information;
4) training the network with a contrastive learning loss function: for paired image and title text pairs, the distance is shortened during training; for unpaired image and title text pairs, the distance is enlarged during training, so that discriminative image-text features are learned.
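One common way to realise objective 4) is an InfoNCE-style loss over a mini-batch, sketched below; the temperature value and the symmetric image-to-text / text-to-image form are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    # img_feats, txt_feats: (B, D); row i of each tensor is a matching image-title pair
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) similarity of every image to every title
    targets = torch.arange(img.size(0))    # matching pairs sit on the diagonal
    # pull paired features together, push unpaired ones apart, in both directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```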
6. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 5, wherein in step S4, the process of constructing the entity graph enhancement module with the title text and the corresponding entity information during training comprises:
S41: initializing an entity queue for the title text and the entity information extracted from it, encoding the queue through the common Transformer to obtain a joint embedded representation of the entity features, and then inputting it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph; training with the node-based ranking loss and the subgraph-based ranking loss, so that each entity pays more attention to its semantic neighbours and a good node representation is obtained;
S42: in the node-based ranking loss, for a mini-batch of data, negative samples are selected by cosine similarity: for each entity in the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the ranking loss is calculated with the selected negative sample;
S43: in the subgraph-based ranking loss, the embedded representation of any subgraph feature is calculated by global pooling over the k nearest-neighbour nodes, and the ranking loss is calculated for the nearest-neighbour positive and negative subgraphs of any node.
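A sketch of the subgraph-based ranking loss in S43 is given below; mean pooling is used for the global pooling over the k nearest neighbours, and the margin and the way positive/negative neighbour sets are supplied are illustrative assumptions.

```python
import torch.nn.functional as F

def subgraph_embedding(node_feats, neighbour_idx):
    # node_feats: (N, D); neighbour_idx: (N, k) indices of each node's k nearest neighbours
    return node_feats[neighbour_idx].mean(dim=1)   # (N, D) pooled subgraph features

def subgraph_ranking_loss(node_feats, pos_neighbours, neg_neighbours, margin=0.2):
    pos_sub = subgraph_embedding(node_feats, pos_neighbours)   # positive (semantic) subgraphs
    neg_sub = subgraph_embedding(node_feats, neg_neighbours)   # negative subgraphs
    pos_sim = F.cosine_similarity(node_feats, pos_sub, dim=-1)
    neg_sim = F.cosine_similarity(node_feats, neg_sub, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```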
7. A combined commodity retrieval system based on a multi-modal pre-training model, comprising:
a multi-modal pre-training model module, configured to extract image-text fused retrieval features, wherein three different self-attention network layers are used and the model is trained in a self-supervised learning manner, so that it can extract multi-modally fused features;
an entity graph enhancement learning module, configured to construct the relations among entity nodes and help more accurate instance-level retrieval, wherein the entity information is encoded with a self-attention network layer, subgraphs are divided with a graph clustering network, and network training is performed with the node-based ranking loss and the subgraph-based ranking loss to obtain a good semantic relation representation;
a multi-modal commodity retrieval module, configured to extract the features of single products and store them in a retrieval library, and to extract the features of each target commodity in a combined product and retrieve them in the retrieval library, wherein a detector module is used to extract the detection features of each single product, which are input into the multi-modal pre-training model module to extract the retrieval features; the detector module is used to extract the bounding box and bounding-box features of each target commodity in the combination, the bounding-box regions and the title are input into the multi-modal pre-training model module to extract the query features, the similarity between the query features and the retrieval features is calculated, and the most similar commodity is returned.
8. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 7, wherein the multi-modal pre-training model module comprises:
an image, title text and entity embedded representation sub-module, configured to encode the image-modality information as model input, wherein the position coordinates of each image bounding box and the ratio of the box area to the whole image are used as the position code and, together with the segment code and the detected box features, are passed through linear layers to obtain the embedded representation of the image; the index of each word in the title text and in the entity is used as the position code and, together with the segment code, is passed through a linear layer to obtain the embedded representations of the title text and the entity; and the embedded representations of the image, the title text and the entity are fed into the multi-head self-attention network;
a multi-head self-attention network sub-module, configured to extract the highly fused retrieval features of the image, the title text and the entity, wherein three different multi-head self-attention networks are used to perform interaction among the multiple modalities and extract the fully fused features;
an image-text-entity multi-modal pre-training sub-module, configured to train the multi-modal model to learn discriminative features, wherein four self-supervised tasks are used to complete the multi-modal fusion feature learning, namely an image region masking task, a title text masking task, an entity masking task, and image-text cross-modal contrastive learning.
9. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 8, wherein the entity graph enhancement learning module comprises:
an entity graph construction sub-module, configured to construct a node graph and divide subgraphs for the given entity information, wherein the initialized entity queue is encoded by the common Transformer to obtain a joint embedded representation of the entity features, which is then input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph;
an entity graph semantic information learning sub-module, configured to learn the semantic relation representation among the nodes and help more accurate instance-level retrieval, wherein training is performed with the node-based ranking loss and the subgraph-based ranking loss so that each entity pays more attention to its semantic neighbours and a good node representation is obtained.
10. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 9, wherein the multi-modal commodity retrieval module comprises:
a single-product feature extraction sub-module, configured to extract the features of single products and store them in the retrieval library, wherein for each single-product sample the image is input into the trained detector to extract the image bounding boxes and bounding-box features; all the bounding boxes, the bounding-box features and the text are taken as inputs of the multi-modal pre-training model, and the image-text fused features are extracted and stored in the retrieval library;
a combination feature extraction sub-module, configured to extract the features of each target commodity in the combined product, wherein for each query combined commodity the image is input into the trained detector to extract the image bounding boxes and bounding-box features; each bounding box and its features are input, together with the title, into the multi-modal model module, and the image-text fused features of each target commodity in the combined product are extracted;
a feature retrieval sub-module, configured to retrieve single-product results for the combined product, wherein the cosine similarity between each target commodity in the combination and each single product is calculated, and the closest single product according to the cosine similarity is selected and returned as the result.
CN202210453799.4A 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model Active CN114840705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210453799.4A CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Publications (2)

Publication Number Publication Date
CN114840705A CN114840705A (en) 2022-08-02
CN114840705B true CN114840705B (en) 2024-04-19

Family

ID=82568012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210453799.4A Active CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Country Status (1)

Country Link
CN (1) CN114840705B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080766B (en) * 2022-08-16 2022-12-06 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN116051132B (en) * 2023-04-03 2023-06-30 之江实验室 Illegal commodity identification method and device, computer equipment and storage medium
CN117035074B (en) * 2023-10-08 2024-02-13 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117708354B (en) * 2024-02-06 2024-04-30 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CA3167569A1 (en) * 2020-04-15 2021-10-21 Maksymilian Clark Polaczuk Systems and methods for determining entity attribute representations
CN112860930A (en) * 2021-02-10 2021-05-28 浙江大学 Text-to-commodity image retrieval method based on hierarchical similarity learning
CN114036246A (en) * 2021-12-06 2022-02-11 国能(北京)商务网络有限公司 Commodity map vectorization method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A recommendation algorithm framework based on joint item collocation degree; Yao Jingtian; Wang Yongli; Shi Qiuyan; Dong Zhenjiang; Journal of University of Shanghai for Science and Technology; 2017-02-15 (No. 01); full text *
A collaborative attention network model for cross-modal retrieval; Deng Yijiao; Zhang Fengli; Chen Xueqin; Ai Qing; Yu Suzhe; Computer Science; 2020-12-31 (No. 04); full text *

Also Published As

Publication number Publication date
CN114840705A (en) 2022-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Zhang Jingyi; Dong Xiao; Zhan Xunlin; Liang Xiaodan
Inventor before: Zhan Xunlin; Wu Yangxin; Dong Xiao; Liang Xiaodan
GR01 Patent grant