CN114840705A - Combined commodity retrieval method and system based on multi-mode pre-training model - Google Patents

Combined commodity retrieval method and system based on multi-mode pre-training model

Info

Publication number
CN114840705A
CN114840705A (application CN202210453799.4A)
Authority
CN
China
Prior art keywords
entity
image
text
information
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210453799.4A
Other languages
Chinese (zh)
Other versions
CN114840705B (en)
Inventor
詹巽霖
吴洋鑫
董晓
梁小丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210453799.4A priority Critical patent/CN114840705B/en
Publication of CN114840705A publication Critical patent/CN114840705A/en
Application granted granted Critical
Publication of CN114840705B publication Critical patent/CN114840705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a combined commodity retrieval method and system based on a multi-modal pre-training model. The method uses information from both the picture and text modalities and also introduces entity information into the network training process, so that image features and text features can be effectively and deeply fused, retrieval features with higher discrimination are extracted, and the problem of incomplete single-modality information is solved. The method realizes instance-level retrieval of combined commodities, i.e. all the single commodities contained in a combined commodity are retrieved and returned, which improves commodity search precision and helps online users find more accurate and specific commodities. The method can also be used to construct an e-commerce knowledge graph and to mine commodity relations, and the commodity relations obtained by combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms.

Description

Combined commodity retrieval method and system based on multi-mode pre-training model
Technical Field
The invention relates to the field of example-level commodity retrieval, in particular to a combined commodity retrieval method and system based on a multi-mode pre-training model.
Background
With the rapid development of electronic commerce, the variety of commodities keeps growing and the demands of online customers are becoming larger and more diverse. In e-commerce, many commodities are presented as a package, i.e. multiple instances of different commodities appear in one image. However, a customer or merchant may wish to find a single product within such a product portfolio, for similar-item retrieval or online item recommendation, or to match the same product for price comparison.
In the field of commodity retrieval, existing methods take data of a single modality as input, such as a text or a picture, and then perform matching search in a retrieval library. However, in the e-commerce field, pictures and texts coexist widely in retrieval libraries, and because multi-modal data are not fully utilized, the current retrieval mode greatly limits real use scenarios.
In addition, existing methods mainly focus on relatively simple settings such as picture-level retrieval, which neither determines whether multiple objects exist in a commodity picture nor distinguishes between them; instance-level commodity retrieval, in contrast, retrieves every single commodity contained in a combined commodity, and there is little research on such retrieval methods. Existing methods also rely on annotated information for training and therefore lack generalization on large-scale real-scene data sets; as the heterogeneous data generated by shopping websites keep accumulating, it is difficult for multi-modal retrieval algorithms to handle such large-scale, weakly annotated data.
Combined commodity retrieval therefore has high practical value and application prospects in the e-commerce field. First, it helps improve commodity search precision and helps online users find more accurate and specific commodities; second, it can be used to construct an e-commerce knowledge graph and to mine commodity relations; third, the commodity relations obtained by combined commodity retrieval can be used for commodity recommendation, improving the recommendation effect of shopping platforms. In a real scenario with a large data volume and a lack of labels, how to perform multi-modal instance-level combined commodity retrieval is a problem of practical value that remains unsolved.
One prior patent first obtains interaction information of commodities to be recommended, where the interaction information includes an ID information set of the commodities to be recommended, a user ID information set, popularity information of the commodities to be recommended, and user-commodity interaction records, the popularity information representing the amount of interaction between users and commodities; the interaction information of the commodities to be recommended is then input into a trained commodity recommendation model to obtain recommendation results, where the trained recommendation model is built on a graph attention network and is trained with sample commodity ID information and sample user ID information labeled as negative samples, the negative samples being determined from sample commodity popularity information and preset hyper-parameters. However, that patent does not address the problem of low accuracy caused by commodity retrieval that relies on single-modal data and picture-level retrieval.
Disclosure of Invention
The invention provides a combined commodity retrieval method based on a multi-mode pre-training model, which solves the problem of low accuracy caused by commodity retrieval depending on single-mode data and picture level retrieval.
The invention further aims to provide a combined commodity retrieval system based on the multi-mode pre-training model.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a combined commodity retrieval method based on a multi-mode pre-training model comprises the following steps:
S1: for a combined commodity image and its title text, inputting the image into a pre-trained top-down attention model, and extracting the bounding-box positions and feature information of all single commodities in the image; obtaining entity information from the title text through an analysis tool;
S2: for the image features, the title text information and the entity information, extracting feature codes, position codes and segment codes respectively, and learning embedded representations as the input of the multi-modal pre-training model;
S3: inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three kinds of Transformers, progressively extracting their mutually fused retrieval features, and training with four self-supervision tasks;
S4: during training, constructing an entity graph enhancement module using the title text and the corresponding entity information, learning entity knowledge with real semantic information through the node-based arrangement loss and the subgraph-based arrangement loss, and enhancing the feature representation;
S5: inputting the image information and title text information of each single-commodity sample into the multi-modal model to extract retrieval features, and storing them in a retrieval library;
S6: for each combined commodity, inputting the image features, title text information and entity information into the pre-trained multi-modal model, and extracting the image-text fused retrieval features; calculating the cosine similarity between the fused features and each single commodity in the sample library, and returning the single commodity with the highest similarity as the retrieval result.
Further, in step S1, the process of extracting the bounding-box positions and feature information of all single commodities in the image with the top-down attention model comprises:
using a bottom-up attention model pre-trained on the Visual Genome (VG) data set as the target detector, inputting the combined commodity image, and extracting the bounding-box position information and bounding-box features of each single commodity as the image feature input of the combined commodity;
the process of obtaining entity information from the title text through the analysis tool comprises:
extracting a set of noun entities from the title text with an NLP analysis tool as the entity information input.
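For illustration only, the following sketch shows one way the inputs of step S1 could be gathered; the `detector` callable and the use of spaCy noun chunks are assumptions, since the invention does not prescribe a particular detector implementation or NLP analysis tool.

```python
# Illustrative sketch of step S1 (assumed interfaces, not the patented implementation).
# `detector` is assumed to wrap a bottom-up-attention style Faster R-CNN that returns,
# for one image, K bounding boxes and K region feature vectors.
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the unspecified NLP analysis tool

def extract_s1_inputs(image, title, detector):
    boxes, region_feats = detector(image)   # assumed shapes: (K, 4) and (K, D)
    # noun phrases of the product title serve as the entity set
    entities = [chunk.text for chunk in nlp(title).noun_chunks]
    return boxes, region_feats, entities
```

For a title such as "wooden dining table with four chairs", the entity set would contain noun phrases like "wooden dining table" and "four chairs".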
Further, the specific process of step S2 is:
S21: for the bounding-box positions and feature information, a 5-dimensional vector is used to encode the position information of each box, comprising the coordinates of the upper-left and lower-right corners of the box and the proportion of the box area to the whole image; the 5-dimensional vector is passed through a linear fully-connected layer to obtain the position code; the integer 0 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the box features are passed through a linear fully-connected layer to obtain the box feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the image modality;
S22: for the title text information, an increasing sequence of natural numbers is used to represent its position information and passed through a linear fully-connected layer to obtain the position code; the integer 1 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the title text is passed through a linear fully-connected layer to obtain the text feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the title text;
S23: for the entity information, an increasing sequence of natural numbers is used to represent its position information and passed through a linear fully-connected layer to obtain the position code; the integer 1 is used as the segmentation information and passed through a linear fully-connected layer to obtain the segment code; the entity is passed through a linear fully-connected layer to obtain the entity feature code; finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the entity.
Further, in step S3, the process of inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three kinds of Transformers and progressively extracting their mutually fused retrieval features comprises:
1): an image/title-text/entity Transformer network extracts the shallow features of the image, the title text and the entity respectively, calculating attention weights with Q and K and multiplying by V to obtain the feature representation of each modality, where the entity and text Transformers share network parameters; this layer is repeated L times and the result is passed into the next network;
2): a cross Transformer network extracts the mutual-attention features between modalities; this layer contains three independent cross multi-head self-attention networks, realized by exchanging K and V between different modalities: the image cross Transformer computes attention weights over the text to obtain cross-attended image features; the title-text cross Transformer computes attention weights over the image and the entity to obtain cross-attended text features; the entity cross Transformer computes attention weights over the text to obtain cross-attended entity features; this layer is repeated K times and the result is passed into the next network;
3): a common Transformer network extracts the comprehensively fused features of the image, the title text and the entity: in the text-vision common Transformer, the title text features and the image features are concatenated, the weight with which each vector attends to all features is calculated with Q and K and multiplied by V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the title text-entity common Transformer, the title text features and the entity features are concatenated, the weight with which each vector attends to all features is calculated with Q and K and multiplied by V to obtain the feature representations of the title text and of the entity, where Q, K and V are likewise obtained from the concatenated features of the two modalities; for the image, the title text and the entity, this layer uses a multi-head attention mechanism to compute the attention weights of each over all features, obtaining fully fused features, and the layer is iterated H times.
Further, in step S3, the process of training with four self-supervision tasks comprises:
1) by masking words in the title text, the title text sequence with masked words is input into the multi-modal pre-training model, and the model learns to recover the masked words during training, so as to extract a feature representation carrying title text information;
2) by masking words in the entity information, the entity sequence with masked words is input into the multi-modal pre-training model, and the model learns to recover the masked words during training, so as to extract a feature representation carrying entity information;
3) by masking bounding-box features in the image, the image box feature sequence with masks is input into the pre-training model, and the model learns to recover the masked bounding-box features during training, so as to extract a feature representation carrying visual information;
4) the network is trained with a contrastive-learning loss function: for paired picture-title-text pairs, their distance is reduced during training; for unpaired picture-title-text pairs, their distance is enlarged during training, so as to learn discriminative image-text features.
Further, in step S4, the process of constructing the entity graph enhancement module with the title text and the corresponding entity information during training comprises:
S41: initializing an entity queue for the title text and the entity information extracted from it, encoding the entity queue through the common Transformer to obtain a joint embedded representation of the entity features, and inputting it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph; training with the node-based arrangement loss and the subgraph-based arrangement loss, so that each entity attends more to its semantic neighbours and a good node representation is obtained;
S42: in the node-based arrangement loss, for a mini-batch of data, negative samples are selected based on cosine similarity: for each entity of the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the arrangement loss is calculated with the selected negative sample;
S43: in the subgraph-based arrangement loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbour nodes, and the arrangement loss is calculated over the nearest-neighbour positive and negative subgraphs of any node.
A combined commodity retrieval system based on a multi-modal pre-trained model, comprising:
the multi-modal pre-training model module is used for extracting image-text fused retrieval features; the model is trained with three different self-attention network layers in a self-supervised manner so that it can extract multi-modally fused features;
the entity graph reinforcement learning module is used for constructing the relations between entity nodes to support more accurate instance-level retrieval; entity information is encoded with a self-attention network layer, subgraphs are divided with a graph clustering network, and the network is trained with the node-based arrangement loss and the subgraph-based arrangement loss to obtain a good semantic relation representation;
the multi-modal commodity retrieval module is used for extracting single-commodity features and storing them in the retrieval library, and for extracting the features of each target commodity of a combined commodity and retrieving them in the retrieval library: the detector module extracts the detection features of each single commodity, which are input into the multi-modal pre-training model module to extract retrieval features; the detector module also extracts the bounding box and bounding-box features of each target commodity in the combined product, the bounding-box regions and the title are input into the multi-modal pre-training model module to extract query features, the similarity between the query features and the retrieval features is computed, and the most similar commodity is returned.
Further, the multi-modal pre-training model module comprises:
the image, title text and entity embedded representation submodule is used for encoding the information of each modality as model input: the position coordinates of each image box and its area proportion are used as the position code and, together with the segment code and the detection features of the box, passed through linear layers to obtain the embedded representation of the image; the sequence number of each word in the title text and in the entities is used as the position code and, together with the segment code, passed through a linear layer to obtain the embedded representations of the title text and of the entities; the embedded representations of the image, the title text and the entities are then passed into the multi-head self-attention network;
the multi-head self-attention network submodule is used for extracting highly fused retrieval features of the image, the title text and the entities; three different multi-head self-attention networks are used to let the information of the modalities interact and to extract fully fused features;
the image-text-entity multi-modal pre-training submodule is used for training the multi-modal model to learn discriminative features; the learning of multi-modal fusion features is completed with four self-supervision tasks, comprising an image region masking task, a title text masking task, an entity masking task and image-text cross-modal contrastive learning.
Further, the entity graph reinforcement learning module comprises:
the entity graph construction submodule is used for constructing the node graph and dividing subgraphs for the given entity information; for an initialized entity queue, the queue is encoded through the common Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relations of the entity graph;
the entity graph semantic information learning submodule is used for learning the semantic relation representation between nodes to support more accurate instance-level retrieval; training uses the node-based arrangement loss and the subgraph-based arrangement loss, so that each entity attends more to its semantic neighbours and a good node representation is obtained.
Further, the multi-modal merchandise retrieval module includes:
the single-commodity feature extraction submodule is used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract the image bounding boxes and bounding-box features, and all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, from which the image-text fused features are extracted and stored in the retrieval library;
the combined-product feature extraction submodule is used for extracting the features of each target commodity in a combined product; for each query combined commodity, the image is input into the trained detector to extract the image bounding boxes and bounding-box features, and each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity in the combined product;
the feature retrieval submodule is used for retrieving the single-commodity results of the combined product; the cosine similarity between each target commodity of the combined product and each single commodity is computed, and the closest single commodity is selected and returned as the result according to the cosine similarity.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention uses a self-supervision learning mode for training, and only depends on the naturally existing picture and title information and does not depend on any manually labeled category information; therefore, the method is easy to expand to large-scale data, learns a more judged feature representation, ensures the realization of high-quality instance-level commodity retrieval tasks and has stronger generalization; by using the information of two modes of the picture and the text and adding the entity information into the network training process, the image characteristic and the text characteristic can be effectively highly fused, the retrieval characteristic with higher discrimination is extracted, and the problem of incomplete information of a single mode is solved; the method and the device realize instance-level retrieval of the combined commodities, namely all the single commodities in the returned combined commodities are retrieved, can improve the commodity searching precision and help online users to search more accurate and specific commodities; the method can also be used for constructing an E-commerce knowledge map and mining commodity relations; the commodity relation obtained by combined commodity retrieval can be used for commodity recommendation, and the shopping platform recommendation effect is improved.
Drawings
FIG. 1 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 3 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 4 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 5 is a flowchart of a combined merchandise retrieval method based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating steps of a combined merchandise retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a network framework of a combined commodity retrieval method based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 8 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to an embodiment of the present invention;
FIG. 9 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention;
FIG. 10 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention;
fig. 11 is a diagram of an apparatus of a combined merchandise retrieval system based on a multi-modal pre-training model according to another embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1-5, a combined commodity retrieval method based on a multi-modal pre-training model includes:
s10, inputting the combined commodity image and text into a pre-trained top-down attention model, and extracting the position and characteristic information of a bounding box of all single commodities in the image; and obtaining entity information from the text through an analysis tool.
And S20, respectively extracting feature codes, position codes and segmentation codes of each mode for the image features, the text information and the entity information, and learning and embedding the representation to be used as the input of the multi-mode pre-training model.
In a specific embodiment, S20, for the image feature, the text information, and the entity information, respectively extracting feature codes, position codes, and segment codes of each modality, and learning the embedded representation as an input of the multi-modality pre-training model, includes:
S21, for the bounding boxes and features output by the target detector, calculating the position information of each box with a 5-dimensional vector, wherein the position information comprises the coordinates of the upper-left and lower-right corners of the box and the proportion of the box area to the whole image, and passing the 5-dimensional vector through a linear fully-connected layer to obtain the position code; using 0 as the segmentation information and passing it through a linear fully-connected layer to obtain the segment code; and passing the box features through a linear fully-connected layer to obtain the box feature code. Finally, the position code, the segment code and the feature code are added to obtain the embedded representation of the image modality;
s22, for the text sequence, using the increasing natural number sequence to represent the position information, and transmitting the position information into the linear full-connection layer to obtain the position code; using 1 as segmentation information to be transmitted into a linear full-connection layer to obtain segmentation codes; and transmitting the text into a linear full-connection layer to obtain the feature code of the text. And finally, adding the position codes, the segmented codes and the feature codes to obtain the embedded representation of the text.
S23, for the entity information, using the increasing natural number sequence to represent the position information, and transmitting the position information into the linear full-connection layer to obtain the position code; using 1 as segmentation information to be transmitted into a linear full-connection layer to obtain segmentation codes; and transmitting the entity into a linear full-connection layer to obtain the feature code of the entity. And finally, adding the position code, the segmented code and the feature code to obtain the embedded representation of the entity.
And S30, inputting the image embedding representation, the text embedding representation and the entity embedding representation into three transformers by the multi-mode pre-training model, and gradually extracting the mutually fused retrieval characteristics of the three.
In a specific embodiment, the S30, inputting the image embedding representation, the text embedding representation, and the entity embedding representation into three transforms by the multi-modal pre-training model, and gradually extracting the search features fused with each other, includes:
S31, the visual/text/entity Transformer networks respectively extract the shallow features of the image, the text and the entity, calculating attention weights with Q and K and multiplying by V to obtain each modality's feature representation, wherein the entity and the text Transformers share network parameters. This layer is repeated L times and the result is passed into the next network;
and S32, extracting the characteristics of mutual attention among the modalities by a cross Transformer network, wherein the layer comprises three independent cross multi-head self-attention networks and is realized by exchanging K and V in different modalities. The visual cross Transformer calculates attention weight of the text so as to obtain image characteristics after cross attention; the text cross Transformer calculates attention weights of the images and the entities so as to obtain text features after cross attention; and the entity cross transformer calculates attention weight of the text so as to obtain the cross attention entity characteristics. The layer is repeatedly carried out for K times and then is transmitted into the next network;
and S33, extracting the characteristics of the comprehensive fusion of the image, the text and the entity through a public Transformer network, splicing the characteristics of the text and the characteristics of the image in a text-vision public Transformer, calculating the weight of all characteristics of each vector attention by using Q and K, and multiplying the weight by V to obtain the characteristic representation of the text and the characteristic representation of the image, wherein Q, K, V is obtained by the characteristics after the two modalities are spliced. In the text-entity common Transformer, text features and entity features are spliced, the weight of all the features of interest of each vector is calculated by using Q and K, and then V is multiplied to obtain feature representation of the text and feature representation of the entity, wherein Q, K, V is obtained by the features after splicing of two modalities. For images, texts and entities, the layer respectively calculates attention weights of all the characteristics of the images, the texts and the entities by using a multi-head attention mechanism so as to obtain fully fused characteristics. This layer is iterated H times.
And S40, performing network training by adopting four self-supervision tasks based on the loss function of image-text contrast learning through the multi-mode pre-training model.
In a specific embodiment, in S40 the multi-modal pre-training model performs network training with four self-supervision tasks based on a loss function of image-text contrastive learning, including:
s41, by covering the words in the title text, inputting the text sequence with the covered words into a multi-mode pre-training model, and learning and recovering the covered words in the training process of the model so as to extract a feature representation with text information;
s42, inputting the entity sequence with the masked words into a multi-mode pre-training model by masking the words in the entity information, and learning and recovering the masked words in the training process of the model so as to extract a feature representation with the entity information;
s43, inputting the image frame feature sequence with the mask into a pre-training model by masking the boundary frame features in the image, and learning and recovering the masked boundary frame features in the training process of the model so as to extract a feature representation with visual information;
s44, training the network by using the loss function of contrast learning: for the paired picture and text pairs, shortening the distance of the paired picture and text pairs in the training process; and for the unpaired image text pair, the distance is enlarged in the training process, so that the image text characteristic with discrimination is learned.
S50, constructing an entity graph enhancement module, learning entity knowledge with real semantic information through arrangement loss based on nodes and arrangement loss based on subgraphs, and enhancing feature representation.
In a specific embodiment, S50, constructing an entity graph enhancing module, and learning entity knowledge with true semantic information and enhancing feature representation through node-based arrangement loss and subgraph-based arrangement loss, includes:
and S51, initializing an entity queue for the text titles and the entity information extracted from the text titles, coding the entity queue through a public Transformer to obtain a joint embedded representation of the entity characteristics, and inputting the entity queue into a pre-trained AdaGAE network for graph clustering to obtain the semantic relation of the entity graph. Training is carried out by using the arrangement loss based on the nodes and the arrangement loss based on the subgraph, so that the entity can pay more attention to the adjacent semanteme, and good node representation is obtained.
S52, in the node-based arrangement loss, for a mini-batch of data, negative samples are selected based on cosine similarity. For each data entity of the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample. The selected negative samples are used to calculate the arrangement loss.
S53, in the arrangement loss based on the subgraph, the embedded representation of any subgraph feature is obtained by the global pooling calculation of k nearest neighbor nodes. Permutation penalties are computed for the nearest neighbor positive and negative subgraphs of any node.
And S60, inputting the image information and the text information of each single sample into the multi-modal model to extract retrieval characteristics, and storing the retrieval characteristics in a retrieval library.
S70, inputting image features, text information and entity information into the trained multi-mode model for each combined commodity, and extracting retrieval features of image-text fusion; and calculating the cosine similarity of the fusion feature and each single product in the sample library, and selecting the single product with the highest similarity as a retrieval result to return.
Example 2
As shown in fig. 6, a combined commodity retrieval method based on a multi-modal pre-training model includes the following steps and details:
step 1, inputting the combined commodity image and text into a pre-trained top-down attention model, and extracting the bounding box positions B of all single commodities in the image (B) 0 ,b 1 ,b 2 ,…,b K ) And the characteristic information F ═ F 0 ,f 1 ,f 2 ,…,f K ) (ii) a Title text T ═ T 0 ,t 1 ,t 2 ,…,t L ) Obtaining the entity information E ═ (E) by the analysis tool 0 ,e 1 ,e 2 ,…,e H )。
Step 2, for the image features, the text information and the entity information, the feature codes, position codes and segment codes of each modality are extracted respectively, and the embedded representations are learned as the input of the multi-modal pre-training model.
Using the bounding-box positions and features extracted by the detector network as the image feature input I = ((b_0, f_0), (b_1, f_1), (b_2, f_2), ..., (b_K, f_K)), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the image embedded representation E_img.

Using the product title T = (t_0, t_1, t_2, ..., t_L), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the text embedded representation E_txt.

Using the entity information E = (e_0, e_1, e_2, ..., e_H), the feature codes, position codes and segment codes are extracted through an embedding layer and added to obtain the entity embedded representation E_ent.
Specifically, as shown in fig. 7, the bounding-box features F are passed through a fully-connected layer to obtain the feature coding vector F_enc:

F_enc = σ(w_1·F + b_1)

where w_1 and b_1 are parameters of the fully-connected layer and σ is the activation function.

For the bounding boxes extracted by the commodity detector, the area ratio of each box to the whole picture is calculated to construct a 5-dimensional vector P (the upper-left and lower-right corner coordinates of the box plus its area ratio), which is passed through a fully-connected layer to output the position coding vector P_enc:

P_enc = σ(w_2·P + b_2)

where w_2 and b_2 are parameters of the fully-connected layer and σ is the activation function.

Using the integer 0 as the segmentation information S_img of the image modality, the segment coding vector S_enc is obtained through a fully-connected layer:

S_enc = σ(w_3·S_img + b_3)

where w_3 and b_3 are parameters of the fully-connected layer and σ is the activation function.

The feature coding vector, the position coding vector and the segment coding vector are added to obtain the embedded representation E_img of the image modality:

E_img = F_enc + P_enc + S_enc
Specifically, as shown in fig. 7, the title text T is passed through the embedding layer to obtain the feature coding vector T_enc:

T_enc = σ(w_4·T + b_4)

where w_4 and b_4 are parameters of the fully-connected layer and σ is the activation function.

The position information P of the words in the title (an increasing sequence of natural numbers) is passed through a fully-connected layer to obtain the position coding vector P_enc:

P_enc = σ(w_5·P + b_5)

where w_5 and b_5 are parameters of the fully-connected layer and σ is the activation function.

Using the integer 1 as the segmentation information S_txt of the text modality, the segment coding vector S_enc is obtained through a fully-connected layer:

S_enc = σ(w_6·S_txt + b_6)

where w_6 and b_6 are parameters of the fully-connected layer and σ is the activation function.

The feature coding vector, the position coding vector and the segment coding vector are added to obtain the embedded representation E_txt of the text modality:

E_txt = T_enc + P_enc + S_enc
The embedded representation of the entity is the same as the embedded representation of the text.
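A minimal sketch of the embedding computation above, assuming PyTorch and illustrative layer sizes (the hidden dimension, activation choice and class name are not taken from the invention):

```python
import torch
import torch.nn as nn

class ImageEmbedding(nn.Module):
    """Feature code + position code + segment code for the image modality (illustrative)."""
    def __init__(self, feat_dim=2048, hidden=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)  # w_1, b_1
        self.pos_fc = nn.Linear(5, hidden)          # w_2, b_2 (5-dim box vector)
        self.seg_fc = nn.Linear(1, hidden)          # w_3, b_3
        self.act = nn.ReLU()                        # sigma: activation (ReLU assumed)

    def forward(self, box_feats, boxes, img_wh):
        # boxes: (K, 4) as (x1, y1, x2, y2); img_wh: (width, height) of the whole image
        x1, y1, x2, y2 = boxes.unbind(-1)
        area_ratio = (x2 - x1) * (y2 - y1) / (img_wh[0] * img_wh[1])
        pos = torch.stack([x1, y1, x2, y2, area_ratio], dim=-1)   # 5-dimensional vector P
        seg = torch.zeros(boxes.size(0), 1)                       # segment id 0 for images
        f_enc = self.act(self.feat_fc(box_feats))
        p_enc = self.act(self.pos_fc(pos))
        s_enc = self.act(self.seg_fc(seg))
        return f_enc + p_enc + s_enc                              # E_img
```

The title text and entity embeddings are built the same way, with word indices as position information and the integer 1 as the segment id.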
Step 3, the image embedded representation E_img, the text embedded representation E_txt and the entity embedded representation E_ent are input into the three kinds of Transformers, the mutually fused image-text retrieval features H are extracted, and model training is performed with four self-supervision tasks.
Specifically, as shown in fig. 7, the picture embedding feature E_img, the text embedding feature E_txt and the entity embedding feature E_ent are first encoded by a picture Transformer, a text Transformer and an entity Transformer respectively, giving the individual picture, text and entity features H_img, H_txt and H_ent. The picture, text and entity Transformers each have four layers, and each layer is computed as:

H'^(t) = LN(H^(t-1) + MSA(H^(t-1)))
H^(t) = LN(H'^(t) + MLP(H'^(t)))

where H stands for H_img, H_txt or H_ent, t-1 and t are Transformer layer indices, LN is a LayerNorm layer performing feature normalization, MLP is a fully-connected layer and MSA is the multi-head attention layer, computed as:

Head_i = Attention(H·W_i^Q, H·W_i^K, H·W_i^V) = softmax(Q·K^T/√d)·V
MSA(H) = Concat(Head_1, ..., Head_h)·W^O
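For concreteness, a compact sketch of one such Transformer layer is given below; the hidden size, head count and norm placement are assumptions, and the code only illustrates the LN + MSA + MLP structure of the formulas above:

```python
import torch.nn as nn

class ModalityTransformerLayer(nn.Module):
    """One layer of the picture/text/entity Transformer: H' = LN(H + MSA(H)), H = LN(H' + MLP(H'))."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.msa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, h):
        attn_out, _ = self.msa(h, h, h)     # Q, K and V all come from the same modality
        h = self.ln1(h + attn_out)          # H'^(t) = LN(H^(t-1) + MSA(H^(t-1)))
        return self.ln2(h + self.mlp(h))    # H^(t)  = LN(H'^(t) + MLP(H'^(t)))
```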
After the individual picture, text and entity features H_img, H_txt and H_ent are obtained, they are passed into the cross Transformer so that the picture information and the text information attend to each other and the entity information and the text information attend to each other, giving the cross-attended features of the modalities. Each cross Transformer layer has the same residual structure as above, with MSA replaced by the cross-modal attention CMSA:

H'_img = LN(H_img + CMSA(H_img, H_txt)),  H_img = LN(H'_img + MLP(H'_img))
H'_txt = LN(H_txt + CMSA(H_txt, H_img, H_ent)),  H_txt = LN(H'_txt + MLP(H'_txt))
H'_ent = LN(H_ent + CMSA(H_ent, H_txt)),  H_ent = LN(H'_ent + MLP(H'_ent))

where CMSA is the cross-modal cross multi-head attention network, computed as:

CMSA(H_img, H_txt) = Concat(Head_1(H_img, H_txt), ..., Head_n(H_img, H_txt))
CMSA(H_txt, H_img, H_ent) = Concat(Head_1(H_txt, H_img), ..., Head_n(H_txt, H_img), Head_1(H_txt, H_ent), ..., Head_n(H_txt, H_ent))
CMSA(H_ent, H_txt) = Concat(Head_1(H_ent, H_txt), ..., Head_n(H_ent, H_txt))

and each cross-attention head takes its queries from the first modality and its keys and values from the second, i.e. Head_i(H_a, H_b) = Attention(H_a·W_i^Q, H_b·W_i^K, H_b·W_i^V).
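The distinctive point of the cross Transformer is that queries come from one modality while keys and values come from another. A minimal sketch under the same assumptions as above (only the two-modality case is shown):

```python
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Cross multi-head attention: queries from modality a, keys/values from modality b."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.cmsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, h_a, h_b):
        # e.g. h_a = image features, h_b = text features: the image attends to the text
        attn_out, _ = self.cmsa(query=h_a, key=h_b, value=h_b)
        h = self.ln1(h_a + attn_out)
        return self.ln2(h + self.mlp(h))
```

For the title text, whose CMSA attends to both the image and the entities, the corresponding cross-attention heads would be concatenated as in CMSA(H_txt, H_img, H_ent) above.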
The picture/text/entity features output by the cross Transformer are then passed into the common Transformer for more comprehensive feature fusion.

The text-vision common Transformer lets each region of the image modality attend to the features of the other regions and of all words, and lets each word of the text modality attend to the features of the other words and of all image regions. It applies the multi-head self-attention layer described above to the concatenated features, with Q, K and V all computed from the concatenation:

H_tv = LN([H_txt; H_img] + MSA([H_txt; H_img])),  [H_txt; H_img] = LN(H_tv + MLP(H_tv))

after which the result is split back into the text feature representation and the image feature representation.

The text-entity common Transformer lets each word in the entity attend to the features of the other entity words and of all text words, and lets each word of the text modality attend to the features of the other text words and of all entities; it is computed in the same way on the concatenation [H_txt; H_ent].
four pre-training tasks are used to train the model structure described above, including a text masking task, an entity masking task, an image area masking task, and a cross-modal contrast learning task.
Specifically, for each picture-text pair I = {I_1, I_2, I_3, ..., I_K}, T = {T_1, T_2, T_3, ..., T_L} and the extracted entities E = (E_0, E_1, E_2, ..., E_H):

The text masking task replaces each input text word with "[MASK]" with a probability of 15%, and the model predicts the masked words from the remaining words and the image; the corresponding loss is the negative log-likelihood of the masked text words given the remaining words and image regions.

The entity masking task replaces each input entity word with "[MASK]" with a probability of 15%, and the model predicts the masked words from the remaining words and the image; the corresponding loss is the negative log-likelihood of the masked entity words.

The image region masking task replaces each input image box feature with a zero vector with a probability of 15%, and the model predicts the masked image region features from the remaining image regions and the sentence words; the corresponding loss measures how well the masked region features are recovered.
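A small sketch of the 15% masking used by the text and entity masking tasks; the tokenizer, the plain-string tokens and the target convention are simplifications, not the invention's actual data pipeline:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Replace each token with [MASK] with probability 15%; return masked sequence and targets."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)     # the model must recover this word from context
        else:
            masked.append(tok)
            targets.append(None)    # no supervision at this position
    return masked, targets

# Example: mask_tokens(["wooden", "dining", "table"]) might return
# (["wooden", "[MASK]", "table"], [None, "dining", None]).
# The image region masking task is analogous, zeroing out 15% of the box features.
```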
In addition to learning the features of each modality, the model needs to guarantee consistency between the different input modalities in order to learn the correspondence between pictures and text, so a cross-modal contrastive learning task is used to align the picture modality and the text modality. For the N picture-text pairs of a training batch there are 2N data items; for each sample, the corresponding data of the other modality is treated as its positive pair and the remaining samples are treated as negative pairs. For an input picture-text pair (I_i, T_i), let (z_i^img, z_i^txt) be the image-text feature pair output by the image Transformer and the text Transformer; the loss function is:

L_cl = -log( exp(sim(z_i^img, z_i^txt)/τ) / Σ_j 1_[i≠j] exp(sim(z_i^img, z_j^txt)/τ) )

where sim(u, v) computes the similarity between the image-text pair u and v, τ is a temperature parameter, and 1_[i≠j] is a binary indicator that returns 1 if and only if i ≠ j. The contrastive loss pulls paired image-text vectors closer together and pushes unpaired image-text vectors apart.
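A sketch of this cross-modal contrastive loss in PyTorch; the cross-entropy form below keeps the positive pair in the denominator and adds the symmetric image-to-text direction, which are common simplifications rather than the exact formulation above:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_img, z_txt, tau=0.07):
    """InfoNCE-style loss: matched image/text rows are positives, all other rows negatives."""
    z_img = F.normalize(z_img, dim=-1)      # (N, D) image features
    z_txt = F.normalize(z_txt, dim=-1)      # (N, D) text features
    sim = z_img @ z_txt.t() / tau           # sim(u, v) / tau for every image-text pair
    labels = torch.arange(z_img.size(0))    # the i-th text matches the i-th image
    # pull paired vectors together, push unpaired ones apart (both retrieval directions)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
```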
Step 4, during training, an entity graph enhancement module is constructed using the title text and the corresponding entity information; entity knowledge with real semantic information is learned through the node-based arrangement loss and the subgraph-based arrangement loss, enhancing the feature representation.
Specifically, an entity queue l is initialized for the text title and the entity information extracted from it; the queue is encoded through the common Transformer to obtain the joint embedded representation l_f of the entity features, which is then input into a pre-trained AdaGAE network for graph clustering to obtain the embedded representation l_g of the entity graph. Training uses the node-based arrangement loss and the subgraph-based arrangement loss:

In the node-based arrangement loss, negative samples for a mini-batch of data are selected based on cosine similarity: for any entity h_ei of the batch, one sample h_ek is randomly chosen from the k entity samples with the lowest similarity as its negative sample, and the arrangement loss is computed with the selected negative sample, drawing h_ei towards its semantic neighbours and away from h_ek.

In the subgraph-based arrangement loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbour nodes; for any entity h_ei, the arrangement loss is computed over its nearest-neighbour positive subgraph h_gi and negative subgraph h_gk.
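The exact form of the arrangement losses is not reproduced here, so the following node-based sketch is only an assumption: the hardest-negative selection by cosine similarity follows the description above, while the margin (triplet-style) formulation and the way positive neighbours are supplied are illustrative choices.

```python
import torch
import torch.nn.functional as F

def node_arrangement_loss(entity_feats, neighbor_feats, margin=0.2, k=5):
    """Assumed margin loss: pull each entity towards a semantic neighbour,
    push it away from one of its k least-similar entities in the mini-batch."""
    e = F.normalize(entity_feats, dim=-1)    # (B, D) entity features h_e
    n = F.normalize(neighbor_feats, dim=-1)  # (B, D) positive neighbours (e.g. from l_g)
    sim = e @ e.t()                          # cosine similarity within the batch
    # for each entity, randomly pick a negative among its k lowest-similarity entities
    lowest = sim.topk(k, dim=-1, largest=False).indices       # (B, k)
    pick = torch.randint(0, k, (e.size(0), 1))
    neg = e[lowest.gather(1, pick).squeeze(1)]                 # (B, D) negatives h_ek
    pos_sim = (e * n).sum(-1)
    neg_sim = (e * neg).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```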
and 5, extracting a detection frame and a text word for each single sample in the same processing mode as training, inputting the detection frame and the text word into the multi-mode model, extracting image-text fusion characteristics, and storing the characteristics of each single sample in a search library.
And 6, for each combined commodity, extracting a boundary frame and boundary frame characteristics of each target commodity in the picture by using a trained detector, inputting the boundary frame and the boundary frame characteristics into a trained multi-mode model by combining text information, extracting retrieval characteristics of image-text fusion, calculating a Cosine similarity distance by using the characteristics and each single-commodity characteristic in a retrieval library, and sequencing according to the similarity, thereby obtaining a retrieval single-commodity sample which is the most matched as a final result.
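Finally, a sketch of the cosine-similarity retrieval of steps 5 and 6; the in-memory gallery is an assumption, and a production system would typically replace it with an approximate nearest-neighbour index:

```python
import torch
import torch.nn.functional as F

def retrieve_single_items(query_feats, gallery_feats, gallery_ids, top_k=1):
    """query_feats: (M, D) fused features, one per target commodity in the combined image.
    gallery_feats: (G, D) fused features of all single-commodity samples in the library."""
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sims = q @ g.t()                            # cosine similarity, shape (M, G)
    best = sims.topk(top_k, dim=-1).indices     # most similar single commodities per target
    return [[gallery_ids[j] for j in row.tolist()] for row in best]
```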
Example 3
As shown in fig. 8-11, a combined commodity retrieval system based on multi-modal pre-training models comprises:
the multi-mode pre-training model module 10 is used for extracting retrieval characteristics of image-text fusion, and training a model by using three different self-attention network layers and adopting a self-supervision learning mode so that the model can extract the characteristics of multi-mode fusion;
and the entity graph reinforcement learning module 20 is used for constructing the relationship between the entity nodes and helping more accurate instance-level retrieval. Coding entity information by using a self-attention network layer, and dividing subgraphs by using a graph clustering network; and carrying out network training based on the arrangement loss of the nodes and the arrangement loss of the subgraph to obtain good semantic relation representation.
The multi-mode commodity retrieval module 30 is used for extracting the features of the single commodities and storing the features in a retrieval library, extracting the features of each target commodity of the combined commodity and retrieving the features in the retrieval library, extracting the detection features of the single commodities by using the detector module, and inputting the detection features into the multi-mode pre-training model module to extract the retrieval features; and extracting the boundary box and the boundary box characteristics of each target commodity in the combined product by using a detector module, inputting the boundary box area and the title into a multi-mode pre-training model module to extract query characteristics, calculating the similarity between the query characteristics and the retrieval characteristics, and returning the most similar commodity.
In a specific embodiment, the multi-modal pre-training model module 10 further includes:
the image, text and entity embedded representation submodule 11 is used for coding information of an image modality to input a model, taking position coordinates of an image frame and area proportion of the image frame as position codes, and respectively transmitting the position coordinates and the area proportion of the image frame into a linear layer to obtain embedded representation of the image by combining detection characteristics of the segmented codes and the frame; the sequence number of each word in the text and the entity is used as position coding, the sequence number is transmitted into a linear layer by combining with sectional coding to obtain the embedded representation of the text and the entity, and the embedded representation of the image, the text and the entity is transmitted into a multi-head self-attention network;
the multi-head self-attention network sub-module 12 is used for extracting retrieval characteristics of high fusion of images, texts and entities, carrying out interaction among a plurality of modal information by using three different multi-head self-attention networks and extracting fully fused characteristics;
and the image-text-entity multi-mode pre-training sub-module 13 is used for training a multi-mode model to learn the characteristics with the discrimination, and completing the learning of multi-mode fusion characteristics by using four self-supervision tasks, including an image area covering task, a text covering task, an entity covering task and image-text cross-mode comparison learning.
In a specific embodiment, the entity graph enhancement learning module 20 further comprises:
the entity graph construction submodule 21, used for constructing a node graph and partitioning subgraphs for given entity information; the initialized entity queue is encoded by a shared Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph.
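Conceptually, this submodule encodes the entity queue with a Transformer encoder and then groups the resulting embeddings into subgraphs. In the sketch below, ordinary k-means clustering stands in for the pre-trained AdaGAE graph-clustering network purely for illustration; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

hidden, n_subgraphs = 768, 8
encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
entity_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)   # stands in for the shared Transformer

entity_queue = torch.randn(1, 200, hidden)   # (1, num_entities, hidden): toy entity embeddings
with torch.no_grad():
    entity_repr = entity_encoder(entity_queue).squeeze(0)   # joint embedded representation of the entities

# group entities into semantic subgraphs (k-means used here instead of AdaGAE)
subgraph_labels = KMeans(n_clusters=n_subgraphs, n_init=10).fit_predict(entity_repr.numpy())
```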
the entity graph semantic information learning submodule 22, used for learning semantic relationship representations among nodes to support more accurate instance-level retrieval; training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors, so that good node representations are obtained.
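A margin-based ranking loss of the kind referred to here can be sketched as follows (a hedged illustration: the margin value and the exact positive/negative sampling are assumptions; the same form applies at the subgraph level by replacing node embeddings with pooled subgraph embeddings).

```python
import torch
import torch.nn.functional as F

def node_ranking_loss(anchor: torch.Tensor,
                      positive: torch.Tensor,
                      negative: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """anchor/positive/negative: (B, D) entity-node embeddings.
    Positives are semantic neighbours in the entity graph; negatives are sampled
    from low-similarity entities in the mini-batch."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()   # penalise negatives that rank too close
```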
In a specific embodiment, the multi-modal commodity retrieval module 30 further comprises:
the single-commodity feature extraction submodule 31, used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, and the extracted image-text fused features are stored in the retrieval library;
the combined-commodity feature extraction submodule 32, used for extracting the features of each target commodity in a combined commodity; for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity;
the feature retrieval submodule 33, used for retrieving single-commodity results for the combined commodity; the cosine similarity between each target commodity of the combined commodity and each single commodity is computed, and the closest single commodity is selected and returned as the result.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and should not be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A combined commodity retrieval method based on a multi-modal pre-training model, characterized by comprising the following steps:
S1: for a combined commodity image and its title text, inputting the image into a pre-trained top-down attention model and extracting the bounding box positions and feature information of all single commodities in the image; obtaining entity information from the title text through a parsing tool;
S2: respectively extracting feature encodings, position encodings and segment encodings from the image features, the title text information and the entity information, and learning embedded representations as the input of the multi-modal pre-training model;
S3: inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three Transformers, gradually extracting mutually fused retrieval features of the three, and training with four self-supervision tasks;
S4: during training, constructing an entity graph enhancement module with the title text and the corresponding entity information, learning entity knowledge carrying real semantic information through a node-based ranking loss and a subgraph-based ranking loss, and enhancing the feature representation;
S5: inputting the image information and title text information of each single-commodity sample into the multi-modal model to extract retrieval features, and storing the retrieval features in a retrieval library;
S6: for each combined commodity, inputting the image features, the title text information and the entity information into the pre-trained multi-modal model to extract image-text fused retrieval features; computing the cosine similarity between the fused features and each single commodity in the sample library, and returning the single commodity with the highest similarity as the retrieval result.
2. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 1, wherein in step S1:
the process of extracting the bounding box positions and feature information of all single commodities in the image with the top-down attention model comprises:
using a bottom-up attention model pre-trained on the VG (Visual Genome) data set as the target detector, inputting the combined commodity image, and extracting the bounding box position information and bounding box features of each single commodity as the image-feature input of the combined commodity;
the process of obtaining entity information from the title text through the parsing tool comprises:
extracting a set of noun entities from the title text with an NLP parsing tool, and using the set as the entity information input.
3. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 2, wherein the specific process of step S2 is:
S21: for the bounding box positions and feature information, a 5-dimensional vector encodes the position information of each box, namely the coordinates of its upper-left and lower-right corners and the proportion of the whole image occupied by the box; the 5-dimensional vector is passed through a linear fully connected layer to obtain the position encoding; 0 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the box features are passed through a linear fully connected layer to obtain the feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the image modality;
S22: for the title text information, an incrementing sequence of natural numbers represents its position information and is passed through a linear fully connected layer to obtain the position encoding; 1 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the title text is passed through a linear fully connected layer to obtain the text feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the title text;
S23: for the entity information, an incrementing sequence of natural numbers represents its position information and is passed through a linear fully connected layer to obtain the position encoding; 1 is used as the segment information and passed through a linear fully connected layer to obtain the segment encoding; the entities are passed through a linear fully connected layer to obtain the entity feature encoding; finally, the position encoding, the segment encoding and the feature encoding are added to obtain the embedded representation of the entities.
4. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 3, wherein in step S3 the process of inputting the image embedded representation, the title text embedded representation and the entity embedded representation into three Transformers and gradually extracting their mutually fused retrieval features comprises:
1): the image/title-text/entity Transformer networks respectively extract shallow features of the image, the title text and the entities; the attention weights are computed from Q and K and multiplied by V to obtain the feature representation of each modality, the Transformers of the entities and the text share network parameters, and this layer is repeated L times before being passed to the next network;
2): the cross-Transformer network extracts mutual attention features among the modalities; this layer comprises three independent cross multi-head self-attention networks and is realized by exchanging K and V between different modalities; the image cross-Transformer computes attention weights over the text to obtain the cross-attended image features; the title-text cross-Transformer computes attention weights over the image and the entities to obtain the cross-attended text features; the entity cross-Transformer computes attention weights over the text to obtain the cross-attended entity features; this layer is repeated K times before being passed to the next network;
3): the shared Transformer network extracts comprehensively fused features of the image, the title text and the entities; in the text-vision shared Transformer, the title text features and the image features are concatenated, Q and K are used to compute the weight with which each vector attends to all features, and the result is multiplied by V to obtain the feature representations of the text and of the image, where Q, K and V are obtained from the concatenated features of the two modalities; in the title-text-entity shared Transformer, the title text features and the entity features are concatenated, Q and K are used to compute the weight with which each vector attends to all features, and the result is multiplied by V to obtain the feature representations of the title text and of the entities, where Q, K and V are obtained from the concatenated features of the two modalities; for the image, the title text and the entities, this layer uses a multi-head attention mechanism to compute attention weights over all of their features, thereby obtaining fully fused features, and the layer is iterated H times.
5. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 4, wherein in step S3 the training process with four self-supervision tasks comprises:
1) masking words in the title text and inputting the masked title text sequence into the multi-modal pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying title text information;
2) masking words in the entity information and inputting the masked entity sequence into the multi-modal pre-training model; during training the model learns to recover the masked words, thereby extracting a feature representation carrying entity information;
3) masking bounding-box features in the image and inputting the masked image box-feature sequence into the pre-training model; during training the model learns to recover the masked bounding-box features, thereby extracting a feature representation carrying visual information;
4) training the network with a contrastive-learning loss function: for paired pictures and title texts, their distance is shortened during training; for unpaired picture-title-text pairs, their distance is enlarged during training, so that discriminative image-text features are learned.
6. The combined commodity retrieval method based on a multi-modal pre-training model according to claim 5, wherein in step S4 the process of constructing the entity graph enhancement module with the title text and the corresponding entity information during training comprises:
S41: initializing an entity queue for the title text and the entity information extracted from it, encoding the entity queue with a shared Transformer to obtain a joint embedded representation of the entity features, and inputting it into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph; training with the node-based ranking loss and the subgraph-based ranking loss so that each entity attends more to its semantic neighbors and good node representations are obtained;
S42: in the node-based ranking loss, for a mini-batch of data, negative samples are selected based on cosine similarity: for each entity in the batch, one of the k entity samples with the lowest similarity is randomly selected as the negative sample, and the ranking loss is computed with the selected negative sample;
S43: in the subgraph-based ranking loss, the embedded representation of any subgraph feature is obtained by global pooling over its k nearest-neighbor nodes, and the ranking loss is computed over the nearest positive and negative subgraphs of each node.
7. A combined commodity retrieval system based on a multi-modal pre-training model, characterized by comprising:
a multi-modal pre-training model module, used for extracting image-text fused retrieval features; the model is trained with three different self-attention network layers in a self-supervised learning manner, so that it can extract multi-modal fused features;
an entity graph enhancement learning module, used for constructing relationships among entity nodes to support more accurate instance-level retrieval; entity information is encoded with a self-attention network layer, and subgraphs are partitioned with a graph clustering network; the network is trained with a node-based ranking loss and a subgraph-based ranking loss to obtain a good semantic relationship representation;
a multi-modal commodity retrieval module, used for extracting single-commodity features and storing them in a retrieval library, and for extracting the features of each target commodity of a combined commodity and retrieving them against the library; the detector module extracts detection features of each single commodity, which are input into the multi-modal pre-training model module to extract retrieval features; the detector module also extracts the bounding box and bounding-box features of each target commodity in the combined commodity, the bounding-box regions and the title are input into the multi-modal pre-training model module to extract query features, the similarity between the query features and the retrieval features is computed, and the most similar commodity is returned.
8. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 7, wherein the multi-modal pre-training model module comprises:
an image, title-text and entity embedding representation submodule, used for encoding image-modality information as model input: the position coordinates of each image box and its area ratio serve as the position encoding and, together with the segment encoding and the detection features of the box, are each passed through a linear layer and combined to obtain the image embedding; the index of each word in the title text and in the entities serves as the position encoding and, combined with the segment encoding, is passed through a linear layer to obtain the title-text and entity embeddings; the image, title-text and entity embeddings are fed into the multi-head self-attention network;
a multi-head self-attention network submodule, used for extracting highly fused retrieval features of the image, the title text and the entities; three different multi-head self-attention networks carry out interaction among the modalities and extract fully fused features;
an image-text-entity multi-modal pre-training submodule, used for training the multi-modal model to learn discriminative features; learning of the multi-modal fused features is completed with four self-supervision tasks, namely an image region masking task, a title-text masking task, an entity masking task and image-text cross-modal contrastive learning.
9. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 8, wherein the entity graph enhancement learning module comprises:
an entity graph construction submodule, used for constructing a node graph and partitioning subgraphs for given entity information; the initialized entity queue is encoded by a shared Transformer to obtain a joint embedded representation of the entity features, which is input into a pre-trained AdaGAE network for graph clustering to obtain the semantic relationships of the entity graph;
an entity graph semantic information learning submodule, used for learning semantic relationship representations among nodes to support more accurate instance-level retrieval; training with the node-based ranking loss and the subgraph-based ranking loss makes each entity attend more to its semantic neighbors, yielding good node representations.
10. The combined commodity retrieval system based on a multi-modal pre-training model according to claim 9, wherein the multi-modal commodity retrieval module comprises:
a single-commodity feature extraction submodule, used for extracting single-commodity features and storing them in the retrieval library; the image of each single-commodity sample is input into the trained detector to extract image bounding boxes and bounding-box features; all bounding boxes and bounding-box features, combined with the text, serve as the input of the multi-modal pre-training model, and the extracted image-text fused features are stored in the retrieval library;
a combined-commodity feature extraction submodule, used for extracting the features of each target commodity in a combined commodity; for each query combined commodity, the image is input into the trained detector to extract image bounding boxes and bounding-box features; each bounding box and its features, combined with the title, are input into the multi-modal model module to extract the image-text fused features of each target commodity;
a feature retrieval submodule, used for retrieving single-commodity results for the combined commodity; the cosine similarity between each target commodity of the combined commodity and each single commodity is computed, and the closest single commodity is selected and returned as the result.
CN202210453799.4A 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model Active CN114840705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210453799.4A CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210453799.4A CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Publications (2)

Publication Number Publication Date
CN114840705A true CN114840705A (en) 2022-08-02
CN114840705B CN114840705B (en) 2024-04-19

Family

ID=82568012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210453799.4A Active CN114840705B (en) 2022-04-27 2022-04-27 Combined commodity retrieval method and system based on multi-mode pre-training model

Country Status (1)

Country Link
CN (1) CN114840705B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CA3167569A1 (en) * 2020-04-15 2021-10-21 Maksymilian Clark Polaczuk Systems and methods for determining entity attribute representations
CN112860930A (en) * 2021-02-10 2021-05-28 浙江大学 Text-to-commodity image retrieval method based on hierarchical similarity learning
CN114036246A (en) * 2021-12-06 2022-02-11 国能(北京)商务网络有限公司 Commodity map vectorization method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAO JINGTIAN; WANG YONGLI; SHI QIUYAN; DONG ZHENJIANG: "A Recommendation Algorithm Framework Based on Joint Item Collocation Degree", Journal of University of Shanghai for Science and Technology, no. 01, 15 February 2017 (2017-02-15) *
DENG YIJIAO; ZHANG FENGLI; CHEN XUEQIN; AI QING; YU SU?: "A Collaborative Attention Network Model for Cross-Modal Retrieval", Computer Science, no. 04, 31 December 2020 (2020-12-31) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080766A (en) * 2022-08-16 2022-09-20 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115080766B (en) * 2022-08-16 2022-12-06 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115964560A (en) * 2022-12-07 2023-04-14 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model
CN116383491A (en) * 2023-03-21 2023-07-04 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product
CN116383491B (en) * 2023-03-21 2024-05-24 北京百度网讯科技有限公司 Information recommendation method, apparatus, device, storage medium, and program product
CN116051132A (en) * 2023-04-03 2023-05-02 之江实验室 Illegal commodity identification method and device, computer equipment and storage medium
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117035074B (en) * 2023-10-08 2024-02-13 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117708354A (en) * 2024-02-06 2024-03-15 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium
CN117708354B (en) * 2024-02-06 2024-04-30 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114840705B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN114840705B (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN114445201A (en) Combined commodity retrieval method and system based on multi-mode pre-training model
Guo et al. Small object sensitive segmentation of urban street scene with spatial adjacency between object classes
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN111651974A (en) Implicit discourse relation analysis method and system
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN115994990A (en) Three-dimensional model automatic modeling method based on text information guidance
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN112036189A (en) Method and system for recognizing gold semantic
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
Du et al. From plane to hierarchy: Deformable transformer for remote sensing image captioning
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN116823321B (en) Method and system for analyzing economic management data of electric business
CN116522942A (en) Chinese nested named entity recognition method based on character pairs
Wang et al. Inductive zero-shot image annotation via embedding graph
CN114898192A (en) Model training method, prediction method, device, storage medium, and program product
CN113763084A (en) Product recommendation processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Jingyi

Inventor after: Dong Xiao

Inventor after: Zhan Xunlin

Inventor after: Liang Xiaodan

Inventor before: Zhan Xunlin

Inventor before: Wu Yangxin

Inventor before: Dong Xiao

Inventor before: Liang Xiaodan

GR01 Patent grant
GR01 Patent grant