CN115018010A - Multi-mode commodity matching method based on images and texts - Google Patents
- Publication number
- CN115018010A (application number CN202210809470.7A)
- Authority
- CN
- China
- Prior art keywords
- features
- image
- text
- similarity
- modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0623—Item investigation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
Abstract
A multi-modal commodity matching method based on images and texts finds matching commodities by using the image information of the commodity cover and the text information in the commodity title. The method comprises the following steps: first, a metric learning method is used so that the network learns discriminative features; second, commodity features are extracted by an image network and a text network respectively; third, cosine distances between sample features are calculated from three perspectives (image, text and multi-modal), and the matching results are re-ranked by a query expansion method; finally, dynamic thresholds are set to fuse the multi-modal results, and samples meeting the threshold condition are added to the final matching result. The neural network structure and the post-processing method effectively address the problems of too few matches and mismatches in a single modality. The commodity recall rate is remarkably improved while the matching accuracy is maintained.
Description
Technical Field
The invention relates to a matching method, in particular to a multi-mode commodity matching method based on images and texts, and belongs to the field of deep learning and neural network methods.
Background
Consumers often compare the prices of the same item across different vendors while shopping, hoping to buy the item they want at the lowest cost. Retailers likewise want to show consumers the most cost-effective comparison for their products in order to stay competitive, which requires associating the product a consumer queries with other similar products in the library. Commodity matching methods are commonly used for this task: given a commodity, the goal is to find the samples in a commodity library that belong to the same commodity.
Some research has been carried out on commodity matching methods, but the following problems remain: 1) the same commodity may have very different image features due to factors such as color, size, shooting angle or label noise; 2) the same commodity may have very different text descriptions due to factors such as language, commodity specification and differences in emphasis; 3) it is difficult to balance precision and recall using single-modal commodity information alone; 4) the actual application scenario is an open-set task and may contain commodities that never appear in the training set.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal commodity matching method based on images and texts, which uses both the images of commodities and the corresponding text descriptions on websites and extracts multi-modal features from complementary perspectives to enhance the matching effect. Training uses a metric learning method, and a reliable post-processing flow is designed for inference. In this way, the features of the different modalities are considered together and the query vector is further enhanced, which improves the matching effect.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a multi-mode commodity matching method based on images and texts specifically comprises the following steps:
step 1: in the training stage, preprocessing the image and text information of the commodity as the input of an image model and a text model;
step 2: designing neural network models for processing images and texts respectively, extracting image and text features, and carrying out classification training with a metric learning method so that the networks learn more discriminative features, which facilitates similarity calculation during subsequent inference; ArcFace is used as the loss function, the number N of commodity categories to be matched is set as the output dimension of the weight W, and an additive angular margin penalty m is used for optimization; with s the radius of the hypersphere, $\theta_j$ the angle between the weight vector of the j-th class and the input vector, and $y_i$ the true category, the loss is defined as
$$L = -\log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \neq y_i}^{N} e^{s\cos\theta_j}}$$
step 3: using the trained image model to extract image features, and calculating, through KNN (K nearest neighbors), the cosine similarity between the features of each image sample and the features of all other image samples to obtain the top N samples ranked by image similarity; the similarity between two sample features $f_i$ and $f_j$ is defined as
$$\mathrm{sim}(f_i, f_j) = \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert}$$
step 4: using the trained Bert model and TFIDF, weighting and concatenating the features of the two, extracting the normalized text features, and calculating the feature similarity between text samples by using KNN to obtain the top N samples ranked by text similarity;
step 5: weighting and concatenating the obtained image features and text features to form multi-modal features that fuse image and text information, and calculating the similarity among the multi-modal features by using KNN again to obtain the top N samples ranked by multi-modal similarity;
step 6: using the top P ranked samples (P < N) respectively, performing query expansion on the image, text and multi-modal matching results, i.e., using the similarities of the top P samples as weights and taking the weighted sum of the sample's neighborhood features as a new query vector; with f(q) the query vector, $f_q(top_i)$ the i-th feature closest to query q, α a weight hyperparameter, and sim(·,·) the cosine similarity, query expansion yields the new query vector
$$f'(q) = \sum_{i=1}^{P} \mathrm{sim}\big(f(q), f_q(top_i)\big)^{\alpha} \, f_q(top_i)$$
step 7: for the image, text and multi-modal features after query expansion, calculating again by KNN the similarity between each sample and all other samples; this step can be repeated several times, and P is reduced as the number of repetitions increases;
step 8: setting dynamic thresholds for the image, text and multi-modal features respectively to obtain the matching results of each under its threshold, and finally considering the three sets of results together and recalling the samples that meet the thresholds; with k the minimum required number of matches and stride the step by which each threshold changes, the thresholds are widened step by step until at least k matches are found.
As a further improvement of the present invention, in step 1, the commodity to be matched is cropped by a target detection algorithm to eliminate the influence of irrelevant background on subsequent matching, the cropped image is scaled to 512 × 512 px, and finally data enhancement is performed.
As a further improvement of the invention, in step 2, the image model is composed of EfficientNet networks and a normalizer-free network (NFNet) with an ECA attention mechanism, and the text model is composed of Sentence-Bert and TFIDF; training uses ArcFace as the loss function, samples belonging to the same commodity are regarded as the same class, and classification training is carried out on the training set.
As a further improvement of the present invention, in step 3, image features are extracted by using the image models trained in step 2, the features of the several image models are normalized and then concatenated to serve as the integrated image features, and the K nearest neighbor method is then applied to the integrated features to calculate cosine distances and obtain the top N features ranked by image similarity.
As a further improvement of the present invention, in step 4, text features are extracted by using the Bert model trained in step 2 and the TFIDF model, dimension reduction is performed on the TFIDF features, the features of the two are normalized and concatenated to serve as the integrated text features, and the K nearest neighbor method is then applied to the integrated features to calculate cosine distances and obtain the top N features ranked by text similarity.
As a further improvement of the present invention, in step 5, the image and text features obtained in steps 3 and 4 are concatenated, and the K nearest neighbor method is used again to calculate the cosine distances of the concatenated features, so as to obtain the top N features ranked by multi-modal similarity.
As a further improvement of the present invention, in step 6, for each commodity, the top N image, text and multi-modal matching samples obtained in steps 3, 4 and 5 are used respectively, and the P samples with the highest similarity are selected for query expansion, i.e., the similarities of the top P matching samples are used as weights and the features of the top P samples are weighted and summed to form a new query vector.
As a further improvement of the present invention, in step 7, the new weighted image, text and multi-modal query vectors obtained in step 6 are used, the cosine similarity is calculated again by the K nearest neighbor method, and the top N samples ranked by the updated similarity are retained.
As a further improvement of the present invention, in step 8, for each commodity, the top N image, text and multi-modal matching results finally obtained in step 7 are taken, dynamic thresholds are set for the three according to the magnitude of similarity and changed synchronously, and the samples whose distances are below the respective thresholds of the three are merged and added to the final result; if the number of matches after merging is larger than k, the threshold loop is exited, otherwise the thresholds continue to be widened.
The invention has the following beneficial effects: 1. by jointly considering the images of commodities, their texts and the multi-modal information fused from the two, the multi-modal commodity matching method based on images and texts obtains a more reliable matching result than a single modality, so that identical commodities are grouped into one class; 2. for actual use scenarios, the invention provides a more discriminative training method that ensures sufficiently distinguishable features are obtained; 3. the invention also integrates query expansion and a dynamic threshold strategy, which raises the recall rate while preserving accuracy. Finally, the matching results are further screened to ensure the consistency of the predictions.
Drawings
FIG. 1 is a schematic flow diagram of the invention as a whole;
FIG. 2 is a network architecture diagram of the image and text training portion of the present invention;
FIG. 3 is a network architecture diagram of the present invention for extracting image and text features;
FIG. 4 is a schematic flow diagram of a feature post-processing portion of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
Example 1: as depicted in fig. 1, a multi-modal commodity matching method based on images and texts comprises the following steps:
Step 1: data input stage.
Step 2: model building stage, including the image and text models.
Step 3: model training stage.
Step 4: feature extraction stage.
Step 5: feature post-processing stage.
Step 6: result post-processing stage.
In the data input stage, the input data is preprocessed. First, the image information and text information in the library are separated, and samples belonging to the same class are labeled. Then, the image is cropped by a target detection algorithm and scaled to a specified size, and finally the corresponding data enhancement is carried out. Meanwhile, the text data is translated into a common language.
In the model construction stage, a deep neural network and a TFIDF model are constructed according to task construction requirements. The image part comprises EfficientNet-B3, EfficientNet-B5 and ECA-NFNet-L0, and the text part comprises a Bert model pre-trained on a large text corpus and a TFIDF model based on word frequency-inverse document frequency.
In the model training stage, the ArcFace is used as a loss function, and classification training is respectively carried out on the CNN model and the Bert model on the well-constructed training set.
In the characteristic extraction stage, the model trained in the model training stage and the TFIDF are used for extracting distinguishing image and text characteristics, and the image and text characteristics are spliced to obtain multi-modal characteristics.
In the feature post-processing stage, query expansion is applied to the obtained image, text and multi-modal features respectively to enrich the information carried by each query vector, and KNN is then used to obtain the matching samples.
In the result post-processing stage, the results of the different modalities are fused by a dynamic threshold method. The fused predictions are then screened so that the predictions for different samples are consistent.
The details of each stage are described below.
(1) Data input stage. To meet the requirements of model training, the data in the data set must be preprocessed. First, samples of the same commodity in the data set are grouped to generate class labels. Then, for the image data, the model should focus on matching the commodity itself rather than the background, so the coordinates of the commodity in the image are obtained with the target detection algorithm YOLO, and the region of the commodity body is cropped according to these coordinates. One image may contain several detection results, in which case the minimum and maximum x and y coordinates over all detection results are used as the crop boundary. Suppose n targets are detected, and $x_{i,1}$ and $x_{i,2}$ are the left and right abscissas of the i-th target; the horizontal boundaries $x_1$ and $x_2$ are then
$$x_1 = \min(\{x_{1,1}, x_{2,1}, \ldots, x_{n,1}\})$$
$$x_2 = \max(\{x_{1,2}, x_{2,2}, \ldots, x_{n,2}\})$$
The cropped image is then scaled to 512 × 512 px, and data enhancement is performed using random horizontal flipping, random vertical flipping, random brightness variation and the like. Finally, the image data is normalized. For text data, since several languages may be present, the text is first translated into English. This makes better use of a Bert model pre-trained on English data, simplifies the subsequent matching process, and avoids the special case of texts with the same meaning written in different languages.
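The crop-boundary rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `union_crop_box` and the `(x_left, y_top, x_right, y_bottom)` box layout are assumptions, and the detections are made-up values standing in for YOLO output.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_left, y_top, x_right, y_bottom) in pixels.
Box = Tuple[float, float, float, float]


def union_crop_box(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every detection."""
    if not boxes:
        raise ValueError("at least one detection is required")
    x1 = min(b[0] for b in boxes)  # x_1 = min over all left abscissas
    y1 = min(b[1] for b in boxes)  # same rule applied to the y axis
    x2 = max(b[2] for b in boxes)  # x_2 = max over all right abscissas
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)


# Two made-up detections in one product image.
detections = [(40, 60, 200, 220), (150, 30, 310, 180)]
crop = union_crop_box(detections)
```

The resulting box would then be used to crop the image before it is scaled to 512 × 512 px.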
(2) Model building stage. The models fall into two categories, image models and text models, as shown in fig. 1. For image input, the invention adopts three models: EfficientNet-B3, EfficientNet-B5 and ECA-NFNet-L0, with the output dimension of the FC layer set to 512 in each. EfficientNet is an efficient network obtained through neural architecture search; it achieves better performance with less computation and suits the application scenario of the invention. ECA-NFNet-L0 integrates an efficient channel attention (ECA) mechanism into the normalizer-free network NFNet and replaces the original SiLU activation function with Mish to improve the performance of the model. With input x and output f(x), Mish is expressed as
$$f(x) = x \cdot \tanh\left(\log\left(1 + e^{x}\right)\right)$$
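As a quick check of the Mish expression, the sketch below evaluates it for scalar inputs. The helper names are illustrative, and the numerically stable softplus form is a choice made here rather than something the text specifies.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(log(1 + e^x))."""
    return x * math.tanh(softplus(x))


values = [mish(v) for v in (-30.0, 0.0, 1.0, 30.0)]
```

For large positive inputs Mish behaves like the identity, and for large negative inputs it approaches zero.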
For text input, the invention adopts two models, Sentence-Bert and TFIDF. For Bert, the [CLS] token in the last-layer output is used for subsequent processing, and the output dimension is 768. TFIDF extracts text features by counting word frequencies and weighting each word by its importance in the text; PCA reduces the high-dimensional TFIDF features to a fixed 768 dimensions. Let N be the total number of occurrences of all words in a commodity description, n the number of occurrences of a given word in that description, B the total number of text descriptions in the whole library, and b the number of text descriptions in which the word occurs; the TFIDF weight of the word is then the product of its term frequency n/N and an inverse document frequency that grows with B/b.
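The TFIDF weighting can be illustrated with a small sketch. The exact formula is not legible in the source, so the unsmoothed form (n / N) · ln(B / b) used here is an assumption, as are the function name `tfidf` and the toy titles; the PCA step is omitted.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF weights, assuming tf = n / N and idf = ln(B / b)."""
    B = len(docs)                      # total number of descriptions
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))      # b: descriptions containing the word
    weights = []
    for doc in docs:
        counts = Counter(doc)          # n: occurrences of each word
        N = len(doc)                   # N: all word occurrences in this description
        weights.append(
            {w: (n / N) * math.log(B / doc_freq[w]) for w, n in counts.items()}
        )
    return weights


# Toy commodity titles, already tokenized.
titles = [
    ["red", "cotton", "shirt"],
    ["blue", "cotton", "shirt"],
    ["red", "phone", "case"],
]
title_weights = tfidf(titles)
```

Words that appear in few titles, such as "phone", receive larger weights than words shared by many titles, such as "cotton".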
(3) Model training stage, as shown in fig. 2. The invention converts commodity matching into a metric learning problem: training aims to increase intra-class similarity and decrease inter-class similarity so that sufficiently discriminative features are learned. As a result, the matching effect is maintained even when commodities unseen in the training set appear in practical use. For the image model, Global Average Pooling (GAP) is applied after the last convolution layer, followed by Dropout and an FC layer that fixes the output dimension at 512; BatchNorm then normalizes the result. Finally, the model is trained for classification with the ArcFace loss using the Ranger optimizer.
And for the text model Bert, extracting 768-dimensional features output by the CLS in the last layer, and performing classification training by adopting ArcFace loss on the basis of different commodity categories.
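To make the effect of the additive angular margin concrete, the sketch below computes an ArcFace-style loss for a single sample from its cosines to each class weight. The values s = 30 and m = 0.5 are common defaults assumed here, not values given in the text, and the sketch ignores the case where θ + m exceeds π, which full implementations handle separately.

```python
import math
from typing import List


def arcface_loss(cosines: List[float], target: int,
                 s: float = 30.0, m: float = 0.5) -> float:
    """ArcFace loss for one sample; cosines[j] = cos(theta_j) for class j."""
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))          # guard acos against rounding
        if j == target:
            theta = math.acos(c)
            logits.append(s * math.cos(theta + m))  # margin on the true class
        else:
            logits.append(s * c)
    # Stable log-sum-exp, then cross entropy for the true class.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]                  # made-up cosines for 3 classes
with_margin = arcface_loss(cosines, target=0)
without_margin = arcface_loss(cosines, target=0, m=0.0)
```

With the margin, the same sample incurs a larger loss, which pushes the network to pull features closer to their class center than plain softmax would require.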
(4) Feature extraction stage, as shown in fig. 3. The three image models trained in stage (3) are used with every layer except the ArcFace layer retained, and each extracts a 512-dimensional feature. The three features are L2-normalized and concatenated with weights to obtain a 1536-dimensional image feature, which realizes the ensemble of image models. For the text part, the Bert model trained in stage (3) extracts a 768-dimensional feature and the TFIDF model extracts a feature of the same 768 dimensions; the two text features are weighted and concatenated to give the final 1536-dimensional text feature.
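The normalize-weight-concatenate step can be sketched as below. The equal weights and the synthetic 512-dimensional vectors are assumptions for illustration; the text does not state the weights used.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(features: List[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """L2-normalize each feature, scale it by its weight, and concatenate."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature vector is required")
    fused: List[float] = []
    for feat, w in zip(features, weights):
        fused.extend(w * x for x in l2_normalize(feat))
    return fused


# Three synthetic 512-d vectors standing in for the three image models.
image_feats = [[float((i * k) % 7 + 1) for i in range(512)] for k in (1, 2, 3)]
image_feature = fuse(image_feats, weights=[1.0, 1.0, 1.0])  # assumed equal weights
```

The same function would cover the text branch (768 + 768 dimensions) and the multi-modal concatenation of the image and text features.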
(5) Feature post-processing stage, as shown in fig. 4. The image and text features extracted in stage (4) are concatenated to obtain 3072-dimensional multi-modal features, which supplement the single-modal information with a complementary perspective. Query expansion is then applied to the image, text and multi-modal features respectively. Specifically, KNN is used to calculate the cosine distances between the current sample and the other features of the same modality, and the N most similar features are obtained. The P most similar features are selected from these N features, the α-th power of each one's cosine similarity is used as its weight, and the weighted sum of the P features becomes the new feature of the current sample. This feature takes the neighboring features into account and effectively improves the recall rate. The process is repeated m times; as it repeats, neighboring features become more and more similar, so P is gradually reduced to avoid a drop in accuracy.
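One round of the query expansion described above can be sketched as follows, using a brute-force neighbor search in place of a KNN library. The function names, the toy 2-d features, the value α = 3, and the clipping of negative similarities to zero are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def expand_queries(features: List[Sequence[float]], p: int,
                   alpha: float) -> List[List[float]]:
    """Replace each feature by the similarity^alpha weighted sum of its top-p neighbors."""
    feats = [normalize(f) for f in features]
    expanded = []
    for q in feats:
        # Brute-force cosine similarity to every sample (the sample itself included).
        ranked = sorted(((dot(q, f), i) for i, f in enumerate(feats)), reverse=True)
        new_q = [0.0] * len(q)
        for sim, i in ranked[:p]:
            w = max(sim, 0.0) ** alpha      # assumed: negative similarities get zero weight
            for d, x in enumerate(feats[i]):
                new_q[d] += w * x
        expanded.append(normalize(new_q))
    return expanded


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy features
expanded = expand_queries(features, p=2, alpha=3.0)
```

Calling `expand_queries` again on its own output with a smaller `p` would correspond to the repeated rounds with shrinking P.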
(6) Result post-processing stage. KNN is used again to calculate the cosine similarity of the processed image, text and multi-modal features obtained in stage (5), giving the N most similar features for each, sorted by cosine distance in ascending order. The invention then screens the three sets of results with a dynamic threshold method. Specifically, instead of using a single threshold for each sample, the threshold is looped from small to large, starting at the lower threshold L, ending at the upper threshold H, and advancing by the step S. The specific parameter settings depend on the discriminative power of the model. If, at some threshold, the number of matches found for the sample is greater than or equal to k (i.e., at least k matches are found), the loop exits. To further balance precision and recall, each iteration of the threshold loop collects the matching labels of the image, text and multi-modal features that meet their respective thresholds, and the union of the three sets of labels is taken as the final result.
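The threshold loop can be sketched as below. The data layout (per-modality lists of `(sample_id, cosine_distance)`), the function name, and all numeric values are hypothetical; the sketch assumes the three thresholds advance by the same stride and stop at their upper bounds.

```python
from typing import Dict, List, Set, Tuple

Candidates = List[Tuple[str, float]]   # (sample_id, cosine distance), hypothetical layout


def dynamic_threshold_match(neighbours: Dict[str, Candidates],
                            bounds: Dict[str, Tuple[float, float]],
                            stride: float, k: int) -> Set[str]:
    """Widen all thresholds in lockstep until the merged matches reach k."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    step = 0
    while True:
        matched: Set[str] = set()
        saturated = True
        for modality, candidates in neighbours.items():
            low, high = bounds[modality]
            threshold = min(low + step * stride, high)
            if threshold < high:
                saturated = False
            # Keep candidates whose distance is below this modality's threshold.
            matched.update(sid for sid, dist in candidates if dist < threshold)
        if len(matched) >= k or saturated:
            return matched
        step += 1


neighbours = {
    "image": [("a", 0.10), ("b", 0.32), ("c", 0.55)],
    "text": [("a", 0.05), ("d", 0.28), ("e", 0.70)],
    "multimodal": [("a", 0.08), ("b", 0.31), ("d", 0.33)],
}
bounds = {m: (0.2, 0.6) for m in neighbours}   # assumed L = 0.2, H = 0.6
result = dynamic_threshold_match(neighbours, bounds, stride=0.05, k=3)
```

In this toy run the loop stops once the thresholds reach about 0.35, where the union of the three modalities first contains three matches.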
Finally, the matching results obtained above are further screened. If the first n predicted results of two samples are the same, the predictions of the two samples are merged and used as the final result for both. In addition, to further increase the matching accuracy, when sample A matches sample B but sample B does not match sample A, sample B is deleted from the matching result of sample A, which keeps the predictions consistent.
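The mutual-consistency rule in this paragraph can be sketched as a small filter over the predicted match sets. The dictionary layout and function name are illustrative assumptions, and the top-n merging rule is not covered here.

```python
from typing import Dict, Set


def keep_mutual_matches(matches: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop B from A's matches whenever A is missing from B's matches."""
    return {
        a: {b for b in found if b == a or a in matches.get(b, set())}
        for a, found in matches.items()
    }


# Made-up predictions: A lists C, but C does not list A.
predicted = {
    "A": {"A", "B", "C"},
    "B": {"B", "A"},
    "C": {"C"},
}
consistent = keep_mutual_matches(predicted)
```

After filtering, every remaining match is reciprocal, so A and B keep each other while C is removed from A's result.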
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.
Claims (9)
1. A multi-mode commodity matching method based on images and texts is characterized by comprising the following steps:
step 1: in the training stage, preprocessing the image and text information of the commodity as the input of an image model and a text model;
step 2: designing neural network models for processing images and texts respectively, extracting image and text features, and carrying out classification training with a metric learning method so that the networks learn more discriminative features, which facilitates similarity calculation during subsequent inference; ArcFace is used as the loss function, the number N of commodity categories to be matched is set as the output dimension of the weight W, and an additive angular margin penalty m is used for optimization; with s the radius of the hypersphere, $\theta_j$ the angle between the weight vector of the j-th class and the input vector, and $y_i$ the true category, the loss is defined as
$$L = -\log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \neq y_i}^{N} e^{s\cos\theta_j}}$$
step 3: using the trained image model to extract image features, and calculating, through KNN (K nearest neighbors), the cosine similarity between the features of each image sample and the features of all other image samples to obtain the top N samples ranked by image similarity; the similarity between two sample features $f_i$ and $f_j$ is defined as
$$\mathrm{sim}(f_i, f_j) = \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert}$$
step 4: using the trained Bert model and TFIDF, weighting and concatenating the features of the two, extracting the normalized text features, and calculating the feature similarity between text samples by using KNN to obtain the top N samples ranked by text similarity;
step 5: weighting and concatenating the obtained image features and text features to form multi-modal features that fuse image and text information, and calculating the similarity among the multi-modal features by using KNN again to obtain the top N samples ranked by multi-modal similarity;
step 6: using the top P ranked samples (P < N) respectively, performing query expansion on the image, text and multi-modal matching results, i.e., using the similarities of the top P samples as weights and taking the weighted sum of the sample's neighborhood features as a new query vector; with f(q) the query vector, $f_q(top_i)$ the i-th feature closest to query q, α a weight hyperparameter, and sim(·,·) the cosine similarity, query expansion yields the new query vector
$$f'(q) = \sum_{i=1}^{P} \mathrm{sim}\big(f(q), f_q(top_i)\big)^{\alpha} \, f_q(top_i)$$
step 7: for the image, text and multi-modal features after query expansion, calculating again by KNN the similarity between each sample and all other samples; this step can be repeated several times, and P is reduced as the number of repetitions increases;
step 8: setting dynamic thresholds for the image, text and multi-modal features respectively to obtain the matching results of each under its threshold, and finally considering the three sets of results together and recalling the samples that meet the thresholds; with k the minimum required number of matches and stride the step by which each threshold changes, the thresholds are widened step by step until at least k matches are found.
2. The multi-modal image and text-based product matching method according to claim 1, wherein in step 1, the product to be matched is cut by a target detection algorithm, the influence of an irrelevant background on subsequent matching is eliminated, the cut image is scaled to 512 x 512px, and finally data enhancement processing is performed.
3. The multi-modal commodity matching method based on images and texts as claimed in claim 1, wherein in step 2, the image model is composed of EfficientNet networks and a normalizer-free network (NFNet) with an ECA attention mechanism, the text model is composed of Sentence-Bert and TFIDF, training uses ArcFace as the loss function, samples belonging to the same commodity are regarded as the same category, and classification training is performed on the training set.
4. The multi-mode commodity matching method based on the images and the texts as claimed in claim 1, wherein in the step 3, the image models trained in the step 2 are used to extract image features, the features of the image models are normalized and then spliced to be used as integrated image features, and then a K nearest neighbor method is used for the integrated features to calculate cosine distance, so as to obtain the features of N before the arrangement of the image similarity.
5. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 4, text features are extracted using the BERT model and the TF-IDF model trained in step 2, dimension reduction is performed on the TF-IDF features, the features of the two models are normalized and then concatenated to form the integrated text feature, and the K-nearest-neighbor method is then applied to the integrated features with cosine distance to obtain the top N features ranked by text similarity.
6. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 5, the image and text features obtained in step 3 and step 4 are concatenated, and the K-nearest-neighbor method is applied again to the concatenated features with cosine distance to obtain the top N features ranked by multi-modal similarity.
7. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 6, for each commodity, the top-N image, text and multi-modal matching samples obtained in step 3, step 4 and step 5 are used respectively, and the P samples with the highest similarity are selected for query expansion; that is, the similarities of the top-P matching samples are used as weights, and the features of the top-P samples are weighted and summed to form a new query vector.
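The weighted query expansion of claim 7 can be sketched as below. The function name, the toy vectors and the final re-normalization are assumptions; the claim itself only states that the top-P features are weighted by their similarity and summed.

```python
import math

def expand_query(neighbor_features, similarities, p):
    """Build a new query vector from the p most similar matches.

    neighbor_features: feature vectors of the top-N matches for one commodity.
    similarities:      cosine similarity of each match to the original query.
    p:                 number of matches folded into the new query (P <= N).
    """
    ranked = sorted(zip(similarities, neighbor_features),
                    key=lambda pair: pair[0], reverse=True)[:p]
    dim = len(neighbor_features[0])
    query = [0.0] * dim
    for weight, feature in ranked:
        for d in range(dim):
            query[d] += weight * feature[d]   # similarity-weighted sum
    # Re-normalizing is an assumption so that cosine scores stay comparable
    norm = math.sqrt(sum(x * x for x in query))
    return [x / norm for x in query] if norm > 0 else query

# Toy example: three matches for one commodity, keep the two most similar
matches = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
sims = [0.9, 0.2, 0.7]
new_query = expand_query(matches, sims, p=2)
```

Here the match with similarity 0.2 is dropped, and the new query leans toward the strongest match. Whether the original sample counts as one of its own top-P matches is not stated in the claim; if it does, it can be passed in with similarity 1.0.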
8. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 7, the weighted new image, text and multi-modal query vectors obtained in step 6 are used to calculate the cosine similarity again by the K-nearest-neighbor method, and the top N samples ranked by the updated similarity are retained; this step can be iterated multiple times, and the size of P is reduced during the iteration.
9. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 8, for each commodity, the top-N image, text and multi-modal matching results finally obtained in step 7 are taken, dynamic thresholds are set for the three respectively according to similarity, and the dynamic thresholds are changed synchronously; samples below the respective thresholds of the three are merged and added to the final result; if the number of matches after merging is greater than k, the threshold loop is exited, otherwise the thresholds continue to be relaxed.
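The threshold loop of claim 9 can be sketched as follows, assuming that "below the threshold" refers to cosine distance and that "merged" means a set union across the three modalities. The starting thresholds, the `max_steps` safety cap and the toy candidates are assumptions; the exit condition follows the wording "greater than k".

```python
def dynamic_threshold_match(candidates, start, stride, k, max_steps=20):
    """Union per-modality matches, widening all thresholds until more than k are found.

    candidates: {"image": [(sample_id, distance), ...], "text": [...], "multimodal": [...]}
    start:      initial cosine-distance threshold per modality (assumed values).
    stride:     amount added to every threshold on each pass.
    k:          minimum required number of matches.
    max_steps:  safety cap on the loop, an assumption not stated in the claims.
    """
    thresholds = dict(start)
    matched = set()
    for _ in range(max_steps):
        matched = set()
        for modality, pairs in candidates.items():
            limit = thresholds[modality]
            matched.update(sid for sid, dist in pairs if dist < limit)
        if len(matched) > k:
            break                      # enough matches: leave the threshold loop
        for modality in thresholds:
            thresholds[modality] += stride   # relax all thresholds in step
    return matched, thresholds

# Toy top-N candidates for one query commodity
candidates = {
    "image": [("b", 0.10), ("c", 0.35)],
    "text": [("b", 0.20), ("d", 0.45)],
    "multimodal": [("c", 0.32), ("e", 0.80)],
}
start = {"image": 0.2, "text": 0.2, "multimodal": 0.2}
matched, final_thresholds = dynamic_threshold_match(candidates, start, stride=0.1, k=2)
```

With these toy values the thresholds are relaxed from 0.2 to about 0.5 before more than two distinct samples are recalled, and the distant candidate `"e"` is never admitted.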
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210809470.7A CN115018010A (en) | 2022-07-11 | 2022-07-11 | Multi-mode commodity matching method based on images and texts |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115018010A (en) | 2022-09-06 |
Family
ID=83080316
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210809470.7A (Pending) CN115018010A (en) | 2022-07-11 | 2022-07-11 | Multi-mode commodity matching method based on images and texts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115018010A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113537305A (en) * | 2021-06-29 | 2021-10-22 | 复旦大学 | Image classification method based on matching network less-sample learning |
WO2021227091A1 (en) * | 2020-05-15 | 2021-11-18 | 南京智谷人工智能研究院有限公司 | Multi-modal classification method based on graph convolutional neural network |
CN114298159A (en) * | 2021-12-06 | 2022-04-08 | 湖南工业大学 | Image similarity detection method based on text fusion under label-free sample |
CN114445201A (en) * | 2022-02-16 | 2022-05-06 | 中山大学 | Combined commodity retrieval method and system based on multi-mode pre-training model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110309331B (en) | Cross-modal deep hash retrieval method based on self-supervision | |
CN110334705B (en) | Language identification method of scene text image combining global and local information | |
Zhang et al. | Small sample image recognition using improved Convolutional Neural Network | |
CN106649561B (en) | Intelligent question-answering system for tax consultation service | |
Pouyanfar et al. | Automatic video event detection for imbalance data using enhanced ensemble deep learning | |
CN108595636A (en) | The image search method of cartographical sketching based on depth cross-module state correlation study | |
CN107683469A (en) | A kind of product classification method and device based on deep learning | |
Dumont et al. | Fast multi-class image annotation with random subwindows and multiple output randomized trees | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
CN107239565A (en) | A kind of image search method based on salient region | |
CN105184298A (en) | Image classification method through fast and locality-constrained low-rank coding process | |
CN111159485A (en) | Tail entity linking method, device, server and storage medium | |
Lin et al. | Effective feature space reduction with imbalanced data for semantic concept detection | |
CN111931953A (en) | Multi-scale characteristic depth forest identification method for waste mobile phones | |
CN112069307B (en) | Legal provision quotation information extraction system | |
Yang et al. | ConvPatchTrans: A script identification network with global and local semantics deeply integrated | |
CN111723287B (en) | Content and service recommendation method and system based on large-scale machine learning | |
KR20200071865A (en) | Image object detection system and method based on reduced dimensional | |
CN112988970A (en) | Text matching algorithm serving intelligent question-answering system | |
CN114140657A (en) | Image retrieval method based on multi-feature fusion | |
Gao et al. | An improved XGBoost based on weighted column subsampling for object classification | |
Singh et al. | A deep learning approach for human face sentiment classification | |
Foumani et al. | A probabilistic topic model using deep visual word representation for simultaneous image classification and annotation | |
CN111061939A (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN115018010A (en) | Multi-mode commodity matching method based on images and texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||