CN115018010A

CN115018010A - Multi-mode commodity matching method based on images and texts

Info

Publication number: CN115018010A
Application number: CN202210809470.7A
Authority: CN
Inventors: 李佳汶; 孙长银; 王腾; 王远大
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2022-07-11
Filing date: 2022-07-11
Publication date: 2022-09-06

Abstract

A multi-modal commodity matching method based on images and texts finds matching commodities by using the image information of a commodity's cover and the text information in its title. The method comprises the following steps: first, a metric learning method is used so that the networks learn discriminative features; second, commodity features are extracted by an image network and a text network respectively; third, cosine distances between sample features are calculated from three perspectives (image, text and multi-modal), and the matching results are re-ranked by query expansion; finally, dynamic thresholds are set to fuse the multi-modal results, and samples satisfying the threshold conditions are added to the final matching result. The neural network structure and the post-processing method effectively address the missed matches and mismatches that arise with a single modality, and significantly improve the commodity recall rate while maintaining matching accuracy.

Description

Multi-mode commodity matching method based on images and texts

Technical Field

The invention relates to a matching method, in particular to a multi-mode commodity matching method based on images and texts, and belongs to the field of deep learning and neural network methods.

Background

When shopping, consumers often compare the prices of the same item across different vendors and want to purchase their desired items at the lowest cost. Retailers likewise want to present the most cost-effective comparison report for their products in order to improve their competitiveness, which requires that the product a consumer queries be associated with other similar products in the library. Commodity matching methods are commonly used for this task: given a commodity, the goal is to find the samples in a commodity library that belong to the same commodity.

To date, some research has been carried out on commodity matching methods. However, the following problems remain: 1) the same commodity may have very different image features because of factors such as color, size, shooting angle or label noise; 2) the same commodity may have very different text descriptions because of factors such as language, commodity specifications and differences in emphasis; 3) it is difficult to balance precision and recall using single-modal commodity information alone; 4) the practical application scenario is an open-set task and may contain commodities that never appear in the training set.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a multi-modal commodity matching method based on images and texts. The method uses both the images of commodities and the corresponding text descriptions on websites, extracting multi-modal features from complementary perspectives to enhance the matching effect. Training uses a metric learning method, and a reliable post-processing flow is designed for inference. In this way, the features of the different modalities are considered jointly, the query vector is further enhanced, and the matching effect is improved.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A multi-modal commodity matching method based on images and texts comprises the following steps:

Step 1: in the training stage, the image and text information of each commodity is preprocessed to serve as the input of the image model and the text model;

Step 2: neural network models are designed separately for images and for texts to extract image and text features. The models are trained as classifiers with a metric learning method so that the networks learn more discriminative features, which facilitates similarity calculation during subsequent inference. ArcFace is used as the loss function: the number N of commodity categories to be matched is the output dimension of the weight W, an additive angular margin penalty m is used for optimization, s is the radius of the hypersphere, $\theta_j$ is the angle between the weight vector of the j-th class and the input vector, and $y_i$ is the true category. The loss is defined as

Step 3: image features are extracted with the trained image model, and KNN (K nearest neighbors) is used to calculate the cosine similarity between the features of each image sample and those of all other image samples, yielding the top N samples ranked by image similarity. The cosine similarity between two sample features $f_i$ and $f_j$ is defined as

$$\mathrm{sim}(f_i, f_j) = \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert}$$

Step 4: text features are extracted with the trained Bert model and TFIDF; the features of the two are weighted and concatenated, the resulting text features are normalized, and KNN is used to calculate the feature similarity between text samples, yielding the top N samples ranked by text similarity;

Step 5: the obtained image features and text features are weighted and concatenated to form multi-modal features that fuse image and text information, and KNN is used again to calculate the similarity between the multi-modal features, yielding the top N samples ranked by multi-modal similarity;

Step 6: query expansion is performed on the image, text and multi-modal matching results using the top P ranked samples of each, where P < N; that is, the similarities of the top P samples are used as weights, and the neighborhood features of the sample are weighted and summed to form a new query vector. Let f(q) be the query vector, $f_q(top_i)$ be the i-th feature closest to query q, and $\alpha$ be a weight hyperparameter; query expansion is then implemented as

Step 7: for the image, text and multi-modal features after query expansion, KNN is used again to calculate the similarity between each sample and all other samples for each of the three feature types; this step can be repeated multiple times, with P reduced as the repetitions proceed;

Step 8: dynamic thresholds are set separately for the image, text and multi-modal features to obtain the matching results of each under its threshold; the three sets of results are then considered jointly, and the samples satisfying the thresholds are recalled. With k the minimum required number of matches and stride the step by which each threshold changes, the process is as follows

As a further improvement of the present invention, in step 1, the commodity to be matched is cropped out by a target detection algorithm to eliminate the influence of irrelevant background on subsequent matching, the cropped image is scaled to 512 × 512 px, and finally data augmentation is performed.

As a further improvement of the invention, in step 2, the image model consists of EfficientNet and a normalizer-free network (NFNet) with the ECA attention mechanism, and the text model consists of Sentence-Bert and TFIDF. Training uses ArcFace as the loss function; samples belonging to the same commodity are regarded as the same class, and classification training is carried out on the training set.

As a further improvement of the present invention, in step 3, image features are extracted with the image models trained in step 2, the features of the several image models are normalized and then concatenated to serve as the ensemble image feature, and the K nearest neighbor method is applied to the ensemble features to calculate cosine distances, yielding the top N samples ranked by image similarity.

As a further improvement of the present invention, in step 4, text features are extracted with the Bert model trained in step 2 and the TFIDF model, dimension reduction is applied to the TFIDF features, the features of the two are normalized and then concatenated to serve as the ensemble text feature, and the K nearest neighbor method is applied to the ensemble features to calculate cosine distances, yielding the top N samples ranked by text similarity.

As a further improvement of the present invention, in step 5, the image and text features obtained in steps 3 and 4 are concatenated, and the K nearest neighbor method is used again to calculate the cosine distances of the concatenated features, yielding the top N samples ranked by multi-modal similarity.

As a further improvement of the present invention, in step 6, for each commodity, the P most similar samples are selected from each of the top-N image, text and multi-modal matching samples obtained in steps 3, 4 and 5 and used for query expansion; that is, the similarities of the top P matching samples are used as weights, and the features of the top P samples are weighted and summed to form a new query vector.

As a further improvement of the present invention, in step 7, the new weighted image, text and multi-modal query vectors obtained in step 6 are used to calculate cosine similarity again by the K nearest neighbor method, and the top N samples ranked by the recalculated similarity are retained.

As a further improvement of the present invention, in step 8, for each commodity, the top-N image, text and multi-modal matching results finally obtained in step 7 are taken, and dynamic thresholds are set for the three according to the magnitude of similarity and changed synchronously. The samples falling below the respective thresholds of the three are merged and added to the final result; if the number of matches after merging is larger than k, the threshold loop exits, otherwise the thresholds continue to be widened.

The beneficial effects of the invention are as follows: 1. By jointly considering the images and texts of commodities together with the multi-modal information that fuses the two, the method obtains more reliable matching results than a single modality, so that identical commodities are grouped into one class. 2. For practical usage scenarios, the invention provides a more discriminative training method that ensures sufficiently distinguishable features are obtained. 3. The invention further integrates query expansion and a dynamic threshold strategy, improving the recall rate while maintaining accuracy. Finally, the matching results are screened further to ensure the consistency of the predictions.

Drawings

FIG. 1 is a schematic flow diagram of the invention as a whole;

FIG. 2 is a network architecture diagram of the image and text training portion of the present invention;

FIG. 3 is a network architecture diagram of the present invention for extracting image and text features;

FIG. 4 is a schematic flow diagram of a feature post-processing portion of the present invention.

Detailed Description

The invention is described in further detail below with reference to the following detailed description and accompanying drawings:

Example 1: as depicted in fig. 1, a multi-modal commodity matching method based on images and texts comprises the following stages:

Step 1: data input stage.

Step 2: model construction stage, covering the image and text models.

Step 3: model training stage.

Step 4: feature extraction stage.

Step 5: feature post-processing stage.

Step 6: result post-processing stage.

In the data input stage, the input data is preprocessed. First, the image information and text information in the library are separated, and samples belonging to the same class are labeled. The images are then cropped using a target detection algorithm and scaled to a specified size, and the corresponding data augmentation is applied. In parallel, the language of the text data is converted.

In the model construction stage, the deep neural networks and the TFIDF model are built according to the requirements of the task. The image part comprises EfficientNet-B3, EfficientNet-B5 and ECA-NFNet-L0, and the text part comprises a Bert model pre-trained on a large text corpus and a TFIDF model based on term frequency-inverse document frequency.

In the model training stage, ArcFace is used as the loss function, and the CNN models and the Bert model are each trained as classifiers on the constructed training set.

In the feature extraction stage, the models trained in the model training stage and the TFIDF model are used to extract discriminative image and text features, and the image and text features are concatenated to obtain multi-modal features.

In the feature post-processing stage, query expansion is applied separately to the obtained image, text and multi-modal features to increase the information carried by each query vector, and KNN is then used to obtain the matching samples.

In the result post-processing stage, the results of the different modalities are fused by a dynamic threshold method. The fused predictions are then screened to ensure that the predictions of different samples are consistent.

The details of each stage are described below.

(1) Data input stage. To meet the requirements of model training, the data in the data set must be preprocessed. First, commodities of the same kind in the data set are grouped to generate class labels. For the image data, the model should focus on matching the commodity itself rather than the background, so the target detection algorithm YOLO is used to obtain the coordinates of the commodity in the image, and the region containing the commodity body is cropped out according to these coordinates. One image may contain several detection results, in which case the minimum and maximum x and y coordinates over all detection results are used as the crop boundary. Suppose n targets are detected, $x_{i,1}$ is the left abscissa of the i-th target and $x_{i,2}$ is its right abscissa; the horizontal boundaries $x_1$ and $x_2$ are then

$$x_1 = \min(\{x_{1,1}, x_{2,1}, \ldots, x_{n,1}\})$$

$$x_2 = \max(\{x_{1,2}, x_{2,2}, \ldots, x_{n,2}\})$$

The vertical boundaries are obtained in the same way from the y coordinates. The cropped image is then scaled to 512 × 512 px, and data augmentation is applied using random horizontal flipping, random vertical flipping, random brightness variation and the like. Finally, the image data is normalized. For text data, because different languages are present, the text is first translated into English so that a Bert model pre-trained on English data can be used to better effect; this facilitates the subsequent matching process and avoids the special case of texts with the same semantics but different languages.
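The crop-boundary rule described in this stage can be sketched in a few lines. This is a minimal illustration, not the inventors' implementation: the function name `union_crop_box` and the `(x_left, y_top, x_right, y_bottom)` box layout are assumptions, and the detector itself (YOLO in the text) is replaced by hard-coded example boxes.

```python
from typing import List, Tuple

# Assumed box layout: (x_left, y_top, x_right, y_bottom) in pixels.
Box = Tuple[float, float, float, float]


def union_crop_box(boxes: List[Box]) -> Box:
    """Return the smallest box enclosing every detection.

    The left/top edges take the minimum over all detections and the
    right/bottom edges take the maximum, as the text describes.
    """
    if not boxes:
        raise ValueError("at least one detection is required")
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)


# Two hypothetical detections on the same product photo.
detections = [(40, 60, 200, 220), (150, 30, 320, 210)]
crop = union_crop_box(detections)
```

The resulting box would then be used to crop the image before it is resized to 512 × 512 px.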

(2) The model construction stage covers two categories of model, image models and text models, as shown in fig. 1. For image input, the invention adopts three models: EfficientNet-B3, EfficientNet-B5 and ECA-NFNet-L0. The output dimension of the FC layer of each is set to 512. EfficientNet is an efficient network obtained through neural architecture search; it achieves better performance with less computation and suits the application scenario of the invention. ECA-NFNet-L0 integrates the efficient channel attention (ECA) mechanism into the normalizer-free network NFNet and replaces the original SiLU activation function with Mish to improve model performance. With input x and output f(x), Mish is expressed as

$$f(x) = x \cdot \tanh\left(\log\left(1 + e^{x}\right)\right)$$
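For readers who want to evaluate the Mish expression numerically, a small standard-library sketch follows. The helper name `softplus` and the overflow-safe rewriting of log(1 + e^x) are choices made here for illustration; deep learning frameworks ship their own implementations.

```python
import math


def softplus(x: float) -> float:
    """log(1 + e^x), rewritten so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(log(1 + e^x))."""
    return x * math.tanh(softplus(x))


# Mish passes large positive inputs almost unchanged and squashes
# large negative inputs toward zero.
samples = [mish(v) for v in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```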

For text input, the invention adopts two models, Sentence-Bert and TFIDF. For Bert, the [CLS] token in the last-layer output is used for subsequent processing, and the output dimension is set to 768. TFIDF extracts text features by counting the frequency of each word in a text and its importance weight; PCA is used to reduce the dimensionality of the high-dimensional TFIDF features, fixing the feature dimension at 768. Let n be the number of occurrences of a given word in the commodity description, N the total number of occurrences of all words in the commodity description, b the number of text descriptions in which the word occurs, and B the total number of text descriptions in the whole library; the TFIDF is then calculated as
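The TFIDF formula itself did not survive in this text. As an assumption, the textbook form consistent with the four counts named above is given here, writing n for the occurrences of the word in the description, N for the occurrences of all words in the description, b for the number of descriptions containing the word, and B for the total number of descriptions. Practical implementations often add smoothing terms to the logarithm; whether the invention does so is not stated.

```latex
\mathrm{TFIDF} = \frac{n}{N} \times \log\frac{B}{b}
```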

(3) In the model training stage, as shown in fig. 2, the invention converts the commodity matching problem into a metric learning problem: training aims to increase intra-class similarity and decrease inter-class similarity so that sufficiently discriminative features are learned. As a result, the matching effect is preserved even when commodities unseen in the training set appear in practical use. For the image model, Global Average Pooling (GAP) is applied after the last convolution layer, followed by Dropout and an FC layer that fixes the output dimension at 512. BatchNorm is then applied for normalization. Finally, the model is trained as a classifier with the Ranger optimizer and the ArcFace loss.

For the text model Bert, the 768-dimensional feature output at the [CLS] position of the last layer is extracted, and the model is likewise trained as a classifier over the commodity categories with the ArcFace loss.
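The loss formula is not reproduced in this text. For reference, the standard per-sample ArcFace loss is written below using the symbols the method names: s is the hypersphere radius, m the additive angular margin, $\theta_j$ the angle between the weight vector of class j and the input feature, $y_i$ the true class, and N the number of classes. That the invention uses exactly this form, without further modification, is an assumption.

```latex
L = -\log \frac{e^{\,s\cos(\theta_{y_i} + m)}}
               {e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j=1,\; j \neq y_i}^{N} e^{\,s\cos\theta_j}}
```

The margin m is added only to the angle of the true class, which forces features of the same commodity to cluster more tightly than plain softmax classification would require.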

(4) The feature extraction stage, as shown in fig. 3. The three image models trained in stage (3) are used with their network structure retained except for the ArcFace layer, and each extracts a 512-dimensional feature. The three extracted features are L2-normalized and then concatenated with weights into a 1536-dimensional image feature, which realizes the ensemble of the image models. For the text part, the Bert model trained in stage (3) extracts a 768-dimensional feature and the TFIDF model extracts a feature of the same 768 dimensions; the two text features are weighted and concatenated to give the final 1536-dimensional text feature.
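The normalize-then-concatenate fusion described in this stage can be sketched as follows. The function names and the equal weights are hypothetical (the text does not give the weight values), and toy 2-dimensional vectors stand in for the 512-dimensional embeddings.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def weighted_concat(features: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """L2-normalize each feature, scale it by its weight, and concatenate."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature is required")
    fused: List[float] = []
    for feature, weight in zip(features, weights):
        fused.extend(weight * x for x in l2_normalize(feature))
    return fused


# Toy stand-ins for the three image-model embeddings; weights are assumed equal.
image_feature = weighted_concat(
    [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]],
    [1.0, 1.0, 1.0],
)
```

With three 512-dimensional inputs the same call would produce the 1536-dimensional image feature; the text feature and the 3072-dimensional multi-modal feature follow the same pattern.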

(5) The feature post-processing stage, as shown in fig. 4. The image and text features extracted in stage (4) are concatenated into a 3072-dimensional multi-modal feature, which supplements the single-modal information and provides a complementary perspective. Query expansion is then performed separately on the image, text and multi-modal features. Specifically, KNN is used to calculate the cosine distance between the current sample and the other features of the same modality, giving the N most similar features. The P most similar of these N features are selected, the cosine similarity of each raised to the power α is used as its weight, and the weighted sum of the P features becomes the new feature of the current sample. This feature takes the neighboring features into account and effectively improves the recall rate. The process is repeated m times; because neighboring features become more and more similar over the repetitions, P is gradually reduced to avoid a drop in accuracy.
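A minimal sketch of one round of this query expansion is shown below, assuming a small in-memory database that contains the query itself. The names `cosine` and `expand_query`, the clipping of negative similarities to zero, and the final re-normalization are assumptions made for illustration; the text specifies only the similarity-to-the-power-α weighting of the top P neighbors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_query(query: Vector, database: Sequence[Vector],
                 p: int, alpha: float) -> List[float]:
    """Replace the query by a similarity**alpha weighted sum of its top-p neighbors."""
    ranked = sorted(((cosine(query, f), f) for f in database),
                    key=lambda pair: pair[0], reverse=True)
    expanded = [0.0] * len(query)
    for sim, feature in ranked[:p]:
        weight = max(sim, 0.0) ** alpha  # assumption: negative similarities get zero weight
        for d, value in enumerate(feature):
            expanded[d] += weight * value
    norm = math.sqrt(sum(v * v for v in expanded))
    # Assumption: re-normalize so later cosine comparisons stay on the unit sphere.
    return [v / norm for v in expanded] if norm > 0.0 else list(query)


database = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = database[0]
expanded = expand_query(query, database, p=2, alpha=3.0)
```

In this toy case the expanded query moves toward its nearest neighbor `[0.8, 0.6]` while the dissimilar vector `[0.0, 1.0]` is ignored. Repeating the call with a smaller `p` each round would mirror the shrinking-P schedule described above.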

(6) The result post-processing stage. KNN is used once more to calculate cosine similarity for the processed image, text and multi-modal features obtained in stage (5), giving the N most similar features for each, sorted by cosine distance in ascending order. The invention then screens the three sets of results with a dynamic threshold method. Specifically, instead of a single fixed threshold per sample, the threshold is looped from small to large using a lower bound L, an upper bound H and a step S. The specific parameter settings depend on the discriminative power of the model. The loop exits as soon as the number of matches for the sample at some threshold is greater than or equal to k (that is, at least k matches are found). To further balance precision and recall, at each threshold iteration the matching labels of the image, text and multi-modal features that satisfy their respective thresholds are obtained, and the three sets of labels are merged to form the final result.
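The dynamic threshold loop can be sketched as below. This is one reading of the description, not a definitive implementation: the function name, the dictionary-based inputs, the use of a set union for "merged", and the shared stride with per-modality lower and upper bounds are all assumptions, and the distances are made-up example values.

```python
from typing import Dict, List, Set, Tuple

Neighbors = List[Tuple[str, float]]  # (sample id, cosine distance to the query)


def dynamic_threshold_match(neighbors: Dict[str, Neighbors],
                            low: Dict[str, float],
                            high: Dict[str, float],
                            stride: float,
                            k: int) -> Set[str]:
    """Widen per-modality distance thresholds in lockstep until at least k matches are found."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    thresholds = dict(low)
    while True:
        matched: Set[str] = set()
        for modality, candidates in neighbors.items():
            # Keep candidates whose distance is below this modality's current threshold.
            matched |= {sid for sid, dist in candidates if dist < thresholds[modality]}
        if len(matched) >= k:
            return matched
        if all(thresholds[m] >= high[m] for m in thresholds):
            return matched  # upper bounds reached; return what was found
        for m in thresholds:
            thresholds[m] = min(thresholds[m] + stride, high[m])


neighbors = {
    "image": [("a", 0.10), ("b", 0.35), ("c", 0.60)],
    "text": [("a", 0.05), ("d", 0.32)],
    "multimodal": [("a", 0.08), ("b", 0.28)],
}
low = {"image": 0.2, "text": 0.2, "multimodal": 0.2}
high = {"image": 0.5, "text": 0.5, "multimodal": 0.5}
result = dynamic_threshold_match(neighbors, low, high, stride=0.1, k=3)
```

In this example the loop stops on its third pass, once samples `a`, `b` and `d` fall under the widened thresholds; sample `c` at distance 0.60 is never admitted because it lies beyond the upper bound.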

Finally, the matching results are screened further. If the first n predicted results of two samples are the same, the predicted results of the two samples are merged and used as the final result for both. In addition, to further increase matching accuracy, when sample A matches sample B but sample B does not match sample A, sample B is deleted from the matching result of sample A so that the predictions remain consistent.
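The second screening rule (dropping one-directional matches) can be sketched as follows. The function name and the dictionary-of-sets representation are assumptions, and the first rule (merging samples whose first n predictions agree) is not covered by this sketch.

```python
from typing import Dict, Set


def keep_mutual_matches(matches: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Remove B from A's matches whenever A does not appear in B's matches."""
    filtered: Dict[str, Set[str]] = {}
    for a, partners in matches.items():
        filtered[a] = {
            b for b in partners
            if b == a or a in matches.get(b, set())
        }
    return filtered


# A claims C as a match, but C does not claim A, so C is removed from A's result.
matches = {
    "A": {"A", "B", "C"},
    "B": {"B", "A"},
    "C": {"C"},
}
consistent = keep_mutual_matches(matches)
```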

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims

1. A multi-mode commodity matching method based on images and texts is characterized by comprising the following steps:

Step 6: query expansion is performed on the image, text and multi-modal matching results using the top P ranked samples of each, where P < N; that is, the similarities of the top P samples are used as weights, and the neighborhood features of the sample are weighted and summed to form a new query vector. Let f(q) be the query vector, $f_q(top_i)$ be the i-th feature closest to query q, and $\alpha$ be a weight hyperparameter; query expansion is then implemented as

Step 7: for the image, text and multi-modal features after query expansion, KNN is used again to calculate the similarity between each sample and all other samples for each of the three feature types; this step can be repeated multiple times, with P reduced as the repetitions proceed;

Step 8: dynamic thresholds are set separately for the image, text and multi-modal features to obtain the matching results of each under its threshold; the three sets of results are then considered jointly, and the samples satisfying the thresholds are recalled. With k the minimum required number of matches and stride the step by which each threshold changes, the process is as follows

2. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 1, the commodity to be matched is cropped out by a target detection algorithm to eliminate the influence of irrelevant background on subsequent matching, the cropped image is scaled to 512 × 512 px, and finally data augmentation is performed.

3. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 2, the image model consists of EfficientNet and a normalizer-free network (NFNet) with the ECA attention mechanism, the text model consists of Sentence-Bert and TFIDF, training uses ArcFace as the loss function, samples belonging to the same commodity are regarded as the same class, and classification training is carried out on the training set.

4. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 3, image features are extracted with the image models trained in step 2, the features of the several image models are normalized and then concatenated to serve as the ensemble image feature, and the K nearest neighbor method is applied to the ensemble features to calculate cosine distances, yielding the top N samples ranked by image similarity.

5. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 4, text features are extracted with the Bert model trained in step 2 and the TFIDF model, dimension reduction is applied to the TFIDF features, the features of the two are normalized and then concatenated to serve as the ensemble text feature, and the K nearest neighbor method is applied to the ensemble features to calculate cosine distances, yielding the top N samples ranked by text similarity.

6. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 5, the image and text features obtained in steps 3 and 4 are concatenated, and the K nearest neighbor method is used again to calculate the cosine distances of the concatenated features, yielding the top N samples ranked by multi-modal similarity.

7. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 6, for each commodity, the P most similar samples are selected from each of the top-N image, text and multi-modal matching samples obtained in steps 3, 4 and 5 and used for query expansion; that is, the similarities of the top P matching samples are used as weights, and the features of the top P samples are weighted and summed to form a new query vector.

8. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 7, the new weighted image, text and multi-modal query vectors obtained in step 6 are used to calculate cosine similarity again by the K nearest neighbor method, and the top N samples ranked by the recalculated similarity are retained; this step can be iterated multiple times, with P reduced as the iterations proceed.

9. The multi-modal commodity matching method based on images and texts according to claim 1, wherein in step 8, for each commodity, the top-N image, text and multi-modal matching results finally obtained in step 7 are taken, dynamic thresholds are set for the three according to the magnitude of similarity and changed synchronously, and the samples falling below the respective thresholds of the three are merged and added to the final result; if the number of matches after merging is larger than k, the threshold loop exits, otherwise the thresholds continue to be widened.