CN112862569B - Product appearance style evaluation method and system based on image and text multi-modal data - Google Patents


Info

Publication number
CN112862569B
CN112862569B (application CN202110241232.6A; also published as CN112862569A)
Authority
CN
China
Prior art keywords
style
image
product
aesthetic
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110241232.6A
Other languages
Chinese (zh)
Other versions
CN112862569A (en)
Inventor
朱思羽
戚进
胡洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110241232.6A
Publication of CN112862569A
Application granted
Publication of CN112862569B
Legal status: Active

Classifications

    • G06Q30/0629: Commerce; buying, selling or leasing transactions; electronic shopping; item investigation; directed, with specific intent or strategy for generating comparisons
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Pattern recognition; classification techniques
    • G06F40/194: Handling natural language data; text processing; calculation of difference between files
    • G06F40/247: Handling natural language data; lexical tools; thesauruses; synonyms
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/045: Neural networks; combinations of networks
    • G06Q30/0201: Marketing; market modelling; market analysis; collecting market data
    • Y02P90/30: Climate change mitigation technologies in the production or processing of goods; computing systems specially adapted for manufacturing

Abstract

The invention provides a product appearance style evaluation method and system based on image and text multi-modal data, comprising: an image aesthetic style model, a multilayer convolutional neural network that takes a color image as input and outputs a multi-dimensional image style classification; an image aesthetic style prediction algorithm, which uses pre-training and transfer learning to predict the style type of a product image; a semantic emotion analysis module, which processes online user comments using the style labels of the image aesthetic style prediction algorithm and computes the product style tendency fed back by users; and a multi-modal fusion evaluation module, which fuses the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis and provides a product evaluation result for appearance style. The method integrates product image information with user-feedback text information, realizes appearance-style product evaluation based on data modeling and analysis, and is more objective, scientific, and accurate than the traditional expert evaluation method.

Description

Product appearance style evaluation method and system based on image and text multi-modal data
Technical Field
The invention relates to the technical field of multi-modal data, in particular to a product appearance style evaluation method and system based on image and text multi-modal data.
Background
With consumers' requirements becoming more comprehensive and the variety of commodities growing in recent years, the influence of product appearance on consumers' purchasing decisions is also increasing. For many everyday consumer products such as radios and hair dryers, appearance is becoming a decisive factor in product success. The aesthetic style of a product's appearance is central to its overall look and is closely related to the type of user it attracts. An aesthetic style is generally an abstract aesthetic concept described by specific vocabulary; it carries a degree of subjectivity and fuzziness, and may differ from the aesthetic association that vocabulary conveys to a user. The aesthetic style a product designer intends to convey is generally embodied in the product image, while the style users actually experience often appears in their feedback comments; the difference between the two reflects how successfully the product's style is presented: the more successful the appearance design, the closer the intended aesthetic style is to the style users actually feed back.
Image aesthetic style analysis builds on image processing and analysis: by modeling the mapping between images and aesthetic style labels, it discovers the regularities of the aesthetic styles images present, and can therefore be used to predict the aesthetic style of a product image. Aesthetic styles are broadly universal; for example, styles suited to images of landscapes or people can also describe product appearance, so an image-to-aesthetic-style mapping learned on an existing large-scale image aesthetic style classification dataset can be adapted to product images with relatively small adjustments. AVA (A Large-Scale Database for Aesthetic Visual Analysis) is an image aesthetics dataset containing over 250,000 labeled images with 14 aesthetic style labels in total. A smaller labeled product image dataset can be created for a specific product field: only some product images need to be collected and labeled for style, and after data enhancement the dataset is completed at low cost.
Semantic emotion analysis is a rapidly developing semantic processing technology for analyzing emotional tendency in text; by processing and analyzing a text, it obtains the emotional tendencies toward the features the text reflects. These features may be concrete things, such as products, or abstract concepts, such as particular aesthetic styles. Emotional tendency is generally bipolar (positive or negative): the more positive the tendency, the more strongly the corresponding feature is embodied.
The traditional appearance style evaluation method is mainly expert scoring, whose drawback is strong subjectivity; for abstract and fuzzy tasks such as appearance style evaluation, this drawback is even more pronounced.
Patent document CN106600385A (application number: CN201611251457.5) discloses an online product analysis system based on user tracking, which includes a user comment data module, a text data module, an image data module, a text data analysis module, an image data analysis module, a comprehensive evaluation analysis module, and a user interaction module. The user comment data module extracts comment data from commodity users and is connected to the text data module and the image data module respectively; the text data module is connected to the text data analysis module, which is connected to the comprehensive evaluation analysis module; the image data module is connected to the image data analysis module, which is connected to the comprehensive evaluation analysis module; and the comprehensive evaluation analysis module is connected to the user interaction module. Because that method is trained on a model and an algorithm, its results are more realistic and accurate.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a product appearance style evaluation method and system based on image and text multi-modal data.
The product appearance style evaluation method based on image and text multi-modal data comprises: constructing an image aesthetic style model, and performing semantic emotion analysis and multi-modal fusion evaluation using an image aesthetic style prediction algorithm;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis comprises: processing online user comments using the style labels of the image aesthetic style prediction algorithm, and computing the product style tendency fed back by users;
the multi-modal fusion evaluation comprises: fusing the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and providing a product evaluation result for appearance style.
Preferably, the image aesthetic style model comprises, connected in sequence:
- an input layer: the input is a color image scaled to 224 × 224; the input dimension is b × 224 × 224 × 3, where b is the batch size;
- 4 convolutional layers, kernel size 9 × 9, stride 1, 64 kernels, ReLU activation;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 7 × 7, stride 1, 64 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 5 × 5, stride 1, 128 kernels, ReLU activation;
- a Dropout layer, dropout probability 0.1;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 3 × 3, stride 1, 128 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- a Flatten layer, unrolling the b × 14 × 14 × 128 feature map into vectors of length 14 × 14 × 128 (one per sample);
- a fully connected layer outputting the style classification result, with 14 output nodes corresponding to the 14 style labels; the activation function is Softmax.
Preferably, the loss function of the image aesthetic style model is a minimum cross entropy loss function, an Adam optimizer is used for weight updating, and the learning rate is set to be 0.0001.
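For illustration, the architecture and training configuration above can be assembled in Keras roughly as follows. This is a minimal sketch, not the patent's reference implementation; in particular, "same" padding is an assumption made so that four 2 × 2 poolings reduce the 224 × 224 input to the 14 × 14 × 128 feature map described above.

```python
# Sketch of the image aesthetic style model (illustrative; assumes TensorFlow 2.x).
# padding="same" is an assumption consistent with the stated 14 x 14 x 128 feature map.
from tensorflow.keras import layers, models, optimizers

def build_style_model(num_styles=14):
    m = models.Sequential()
    m.add(layers.Conv2D(64, 9, strides=1, padding="same", activation="relu",
                        input_shape=(224, 224, 3)))            # b x 224 x 224 x 3 input
    for _ in range(3):                                         # remaining 9x9 conv layers
        m.add(layers.Conv2D(64, 9, strides=1, padding="same", activation="relu"))
    m.add(layers.BatchNormalization())
    m.add(layers.MaxPooling2D(2))                              # 224 -> 112
    for _ in range(3):                                         # 3 conv layers, 7x7, 64 kernels
        m.add(layers.Conv2D(64, 7, strides=1, padding="same", activation="relu"))
    m.add(layers.MaxPooling2D(2))                              # 112 -> 56
    for _ in range(3):                                         # 3 conv layers, 5x5, 128 kernels
        m.add(layers.Conv2D(128, 5, strides=1, padding="same", activation="relu"))
    m.add(layers.Dropout(0.1))
    m.add(layers.BatchNormalization())
    m.add(layers.MaxPooling2D(2))                              # 56 -> 28
    for _ in range(3):                                         # 3 conv layers, 3x3, 128 kernels
        m.add(layers.Conv2D(128, 3, strides=1, padding="same", activation="relu"))
    m.add(layers.MaxPooling2D(2))                              # 28 -> 14: b x 14 x 14 x 128
    m.add(layers.Flatten())                                    # vector of length 14*14*128
    m.add(layers.Dense(num_styles, activation="softmax"))      # 14 style probabilities
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-4),   # Adam, lr = 0.0001
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```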
Preferably, the image aesthetic style prediction algorithm adopts a transfer learning strategy: it first pre-trains on the large-scale image aesthetic style classification dataset AVA using the dataset's 14 style labels, then fine-tunes on a small product image style dataset of the specific product field labeled with the same 14 style labels, and tests on an unlabeled test set;
the prediction output of the image aesthetic style model for a test image is the style prediction result of that image, a 14-dimensional vector P = (P_1, P_2, …, P_14) satisfying:
Σ_i P_i = 1
where P_i denotes the probability that the image belongs to the i-th style.
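A hedged sketch of the pre-train / fine-tune / predict sequence follows; load_ava() and load_product_set() are hypothetical loaders standing in for the AVA dataset and the small product image style dataset, and the batch sizes and epoch counts are assumptions:

```python
# Illustrative transfer-learning loop: pre-train on AVA, fine-tune on the small
# product-image style set, then predict the 14-dim style vector P for test images.
# load_ava() / load_product_set() / x_test are hypothetical placeholders.
model = build_style_model(num_styles=14)            # sketch defined above

x_ava, y_ava = load_ava()                           # AVA images + one-hot 14-way labels
model.fit(x_ava, y_ava, batch_size=64, epochs=20)   # pre-training

x_prod, y_prod = load_product_set()                 # small labeled product-image set
model.optimizer.learning_rate.assign(5e-5)          # lower fine-tuning rate (see Example 2)
model.fit(x_prod, y_prod, batch_size=32, epochs=5)  # fine-tuning

P = model.predict(x_test)                           # each row sums to 1 via softmax
```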
Preferably, the semantic emotion analysis module processes online user comments using the 14 style labels of the image aesthetic style prediction algorithm: it finds synonyms of the 14 style labels with the synonym lookup method lemma_names of the WordNet semantic dictionary and expands each style label into a style word set, as follows:
Step 1: for the i-th style label word, look up the semantic set Synsets_i of that word in WordNet;
Step 2: for the j-th sense synset_ij in Synsets_i, find its synonym set lem_ij using the lemma_names method;
Step 3: all synonym sets lem_ij of the i-th style label word form the i-th style word set Set_i:
Set_i = ∪_j lem_ij
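As a concrete sketch of these three steps using nltk's WordNet interface (the 14 label words themselves come from the AVA label set and are not listed here; style_labels below is a placeholder):

```python
# Expand one style label word into its style word set Set_i via WordNet.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def style_word_set(label_word):
    """Set_i = union over senses j of lemma_names(synset_ij)."""
    words = set()
    for synset in wn.synsets(label_word):   # Synsets_i: all senses of the label word
        words.update(synset.lemma_names())  # lem_ij: synonym lemmas of the j-th sense
    return words

style_sets = [style_word_set(label) for label in style_labels]  # style_labels: 14 label words
```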
Preferably, the semantic emotion analysis comprises: after the style labels are expanded into style word sets, collecting online user comments for a given product from an online e-commerce platform, and cleaning and preprocessing the comment text, as follows:
Step 1: text collection, using the Python urllib library to automatically collect online user comments;
Step 2: text cleaning, including screening out repeated sentences, sentences not in the target language, and sentences containing only non-text content, and removing misspelled words;
Step 3: text preprocessing, including converting all characters to lower case, removing nonstandard punctuation, removing stop words, and converting all verbs to the present tense.
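A minimal sketch of the preprocessing step with nltk; the tokenizer, the English stop-word list, and lemmatization with pos="v" as the stand-in for tense normalization are assumptions, not the patent's exact pipeline:

```python
# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())                     # all characters to lower case
    tokens = [t for t in tokens if t not in string.punctuation]  # drop stray punctuation
    tokens = [t for t in tokens if t not in stop_words]          # remove stop words
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens]    # verbs to base/present form
```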
Preferably, the semantic emotion analysis module comprises: after the comment text is cleaned and preprocessed, computing the similarity between each word in the text and each of the 14 style word sets using the semantic similarity measure lin_similarity provided by WordNet; after all texts are processed, the similarity results of all words are aggregated into the user-feedback style tendency. The similarity Sim_{k,i} between the k-th word w_k and the i-th style word set Set_i is:
Sim_{k,i} = max_t Sim_{k,i,t}
where Sim_{k,i,t} is the similarity between the k-th word w_k and the t-th word of the i-th style word set Set_i:
Sim_{k,i,t} = max_{m,n} lin_similarity(synset_km, synset_itn)
where synset_km is the m-th sense in the semantic set Synsets_k of the k-th word w_k, synset_itn is the n-th sense in the semantic set Synsets_it of the t-th word of the i-th style word set Set_i, and lin_similarity is the semantic similarity measure provided by WordNet;
aggregating the similarity results of all words, the normalized tendency value O_i of the i-th style fed back by users is:
O'_i = Σ_k Sim_{k,i}
O_i = O'_i / Σ_i O'_i
Finally, the user-feedback style tendency of the product, O = (O_1, O_2, …, O_14), is obtained as the output of the semantic emotion analysis.
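The computation can be sketched with nltk as below. lin_similarity requires an information-content corpus; the Brown corpus file used here is an assumption (the patent does not specify one), and the measure is undefined across parts of speech, hence the try/except:

```python
# Requires: nltk.download("wordnet"), nltk.download("wordnet_ic")
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # assumed information-content corpus

def word_pair_sim(w_k, w_t):
    """Sim_{k,i,t}: max lin_similarity over all sense pairs (m, n) of the two words."""
    best = 0.0
    for s1 in wn.synsets(w_k):
        for s2 in wn.synsets(w_t):
            try:
                best = max(best, s1.lin_similarity(s2, brown_ic))
            except Exception:              # undefined for cross-POS synset pairs
                pass
    return best

def style_tendency(comment_words, style_sets):
    """O: per-style sums O'_i of Sim_{k,i} = max_t Sim_{k,i,t}, normalized to sum to 1."""
    raw = [sum(max((word_pair_sim(w, t) for t in s), default=0.0)
               for w in comment_words)
           for s in style_sets]
    total = sum(raw)
    return [o / total for o in raw] if total else raw
```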
Preferably, the multi-modal fusion evaluation comprises: comparing the style prediction value P output by the image aesthetic style prediction algorithm with the style tendency feedback value O output by the semantic emotion analysis module;
the absolute difference |P_i - O_i| between corresponding elements of P and O represents, from the viewpoint of the i-th aesthetic style, the size of the difference between the information conveyed by the product image and the user feedback; the sum over all style labels, Σ_i |P_i - O_i|, represents the difference between the overall aesthetic style of the product image and the style fed back by users; the indices |P_i - O_i| and Σ_i |P_i - O_i| assist in evaluating how successfully the product presents its style: the larger the index value, the larger the difference between the product image and the user feedback, and the less successful the style presentation.
Preferably, the multi-modal fusion evaluation comprises: comparing and fusing the style prediction value P output by the image aesthetic style prediction algorithm with the style tendency feedback value O output by the semantic emotion analysis to obtain a comprehensive product appearance style evaluation F = (F_1, F_2, …, F_14), where the comprehensive evaluation F_i of the i-th aesthetic style is the fusion of P_i and O_i:
F'_i = (P_i + O_i) / (2 * |P_i - O_i|)
F_i = F'_i / Σ_i F'_i
where P_i denotes the probability that the image belongs to the i-th style and O_i denotes the normalized tendency value of the i-th style from user feedback.
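A small numeric sketch of the comparison and fusion; the epsilon in the denominator is an added assumption to keep F'_i finite when P_i and O_i coincide exactly, a case the patent's formula leaves undefined:

```python
import numpy as np

def fuse(P, O, eps=1e-8):
    """P, O: 14-dim arrays that each sum to 1."""
    diff = np.abs(P - O)                 # per-style gap |P_i - O_i|
    gap = diff.sum()                     # overall presentation index, sum_i |P_i - O_i|
    F_raw = (P + O) / (2 * diff + eps)   # F'_i = (P_i + O_i) / (2 * |P_i - O_i|)
    return F_raw / F_raw.sum(), gap      # F_i normalized so the 14 values sum to 1
```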
The product appearance style evaluation system based on image and text multi-modal data comprises an image aesthetic style model, an image aesthetic style prediction algorithm, a semantic emotion analysis module, and a multi-modal fusion evaluation module;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis module processes online user comments using the style labels of the image aesthetic style prediction algorithm and computes the product style tendency fed back by users;
the multi-modal fusion evaluation module fuses the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and provides a product evaluation result for appearance style.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method integrates product image information with user-feedback text information, enabling appearance-style product evaluation based on data modeling and analysis; it is more objective, scientific, and accurate than the traditional expert evaluation method;
(2) Through semantic emotion analysis, the invention can rapidly analyze large volumes of text, which is important in the context of Internet big data;
(3) Through multi-modal data, the invention fuses data of different modalities such as images, text, and speech; multiple information sources complement one another and reflect the true information more accurately than single-modal data.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method for product evaluation based on multimodal data in accordance with the present invention;
FIG. 2 is a schematic structural diagram of an image aesthetic style model according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention, and all of them fall within the scope of the present invention.
Example 1:
The product appearance style evaluation method based on image and text multi-modal data comprises: constructing an image aesthetic style model, and performing semantic emotion analysis and multi-modal fusion evaluation using an image aesthetic style prediction algorithm;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis comprises: processing online user comments using the style labels of the image aesthetic style prediction algorithm, and computing the product style tendency fed back by users;
the multi-modal fusion evaluation comprises: fusing the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and providing a product evaluation result for appearance style.
As shown in fig. 2, the image aesthetic style model comprises sequentially connected:
- an input layer: the input is a color image scaled to 224 × 224; the input dimension is b × 224 × 224 × 3, where b is the batch size batch_size;
- 4 convolutional layers, kernel size 9 × 9, stride 1, 64 kernels, ReLU activation;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 7 × 7, stride 1, 64 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 5 × 5, stride 1, 128 kernels, ReLU activation;
- a Dropout layer, dropout probability 0.1;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 3 × 3, stride 1, 128 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- a Flatten layer, unrolling the b × 14 × 14 × 128 feature map into vectors of length 14 × 14 × 128 (one per sample);
- a fully connected layer outputting the style classification result, with 14 output nodes corresponding to the 14 style labels; the activation function is Softmax.
The loss function of the image aesthetic style model is the minimized cross-entropy loss; the Adam optimizer is used for weight updates, with the learning rate set to 0.0001. The image aesthetic style prediction algorithm adopts a transfer learning strategy: it first pre-trains on the large-scale image aesthetic style classification dataset AVA using the dataset's 14 style labels, then fine-tunes on a small product image style dataset of the specific product field labeled with the same 14 style labels, and tests on an unlabeled test set. The prediction output of the image aesthetic style model for a test image is the style prediction result of that image, a 14-dimensional vector P = (P_1, P_2, …, P_14) satisfying:
Σ_i P_i = 1
where P_i denotes the probability that the image belongs to the i-th style.
The semantic emotion analysis module processes online user comments using the 14 style labels of the image aesthetic style prediction algorithm: it finds synonyms of the 14 style labels with the synonym lookup method lemma_names of the WordNet semantic dictionary and expands each style label into a style word set, as follows:
Step 1: for the i-th style label word, look up the semantic set Synsets_i of that word in WordNet;
Step 2: for the j-th sense synset_ij in Synsets_i, find its synonym set lem_ij using the lemma_names method;
Step 3: all synonym sets lem_ij of the i-th style label word form the i-th style word set Set_i:
Set_i = ∪_j lem_ij
The semantic emotion analysis comprises: after the style labels are expanded into style word sets, collecting online user comments for a given product from an online e-commerce platform, and cleaning and preprocessing the comment text, as follows:
Step 1: text collection, using the Python urllib library to automatically collect online user comments;
Step 2: text cleaning, including screening out repeated sentences, sentences not in the target language, and sentences containing only non-text content, and removing misspelled words;
Step 3: text preprocessing, including converting all characters to lower case, removing nonstandard punctuation, removing stop words, and converting all verbs to the present tense.
The semantic emotion analysis module comprises: after the comment text is cleaned and preprocessed, computing the similarity between each word in the text and each of the 14 style word sets using the semantic similarity measure lin_similarity provided by WordNet; after all texts are processed, the similarity results of all words are aggregated into the user-feedback style tendency. The similarity Sim_{k,i} between the k-th word w_k and the i-th style word set Set_i is:
Sim_{k,i} = max_t Sim_{k,i,t}
where Sim_{k,i,t} is the similarity between the k-th word w_k and the t-th word of the i-th style word set Set_i:
Sim_{k,i,t} = max_{m,n} lin_similarity(synset_km, synset_itn)
where synset_km is the m-th sense in the semantic set Synsets_k of the k-th word w_k, synset_itn is the n-th sense in the semantic set Synsets_it of the t-th word of the i-th style word set Set_i, and lin_similarity is the semantic similarity measure provided by WordNet;
aggregating the similarity results of all words, the normalized tendency value O_i of the i-th style fed back by users is:
O'_i = Σ_k Sim_{k,i}
O_i = O'_i / Σ_i O'_i
Finally, the user-feedback style tendency of the product, O = (O_1, O_2, …, O_14), is obtained as the output of the semantic emotion analysis.
The multi-modal fusion evaluation comprises: comparing the style prediction value P output by the image aesthetic style prediction algorithm with the style tendency feedback value O output by the semantic emotion analysis module;
the absolute difference |P_i - O_i| between corresponding elements of P and O represents, from the viewpoint of the i-th aesthetic style, the size of the difference between the information conveyed by the product image and the user feedback; the sum over all style labels, Σ_i |P_i - O_i|, represents the difference between the overall aesthetic style of the product image and the style fed back by users; the indices |P_i - O_i| and Σ_i |P_i - O_i| assist in evaluating how successfully the product presents its style: the larger the index value, the larger the difference between the product image and the user feedback, and the less successful the style presentation.
The multi-modal fusion evaluation further comprises: comparing and fusing the style prediction value P with the style tendency feedback value O to obtain a comprehensive product appearance style evaluation F = (F_1, F_2, …, F_14), where the comprehensive evaluation F_i of the i-th aesthetic style is the fusion of P_i and O_i:
F'_i = (P_i + O_i) / (2 * |P_i - O_i|)
F_i = F'_i / Σ_i F'_i
where P_i denotes the probability that the image belongs to the i-th style and O_i denotes the normalized tendency value of the i-th style from user feedback.
The product appearance style evaluation system based on image and text multi-modal data comprises an image aesthetic style model, an image aesthetic style prediction algorithm, a semantic emotion analysis module, and a multi-modal fusion evaluation module, as shown in FIG. 1;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis module processes online user comments using the style labels of the image aesthetic style prediction algorithm and computes the product style tendency fed back by users;
the multi-modal fusion evaluation module fuses the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and provides a product evaluation result for appearance style.
The present invention will be described in more detail below by way of preferred examples.
Example 2:
The method is based on image aesthetic analysis and semantic emotion analysis technology. Relying on today's mature online e-commerce platforms, it performs data modeling and analysis on large volumes of quickly obtained product images and text comment data, automatically predicts the product image style, automatically analyzes the style feedback in comment texts, and, through comparison and fusion of the multi-modal data, provides intelligent support for product evaluation in terms of appearance style.
The invention provides a product evaluation method based on multi-modal data, which comprises the following steps:
Step 1: construct a product image style classification dataset for a specific product field. Collect product images of the specific product field on the Internet, manually label their aesthetic styles according to the AVA dataset standard, and then apply a data enhancement step to form a small product image style classification dataset;
Step 2: construct the image aesthetic style model and pre-train it with the AVA dataset, using the Adam optimizer with a learning rate of 0.0001;
Step 3: fine-tune the model with the constructed product image style classification dataset, using the Adam optimizer with a learning rate of 0.00005;
Step 4: test the model on an unlabeled product image dataset; the aesthetic style prediction output for a product image is its style prediction value P;
Step 5: according to the 14 style labels of the image aesthetic style prediction algorithm, expand each style label into a style word set Set_i using the lemma_names method of the semantic dictionary WordNet;
Step 6: collect user comments from the Amazon.com e-commerce platform, and clean and preprocess the text;
Step 7: compute and aggregate the similarity of all words to each of the 14 style word sets to obtain the user-feedback style tendency O of a given product;
Step 8: compare the style prediction value P with the style tendency feedback value O to draw conclusions about how successfully the product's appearance style is presented, and fuse P and O to obtain the comprehensive product appearance style evaluation F.
In Step 1, a product image style classification dataset is constructed. Sources of product images include commodity descriptions on e-commerce platforms, blogs, and forums of related products. After collection, the product images are uniformly scaled to 224 × 224; the aesthetic styles follow the 14 labels of the referenced AVA dataset. The data enhancement step enlarges the dataset and increases data diversity without affecting quality, and includes random rotation and cropping operations.
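For illustration, the enhancement step could be written with Keras preprocessing layers as below; the rotation range and the intermediate resize used for random cropping are assumptions:

```python
# Illustrative random rotation + random crop augmentation (TensorFlow 2.6+ layer names).
import tensorflow as tf
from tensorflow.keras import layers

augmenter = tf.keras.Sequential([
    layers.Resizing(240, 240),        # slight upsize so a crop is possible (assumed margin)
    layers.RandomRotation(0.05),      # small random rotation (range is an assumption)
    layers.RandomCrop(224, 224),      # random crop back to the model input size
])

augmented = augmenter(images, training=True)  # images: a batch of product images
```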
The image aesthetic style model of Step 2 consists, in order, of an input layer, sequentially stacked convolutional and pooling layers, a fully connected layer, and an output layer; the loss function is the minimized cross-entropy loss. The model is built with the TensorFlow and Keras deep learning frameworks.
In Step 3, the model is fine-tuned with the product image style classification dataset on the basis of Step 2. Because the pre-training of Step 2 already enables the model to extract low-level image features effectively, the dataset required for fine-tuning is smaller, and acceptable test-set accuracy can be reached with fewer training epochs.
The unlabeled product image dataset of Step 4 contains the product images corresponding to the user comments processed subsequently. Given an input product image, the model from Step 3 outputs the 14 style prediction values P, and the softmax function of the output layer ensures that the 14 values P_i sum to 1.
In Step 5, a style label is expanded into a style word set Set_i as follows: first, all senses of the i-th style label word are obtained with WordNet's semantic query method; then all senses are traversed, and the synonyms of each sense are obtained with WordNet's lemma_names method; finally, all synonyms of the i-th style label word are gathered into its style word set Set_i. Within a style word set, multi-word synonym phrases are joined with the '_' symbol; when the style word set is used in Step 6, words containing the '_' symbol are still handled in word form, since WordNet's built-in mechanisms process word senses and phrase senses compatibly. The WordNet semantic dictionary runs on the nltk (Natural Language Toolkit) library provided by the Python language.
In Step 6, user comments on the Amazon.com e-commerce platform are collected, cleaned, and preprocessed using the urllib, nltk, and Beautiful Soup libraries; the user comments correspond to the product pictures whose image aesthetic styles are predicted. First, the urllib library fetches the source code of the e-commerce platform's user comment pages, the Beautiful Soup library parses it, and the comment text of all verified users is extracted; repeated sentences are screened out by comparing text content. All words are then looked up in the WordNet semantic dictionary; words that are not English words, misspelled words, emoticons, and the like cannot be found in WordNet, so the corresponding comment sentences are screened out and misspelled words are removed with manual screening. The cleaned user comments are preprocessed with the nltk library: all characters are converted to lower case; nonstandard punctuation such as single or multiple consecutive '!', multiple consecutive ',' or ';', and single or multiple consecutive '?' is removed; words with no specific meaning, redundant words, and words prone to misunderstanding are compiled into a stop word list, and stop words in the user comments are removed; finally, all verbs are converted to the present tense.
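A hedged sketch of the collection and deduplication step with urllib and Beautiful Soup follows; the URL and the CSS selector are hypothetical placeholders, since real e-commerce review markup differs and scraping is subject to the site's terms of use:

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def fetch_review_texts(url):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # basic request header
    soup = BeautifulSoup(urlopen(req).read(), "html.parser")
    texts = [node.get_text(strip=True)
             for node in soup.select(".review-text")]          # hypothetical selector
    seen, unique = set(), []
    for t in texts:                                            # screen out repeated sentences
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique
```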
In Step 7, the similarity of each word to each of the 14 style word sets is computed. The similarity between a word and a style word set is the maximum similarity between that word and any word in the set, and the similarity between two words is the maximum lin_similarity over the sense pairs of the two words. The similarities Sim_{k,i} of all words to the 14 style word sets are aggregated into the user feedback tendency result: the style tendency value O'_i fed back by users is the sum of the similarities Sim_{k,i} of all words to that style's word set, and the 14 user-feedback style tendency values are normalized across styles so that the values O_i sum to 1.
In Step 8, the style prediction value P is compared with the style tendency feedback value O: the closer the predicted and feedback values of each style, the more successful the style presentation, and the overall success of the appearance style presentation is measured by the L1 distance between the two vectors P and O. The final appearance style evaluation F considers both the average and the distance of the style prediction value P and the style tendency feedback value O: its element F_i is the quotient of the average of O_i and P_i by their distance, normalized across all styles so that the 14 fused style evaluation values sum to 1.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. A product appearance style evaluation method based on image and text multi-modal data, characterized by comprising: constructing an image aesthetic style model, and performing semantic emotion analysis and multi-modal fusion evaluation using an image aesthetic style prediction algorithm;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis comprises: processing online user comments using the style labels of the image aesthetic style prediction algorithm, and computing the product style tendency fed back by users;
the multi-modal fusion evaluation comprises: fusing the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and providing a product evaluation result for appearance style;
the image aesthetic style prediction algorithm adopts a transfer learning strategy: it first pre-trains on the large-scale image aesthetic style classification dataset AVA using the dataset's 14 style labels, then fine-tunes on a small product image style dataset of the specific product field labeled with the same 14 style labels, and tests on an unlabeled test set;
the prediction output of the image aesthetic style model for a test image is the style prediction result of that image, a 14-dimensional vector P = (P_1, P_2, …, P_14) satisfying:
Σ_i P_i = 1
where P_i denotes the probability that the image belongs to the i-th style;
the semantic emotion analysis module processes online user comments using the 14 style labels of the image aesthetic style prediction algorithm: it finds synonyms of the 14 style labels with the synonym lookup method lemma_names of the WordNet semantic dictionary and expands each style label into a style word set, as follows:
Step 1: for the i-th style label word, look up the semantic set Synsets_i of that word in WordNet;
Step 2: for the j-th sense synset_ij in Synsets_i, find its synonym set lem_ij using the lemma_names method;
Step 3: all synonym sets lem_ij of the i-th style label word form the i-th style word set Set_i:
Set_i = ∪_j lem_ij
The semantic emotion analysis comprises the following steps: after the style tag is expanded into a style word set, online user comments of a preset product are collected from an online e-commerce platform, and a comment text is cleaned and preprocessed, wherein the method comprises the following steps:
step 1: text collection, which is to use python kurilib to automatically collect online user comments;
step 2: text cleaning, including screening out repeated sentences, screening out sentences which do not belong to a preset language, screening out sentences which only contain non-text contents, and removing words with misspelling;
and 3, step 3: text preprocessing, including converting all characters into lower case letters, eliminating punctuation marks which do not meet the standard, eliminating stop words and converting all verbs into current tenses;
the semantic emotion analysis module comprises: after the comment text is cleaned and preprocessed, computing the similarity between each word in the text and each of the 14 style word sets using the semantic similarity measure lin_similarity provided by WordNet; after all texts are processed, the similarity results of all words are aggregated into the user-feedback style tendency, where the similarity Sim_{k,i} between the k-th word w_k and the i-th style word set Set_i is:
Sim_{k,i} = max_t Sim_{k,i,t}
where Sim_{k,i,t} is the similarity between the k-th word w_k and the t-th word of the i-th style word set Set_i:
Sim_{k,i,t} = max_{m,n} lin_similarity(synset_km, synset_itn)
where synset_km is the m-th sense in the semantic set Synsets_k of the k-th word w_k, synset_itn is the n-th sense in the semantic set Synsets_it of the t-th word of the i-th style word set Set_i, and lin_similarity is the semantic similarity measure provided by WordNet;
aggregating the similarity results of all words, the normalized tendency value O_i of the i-th style fed back by users is:
O'_i = Σ_k Sim_{k,i}
O_i = O'_i / Σ_i O'_i
finally, the user-feedback style tendency of the product, O = (O_1, O_2, …, O_14), is obtained as the output of the semantic emotion analysis.
2. The method according to claim 1, wherein the image aesthetic style model comprises, connected in sequence:
- an input layer: the input is a color image scaled to 224 × 224; the input dimension is b × 224 × 224 × 3, where b is the batch size batch_size;
- 4 convolutional layers, kernel size 9 × 9, stride 1, 64 kernels, ReLU activation;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 7 × 7, stride 1, 64 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 5 × 5, stride 1, 128 kernels, ReLU activation;
- a Dropout layer, dropout probability 0.1;
- a batch normalization layer;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- 3 convolutional layers, kernel size 3 × 3, stride 1, 128 kernels, ReLU activation;
- 1 pooling layer, using maximum pooling, pooling size 2 × 2;
- a Flatten layer, unrolling the b × 14 × 14 × 128 feature map into vectors of length 14 × 14 × 128 (one per sample);
- a fully connected layer outputting the style classification result, with 14 output nodes corresponding to the 14 style labels; the activation function is Softmax.
3. The method for evaluating the appearance style of a product based on image and text multi-modal data according to claim 1, wherein the loss function of the image aesthetic style model is a minimized cross entropy loss function, the Adam optimizer is used for weight update, and the learning rate is set to 0.0001.
4. The product appearance style evaluation method based on image and text multi-modal data of claim 1, wherein the multi-modal fusion evaluation comprises: comparing the style prediction value P output by the image aesthetic style prediction algorithm with the style tendency feedback value O output by the semantic emotion analysis module;
the absolute difference |P_i - O_i| between corresponding elements of P and O represents, from the viewpoint of the i-th aesthetic style, the size of the difference between the information conveyed by the product image and the user feedback; the sum over all style labels, Σ_i |P_i - O_i|, represents the difference between the overall aesthetic style of the product image and the style fed back by users; the indices |P_i - O_i| and Σ_i |P_i - O_i| assist in evaluating how successfully the product presents its style: the larger the index value, the larger the difference between the product image and the user feedback, and the less successful the style presentation.
5. The product appearance style evaluation method based on image and text multi-modal data of claim 1, wherein the multi-modal fusion evaluation comprises: comparing and fusing the style prediction value P output by the image aesthetic style prediction algorithm with the style tendency feedback value O output by the semantic emotion analysis to obtain a comprehensive product appearance style evaluation F = (F_1, F_2, …, F_14), where the comprehensive evaluation F_i of the i-th aesthetic style is the fusion of P_i and O_i:
F'_i = (P_i + O_i) / (2 * |P_i - O_i|)
F_i = F'_i / Σ_i F'_i
where P_i denotes the probability that the image belongs to the i-th style and O_i denotes the normalized tendency value of the i-th style from user feedback.
6. A product appearance style evaluation system based on image and text multi-modal data, characterized in that it adopts the product appearance style evaluation method based on image and text multi-modal data of any one of claims 1 to 5, and comprises an image aesthetic style model, an image aesthetic style prediction algorithm, a semantic emotion analysis module, and a multi-modal fusion evaluation module;
the image aesthetic style model is a multilayer convolutional neural network model that takes a color image as input and outputs a multi-dimensional image style classification;
the image aesthetic style prediction algorithm uses pre-training and transfer learning to predict the style type of the product image;
the semantic emotion analysis module processes online user comments using the style labels of the image aesthetic style prediction algorithm and computes the product style tendency fed back by users;
the multi-modal fusion evaluation module fuses the product style prediction output by the image aesthetic style prediction algorithm with the product style feedback output by the semantic emotion analysis, and provides a product evaluation result for appearance style.
CN202110241232.6A 2021-03-04 2021-03-04 Product appearance style evaluation method and system based on image and text multi-modal data Active CN112862569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241232.6A CN112862569B (en) 2021-03-04 2021-03-04 Product appearance style evaluation method and system based on image and text multi-modal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110241232.6A CN112862569B (en) 2021-03-04 2021-03-04 Product appearance style evaluation method and system based on image and text multi-modal data

Publications (2)

Publication Number Publication Date
CN112862569A CN112862569A (en) 2021-05-28
CN112862569B true CN112862569B (en) 2023-04-07

Family

ID=75991746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241232.6A Active CN112862569B (en) 2021-03-04 2021-03-04 Product appearance style evaluation method and system based on image and text multi-modal data

Country Status (1)

Country Link
CN (1) CN112862569B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204624B (en) * 2021-06-07 2022-06-14 吉林大学 Multi-feature fusion text emotion analysis model and device
CN116611131B (en) * 2023-07-05 2023-12-26 大家智合(北京)网络科技股份有限公司 Automatic generation method, device, medium and equipment for packaging graphics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971200A (en) * 2017-03-13 2017-07-21 天津大学 A kind of iconic memory degree Forecasting Methodology learnt based on adaptive-migration
CN109902912A (en) * 2019-01-04 2019-06-18 中国矿业大学 A kind of personalized image aesthetic evaluation method based on character trait

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685434B2 (en) * 2016-03-30 2020-06-16 Institute Of Automation, Chinese Academy Of Sciences Method for assessing aesthetic quality of natural image based on multi-task deep learning
CN108596051A (en) * 2018-04-04 2018-09-28 浙江大学城市学院 A kind of intelligent identification Method towards product style image
CN111950655B (en) * 2020-08-25 2022-06-14 福州大学 Image aesthetic quality evaluation method based on multi-domain knowledge driving

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971200A (en) * 2017-03-13 2017-07-21 天津大学 A kind of iconic memory degree Forecasting Methodology learnt based on adaptive-migration
CN109902912A (en) * 2019-01-04 2019-06-18 中国矿业大学 A kind of personalized image aesthetic evaluation method based on character trait

Also Published As

Publication number Publication date
CN112862569A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112184525B (en) System and method for realizing intelligent matching recommendation through natural semantic analysis
CN107633007B (en) Commodity comment data tagging system and method based on hierarchical AP clustering
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN110458627B (en) Commodity sequence personalized recommendation method for dynamic preference of user
CN109923557A (en) Use continuous regularization training joint multitask neural network model
CN112001185A (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN112100344A (en) Financial field knowledge question-answering method based on knowledge graph
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN110532386A (en) Text sentiment classification method, device, electronic equipment and storage medium
CN112905739B (en) False comment detection model training method, detection method and electronic equipment
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN112862569B (en) Product appearance style evaluation method and system based on image and text multi-modal data
CN111259153A (en) Attribute-level emotion analysis method of complete attention mechanism
CN111353044A (en) Comment-based emotion analysis method and system
Rohman et al. Natural Language Processing on Marketplace Product Review Sentiment Analysis
CN114218392B (en) Futures question-answer oriented user intention identification method and system
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
Liu et al. A deep learning-based sentiment analysis approach for online product ranking with probabilistic linguistic term sets
CN111400449A (en) Regular expression extraction method and device
CN111563361A (en) Text label extraction method and device and storage medium
CN117235253A (en) Truck user implicit demand mining method based on natural language processing technology
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
CN111859910B (en) Word feature representation method for semantic role recognition and fusing position information
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN114238577B (en) Multi-task learning emotion classification method integrating multi-head attention mechanism

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant