CN115982630A - Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated - Google Patents

Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated

Info

Publication number
CN115982630A
CN115982630A (application number CN202310208508.XA)
Authority
CN
China
Prior art keywords
commodity
classifiers
idf
vocabulary
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310208508.XA
Other languages
Chinese (zh)
Inventor
王静
李燕北
朱俊
夏竟翔
戴智鑫
闫晨光
沈达峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ouye Industrial Products Co ltd
Original Assignee
Ouye Industrial Products Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ouye Industrial Products Co ltd filed Critical Ouye Industrial Products Co ltd
Priority to CN202310208508.XA
Publication of CN115982630A
Legal status: Pending

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a system, equipment and a medium for intelligent commodity classification with multiple cooperating classifiers, wherein the method comprises the following steps: step S1: acquiring a training set with uniformly distributed data quantity; step S2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results; step S3: performing feature coding on each segmented word and calculating its TF-IDF value as the coding weight of that word; step S4: taking the weighted combination of the coded features of all segmented words as the feature code of the commodity; step S5: dividing all data into a training set and a test set for the classifiers, and training a plurality of classifiers; step S6: calculating the weight of each classifier and computing the weighted sum of the classifier outputs; step S7: taking the category with the highest score as the classification result. The invention can determine the category of a commodity in the platform commodity management system from its description information, providing support for functions such as digital commodity management and commodity recommendation.

Description

Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated
Technical Field
The invention relates to the technical field of commodity classification, in particular to a method, a system, equipment and a medium for intelligently classifying commodities by means of cooperation of multiple classifiers.
Background
The enterprise e-commerce platform is a virtual network space for conducting commercial activities on the Internet and a management environment that ensures the smooth operation of that commerce; it is an important venue for coordinating and integrating the orderly, coherent and efficient flow of information, goods and funds. Enterprises and merchants can make full use of the shared resources provided by the e-commerce platform, such as the network infrastructure, payment platform, security platform and management platform, to carry out their own commercial activities effectively and at low cost.
The prior art has the following disadvantages: the goods on an e-commerce platform cover a wide range and the classification system is complex, so non-standard or missing category entries filled in by sellers occur easily; the commodity information uploaded by different sellers is heterogeneous and incomplete, and general classification methods perform poorly.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method, a system, equipment and a medium for intelligently classifying commodities by cooperating a plurality of classifiers.
The scheme of the intelligent commodity classification method, system, equipment and medium with multiple cooperating classifiers provided by the invention is as follows:
in a first aspect, a method for intelligently classifying commodities by cooperating multiple classifiers is provided, and the method comprises the following steps:
step S1: acquiring a training set with uniformly distributed data quantity;
step S2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results;
step S3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word;
step S4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity;
step S5: dividing all data into a training set and a test set for training classifiers, and respectively training a plurality of classifiers;
step S6: calculating the weight value of each classifier, and weighting and summing the results of each classifier;
step S7: and taking the category with the highest score as a classification result.
Preferably, the calculation of TF-IDF in step S3 includes: TF and IDF;
wherein TF represents the frequency with which a vocabulary term occurs in a document; IDF is a measure of the general importance of a term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification.
Preferably, the TF-IDF of the i-th vocabulary term t_i with respect to the j-th document d_j is calculated as follows:

TF-IDF_{i,j} = (n_{ij} / Σ_{k=1}^{K} n_{kj}) · log(S / |I|)

where n_{ij} is the number of times the i-th term t_i appears in the j-th document d_j; S is the total number of documents; K is the number of terms in the j-th document; and I is the set of documents that contain t_i.
Preferably, the step S6 adopts AIC information criterion:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th classifier;

the weight of the k-th classifier is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

Let the probabilities of classifying sample i into category j given by the above K classifiers be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)}.

The combined score P_i^{(j)} of classifying the i-th sample into category j after weighting the classifiers is then:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)}

The i-th sample takes argmax_j P_i^{(j)} as its classification result.
In a second aspect, a system for intelligently classifying commodities with multiple classifiers in cooperation is provided, and the system comprises:
a module M1: acquiring a training set with uniformly distributed data quantity;
a module M2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results;
a module M3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word;
a module M4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity;
a module M5: dividing all data into a training set and a test set for training classifiers, and respectively training a plurality of classifiers;
a module M6: calculating the weight value of each classifier, and weighting and summing the results of each classifier;
a module M7: and taking the category with the highest score as a classification result.
Preferably, the calculation of TF-IDF in said module M3 comprises: TF and IDF;
wherein TF represents the frequency with which a vocabulary term occurs in a document; IDF is a measure of the general importance of a term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification.
Preferably, the TF-IDF of the i-th vocabulary term t_i with respect to the j-th document d_j is calculated as follows:

TF-IDF_{i,j} = (n_{ij} / Σ_{k=1}^{K} n_{kj}) · log(S / |I|)

where n_{ij} is the number of times the i-th term t_i appears in the j-th document d_j; S is the total number of documents; K is the number of terms in the j-th document; and I is the set of documents that contain t_i.
Preferably, said module M6 uses the AIC information criterion:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th classifier;

the weight of the k-th classifier is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

Let the probabilities of classifying sample i into category j given by the above K classifiers be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)}.

The combined score P_i^{(j)} of classifying the i-th sample into category j after weighting the classifiers is then:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)}

The i-th sample takes argmax_j P_i^{(j)} as its classification result.
In a third aspect, a computer readable storage medium is provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the intelligent classification method for goods by cooperation of multiple classifiers.
In a fourth aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the electronic device implements the steps in the intelligent classification method for goods by using the cooperation of multiple classifiers.
Compared with the prior art, the invention has the following beneficial effects:
1. According to the invention, commodities are automatically classified in a unified and standard manner based on the commodity description information, which reduces labor cost;
2. The invention classifies commodities using only two parts of information, the commodity name and the model specification, and improves the classification effect through the weighted combination of multiple models.
Other advantages of the present invention will be described in the detailed description, and those skilled in the art will understand the technical features and technical solutions presented in the description.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram illustrating the TF-IDF cumulative contribution;
FIG. 3 is a diagram illustrating TF-IDF growth rate.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will aid those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any manner. It should be noted that various changes and modifications that are obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The embodiment of the invention provides an intelligent commodity classification method with multiple classifiers in cooperation, which is used for judging the class of commodities in a platform commodity management system according to description information of the commodities when a purchasing party or a merchant releases commodity information on an E-commerce platform, and providing support for functions of digital management, commodity recommendation and the like of the commodities. Referring to fig. 1, the method specifically includes the following steps:
step S1: and acquiring a training set with uniformly distributed data quantity.
Step S2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results.
Specifically, data preprocessing: most commodity classification data sets have an uneven number of samples per class, so data enhancement is performed by methods such as synonym replacement to expand the classes with few samples and obtain a training set with a relatively uniform data distribution. Word segmentation is then performed on the description information of each commodity in the training set to obtain phrases, the basic units of semantic analysis. The segmentation results contain some words that contribute little to classification, for example type specifications, application occasions and materials; these words are treated as stop words so as to further clean the data.
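As an illustration of this preprocessing step, a minimal sketch in Python follows. It assumes the jieba library for Chinese word segmentation; the stop-word list, synonym table and function names are hypothetical placeholders rather than the patent's actual implementation.

```python
import random
import jieba  # widely used Chinese word-segmentation library (an assumed choice, not mandated by the patent)

# Illustrative stop-word list and synonym table; in practice these are supplied by the user.
STOP_WORDS = {"规格", "型号", "材质"}
SYNONYMS = {"扳手": ["扳子"]}

def segment(description: str) -> list[str]:
    """Segment a commodity description and drop stop words."""
    return [w for w in jieba.lcut(description) if w.strip() and w not in STOP_WORDS]

def augment(tokens: list[str]) -> list[str]:
    """Create one augmented sample by synonym replacement."""
    return [random.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in tokens]

def balance(dataset: dict[str, list[list[str]]], target: int) -> dict[str, list[list[str]]]:
    """Expand classes with few samples until each class holds roughly `target` samples."""
    for samples in dataset.values():
        while len(samples) < target:
            samples.append(augment(random.choice(samples)))
    return dataset
```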
Step S3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word.
Step S4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity.
Specifically, feature engineering: the segmentation results of the textual description of a commodity are converted into feature vectors that a machine can process by means of word-vector coding. The coding method can be word2vec, deep-learning-based encoders, or similar. Considering that some words in the segmented phrases contribute little to classification and may even interfere with it, and in order to further remove redundant information, reduce algorithm complexity and improve efficiency, the invention calculates the TF-IDF value of each word to obtain its importance for classification, retains only the words with a larger influence on classification, takes the TF-IDF value of each word as its coding weight, and takes the weighted combination of all word codes of the commodity as the feature code of the commodity.
The calculation of TF-IDF mainly comprises two parts: TF (term frequency) and IDF (inverse document frequency). TF represents the frequency with which a vocabulary term occurs in a document, and IDF is a measure of the general importance of the term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification. The TF-IDF of the i-th term t_i with respect to the j-th document d_j is calculated as follows:
TF-IDF_{i,j} = (n_{ij} / Σ_{k=1}^{K} n_{kj}) · log(S / |I|)

where n_{ij} is the number of times the i-th term t_i appears in the j-th document d_j; S is the total number of documents; K is the number of terms in the j-th document; and I is the set of documents that contain t_i.
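A minimal sketch of this feature-encoding step is given below. It implements the TF-IDF formula above together with the weighted sum of step S4, assuming pre-trained word vectors (for example from a word2vec model) are available; `word_vectors` and the helper names are assumptions, not part of the patent.

```python
import math
import numpy as np

def tf_idf(term_counts: list[dict[str, int]]) -> list[dict[str, float]]:
    """term_counts[j] maps each term to its count n_ij in document (category) j."""
    S = len(term_counts)                        # total number of documents
    df: dict[str, int] = {}                     # |I|: number of documents containing each term
    for counts in term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    weights = []
    for counts in term_counts:
        total = sum(counts.values())            # sum over k of n_kj
        weights.append({t: (n / total) * math.log(S / df[t]) for t, n in counts.items()})
    return weights

def encode_commodity(tokens: list[str], word_vectors: dict[str, np.ndarray],
                     tfidf: dict[str, float]) -> np.ndarray:
    """Feature code of a commodity: sum of word vectors weighted by their TF-IDF values."""
    dim = len(next(iter(word_vectors.values())))
    code = np.zeros(dim)
    for t in tokens:
        if t in word_vectors and t in tfidf:
            code += tfidf[t] * word_vectors[t]
    return code
```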
For the classification task of the invention, each category is treated as one document. After the TF-IDF values are calculated, the TF-IDF of every term is summed over the rows, the terms are sorted, and the cumulative contribution of the terms is computed column by column; the results on our own training samples are shown in FIG. 2 and FIG. 3. The 5,000 terms with the highest classification importance, whose cumulative contribution is about 80 percent, are selected for classification prediction, which achieves further dimensionality reduction.
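The dimensionality-reduction step just described can be sketched as follows, assuming the term-by-category TF-IDF values have been arranged in a matrix; the 5,000 cut-off comes from the description above, while the matrix layout and names are illustrative assumptions.

```python
import numpy as np

def select_vocabulary(tfidf_matrix: np.ndarray, vocab: list[str], top_n: int = 5000) -> list[str]:
    """tfidf_matrix has shape (n_categories, n_terms); each category is treated as one document."""
    importance = tfidf_matrix.sum(axis=0)        # total TF-IDF of each term over all categories
    order = np.argsort(importance)[::-1]         # terms sorted by descending importance
    cumulative = np.cumsum(importance[order]) / importance.sum()
    kept = min(top_n, len(vocab))
    print(f"cumulative contribution of top {kept} terms: {cumulative[kept - 1]:.2%}")
    return [vocab[i] for i in order[:kept]]
```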
Step S5: and dividing all commodity feature codes into a training set and a testing set for training the classifiers, and respectively training a plurality of classifiers.
Step S6: and calculating the weight value of each classifier, and weighting and summing the results of each classifier.
Step S7: and taking the category with the highest score as a classification result.
Specifically, multiple classifiers act in concert. All data are divided into a training set and a test set at a ratio of 4:1, and several classifiers with good classification performance, such as SVM, XGBoost, Random Forest, AdaBoost and DNN multi-class classifiers, are trained separately. The feature codes and the true labels of the samples are fed into these classifier models, and the hyper-parameters of each model are tuned so that every model achieves a good classification effect.
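A sketch of this training step using scikit-learn and xgboost is shown below; `X` (commodity feature codes) and `y` (category labels) are assumed to have been produced by the previous steps, and the hyper-parameter values are placeholders rather than the ones actually tuned in the patent.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# X: array of commodity feature codes, y: integer category labels (prepared by the previous steps).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 4:1 split

classifiers = {
    "svm": SVC(probability=True),   # probability=True so per-class scores can be combined later
    "xgboost": XGBClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "adaboost": AdaBoostClassifier(),
    "dnn": MLPClassifier(hidden_layer_sizes=(128, 64)),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```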
In order to obtain an ideal classification result, and considering that the overall classification results of the different models are similar while their precision on specific classes differs, the results of the multiple classifiers are combined by model averaging to exploit the advantages of the different classifiers. The invention uses a model-averaging method to compute a weighted average of the prediction results of the above algorithms with certain weights. The weights can be obtained with common information criteria such as AIC or BIC: AIC selects a good model from a prediction perspective, while BIC selects the model that best fits the existing data from a fitting perspective. To obtain a good prediction effect, the invention takes AIC as an example:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th model;

the weight of the k-th model is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

Let the probabilities of classifying sample i into category j given by the above K algorithms be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)}.

After model weighting, the combined score P_i^{(j)} of classifying the i-th sample into category j is:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)}

The i-th sample takes argmax_j P_i^{(j)} as its classification result.
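The model-averaging step above could be implemented along the following lines. It assumes each trained classifier exposes `predict_proba`, and it approximates the log-likelihood term of the AIC from held-out data via `sklearn.metrics.log_loss`, which is one reasonable reading of the formula rather than the patent's exact computation; the per-model parameter counts must be supplied by the user.

```python
import numpy as np
from sklearn.metrics import log_loss

def aic(log_likelihood: float, n_params: int) -> float:
    """AIC_k = -2 log l_k + 2 lambda_k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def akaike_weights(aics: np.ndarray) -> np.ndarray:
    """w_k = exp(-AIC_k/2) / sum_m exp(-AIC_m/2); subtracting the minimum leaves the weights unchanged."""
    rel = aics - aics.min()
    w = np.exp(-0.5 * rel)
    return w / w.sum()

def combined_predict(models: list, weights: np.ndarray, X: np.ndarray) -> np.ndarray:
    """P_i^(j) = sum_k w_k * p_ik^(j); returns argmax_j for every sample i."""
    scores = sum(w * m.predict_proba(X) for m, w in zip(models, weights))
    return scores.argmax(axis=1)

# Example usage (n_params_list is a user-supplied assumption for each model):
# aics = np.array([aic(-log_loss(y_test, m.predict_proba(X_test), normalize=False), p)
#                  for m, p in zip(classifiers.values(), n_params_list)])
# weights = akaike_weights(aics)
# y_pred = combined_predict(list(classifiers.values()), weights, X_test)
```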
The invention also provides an intelligent commodity classification system with multiple cooperating classifiers, which can be realized by executing the flow steps of the intelligent commodity classification method with multiple cooperating classifiers; that is, those skilled in the art may understand the intelligent commodity classification method with multiple cooperating classifiers as a preferred embodiment of the intelligent commodity classification system with multiple cooperating classifiers.
A module M1: and acquiring a training set with uniformly distributed data quantity.
A module M2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results.
Specifically, data preprocessing: most commodity classification data sets have an uneven number of samples per class, so data enhancement is performed by methods such as synonym replacement to expand the classes with few samples and obtain a training set with a relatively uniform data distribution. Word segmentation is then performed on the description information of each commodity in the training set to obtain phrases, the basic units of semantic analysis. The segmentation results contain some words that contribute little to classification, for example type specifications, application occasions and materials; these words are treated as stop words so as to further clean the data.
A module M3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word.
A module M4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity.
Specifically, feature engineering: the segmentation results of the textual description of a commodity are converted into feature vectors that a machine can process by means of word-vector coding. The coding method can be word2vec, deep-learning-based encoders, or similar. Considering that some words in the segmented phrases contribute little to classification and may even interfere with it, and in order to further remove redundant information, reduce algorithm complexity and improve efficiency, the invention calculates the TF-IDF value of each word to obtain its importance for classification, retains only the words with a larger influence on classification, takes the TF-IDF value of each word as its coding weight, and takes the weighted combination of all word codes of the commodity as the feature code of the commodity.
The calculation of TF-IDF mainly comprises two parts: TF (term frequency) and IDF (inverse document frequency). TF represents the frequency with which a vocabulary term occurs in a document, and IDF is a measure of the general importance of the term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification. The TF-IDF of the i-th term t_i with respect to the j-th document d_j is calculated as follows:

TF-IDF_{i,j} = (n_{ij} / Σ_{k=1}^{K} n_{kj}) · log(S / |I|)

where n_{ij} is the number of times the i-th term t_i appears in the j-th document d_j; S is the total number of documents; K is the number of terms in the j-th document; and I is the set of documents that contain t_i.
For the classification task of the invention, each category is treated as one document. After the TF-IDF values are calculated, the TF-IDF of every term is summed over the rows, the terms are sorted, and the cumulative contribution of the terms is computed column by column; the results on our own training samples are shown in FIG. 2 and FIG. 3. The 5,000 terms with the highest classification importance, whose cumulative contribution is about 80 percent, are selected for classification prediction, which achieves further dimensionality reduction.
A module M5: all data are divided into a training set and a test set for training classifiers, and a plurality of classifiers are trained respectively.
A module M6: and calculating the weight value of each classifier, and weighting and summing the results of each classifier.
A module M7: and taking the category with the highest score as a classification result.
Specifically, multiple classifiers function in concert. All data are divided into a training set and a test set at a ratio of 4:1, and several classifiers with good classification performance, such as SVM, XGBoost, Random Forest, AdaBoost and DNN multi-class classifiers, are trained separately. The feature codes and the true labels of the samples are fed into the models, and the hyper-parameters of the models are tuned so that every model achieves a good classification effect.
In order to obtain an ideal classification result, and considering that the overall classification results of the different models are similar while their precision on specific classes differs, the results of the multiple classifiers are combined by model averaging to exploit the advantages of the different classifiers. The invention uses a model-averaging method to compute a weighted average of the prediction results of the above algorithms with certain weights. The weights can be obtained with common information criteria such as AIC or BIC. Taking AIC as an example:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th model;

the weight of the k-th model is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

Let the probabilities of classifying sample i into category j given by the above K algorithms be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)}.

After model weighting, the combined score P_i^{(j)} of classifying the i-th sample into category j is:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)}

The i-th sample takes argmax_j P_i^{(j)} as its classification result.
The embodiment of the invention provides an intelligent commodity classification method, system, equipment and medium with multiple classifiers in cooperation, which can automatically perform unified and standard classification through commodity description information, reduce labor cost, classify commodities only by means of commodity names and model specifications, and improve the classification effect of the method in a mode of weighted combination of multiple models.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A commodity intelligent classification method with multiple classifiers in cooperation is characterized by comprising the following steps:
step S1: acquiring a training set with uniformly distributed data quantity;
step S2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results;
step S3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word;
step S4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity;
step S5: dividing the feature codes of all commodities into a training set and a test set for training classifiers, and respectively training a plurality of classifiers;
step S6: calculating the weight value of each classifier, and weighting and summing the results of each classifier;
step S7: and taking the category with the highest score as a classification result.
2. The intelligent classification method for commodities with multiple classifiers in cooperation according to claim 1, wherein the calculation of the TF-IDF in the step S3 comprises: TF and IDF;
wherein TF represents the frequency with which a vocabulary term occurs in a document; IDF is a measure of the general importance of a term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification.
3. The intelligent commodity classification method based on cooperation of multiple classifiers according to claim 2, wherein the TF-IDF of the i-th vocabulary term t_i with respect to the j-th document d_j is calculated as follows:

TF-IDF_{i,j} = (n_{ij} / Σ_{k=1}^{K} n_{kj}) · log(S / |I|)

where n_{ij} is the number of times the i-th term t_i appears in the j-th document d_j; S is the total number of documents; K is the number of terms in the j-th document; and I is the set of documents that contain t_i.
4. The intelligent classification method for commodities with multiple classifiers in cooperation according to claim 1, wherein the step S6 adopts AIC information criterion:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th classifier;

the weight of the k-th classifier is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

let the probabilities of classifying sample i into category j given by the K classifiers be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)};

the combined score P_i^{(j)} of classifying the i-th sample into category j after weighting the classifiers is:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)};

the i-th sample takes argmax_j P_i^{(j)} as its classification result.
5. An intelligent commodity classification system with multiple classifiers in cooperation is characterized by comprising:
a module M1: acquiring a training set with uniformly distributed data quantity;
a module M2: performing word segmentation and stop-word removal on the description information of each commodity in the training set to obtain segmentation results;
a module M3: after word segmentation, performing feature coding on each segmented word, calculating the TF-IDF value of each segmented word, and taking that TF-IDF value as the coding weight of the word;
a module M4: taking the product of each segmented word's feature code and its weight value as the weighted feature of that word under the category to which it belongs, and taking the sum of the weighted features of all segmented words in the commodity as the feature code of the commodity;
a module M5: dividing all data into a training set and a test set for training classifiers, and respectively training a plurality of classifiers;
a module M6: calculating the weight value of each classifier, and weighting and summing the results of each classifier;
a module M7: and taking the category with the highest score as a classification result.
6. The intelligent classification system for commodities with multiple cooperative classifiers according to claim 5, wherein the calculation of TF-IDF in the module M3 comprises: TF and IDF;
wherein TF represents the frequency with which a vocabulary term occurs in a document; IDF is a measure of the general importance of a term: the fewer the documents that contain a term, the larger its IDF, and the better its category-distinguishing capability. If a term appears with a high frequency TF in one document and rarely appears in other documents, the term is considered to have good category-distinguishing capability and is suitable for classification.
7. The intelligent classification system for commodities with multiple classifiers in cooperation as claimed in claim 6, wherein the ith vocabulary t i With respect to the jth document d j The TF-IDF of (A) is calculated as follows:
Figure FDA0004111728570000026
wherein n is ij Indicates the ith word t i Appear in the jth document d j The number of times of (c); s is the total number of the documents; k represents the number of words in the jth document; i represents a group containing t i Has a collection of documents.
8. The intelligent classification system for commodities with multiple coordinated classifiers according to claim 5, wherein said module M6 employs AIC information criterion:
AIC_k = -2 log l_k + 2 λ_k

where l_k and λ_k are, respectively, the maximized likelihood and the number of parameters of the k-th classifier;

the weight of the k-th classifier is:

w_k = exp(-AIC_k / 2) / Σ_{m=1}^{K} exp(-AIC_m / 2)

let the probabilities of classifying sample i into category j given by the K classifiers be p_{i1}^{(j)}, p_{i2}^{(j)}, ..., p_{iK}^{(j)};

the combined score P_i^{(j)} of classifying the i-th sample into category j after weighting the classifiers is:

P_i^{(j)} = Σ_{k=1}^{K} w_k · p_{ik}^{(j)};

the i-th sample takes argmax_j P_i^{(j)} as its classification result.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the intelligent classification method for commodities with the cooperation of a plurality of classifiers according to any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the intelligent classification method for goods by cooperation of a plurality of classifiers as claimed in any one of claims 1 to 4.
CN202310208508.XA 2023-03-06 2023-03-06 Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated Pending CN115982630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310208508.XA CN115982630A (en) 2023-03-06 2023-03-06 Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310208508.XA CN115982630A (en) 2023-03-06 2023-03-06 Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated

Publications (1)

Publication Number Publication Date
CN115982630A true CN115982630A (en) 2023-04-18

Family

ID=85974463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310208508.XA Pending CN115982630A (en) 2023-03-06 2023-03-06 Intelligent commodity classification method, system, equipment and medium with multiple classifiers cooperated

Country Status (1)

Country Link
CN (1) CN115982630A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination