CN112463971B - E-commerce commodity classification method and system based on hierarchical combination model

Info

Publication number: CN112463971B
Application number: CN202110092600.5A
Authority: CN
Inventors: 费岸
Original assignee: Hangzhou Mengma Technology Co ltd
Current assignee: Hangzhou mengma Technology Co.,Ltd.
Priority date: 2020-09-15
Filing date: 2021-01-25
Publication date: 2021-05-28
Anticipated expiration: 2041-01-25
Also published as: CN112463971A

Abstract

The invention belongs to the technical field of the Internet and discloses an e-commerce commodity classification method and system based on a hierarchical combination model. The method and system effectively avoid confusion among commodities that share a name but belong to different categories, and also solve the misclassification caused by stacking many nouns in a title.

Description

E-commerce commodity classification method and system based on hierarchical combination model

Technical Field

The invention belongs to the technical field of the Internet, relates to commercial data processing technology, and in particular relates to a method and a system for automatically classifying the commodities of an e-commerce platform by computer.

Background

At present the e-commerce field is still growing, and the data it generates can be said to grow exponentially. New commodities emerge endlessly, and the data volume is too large for manual classification to be practical; the classification of e-commerce commodities must therefore be completed by a computer classification system.

The computer classification system of the invention is not a management-type system such as those disclosed in the patent applications with publication numbers CN1920831A and CN102915498A. The task of a management-type system is to establish a commodity classification management scheme based on set classification rules; it does not itself perform the action of classifying a particular commodity. When a user adds a commodity, the user must place it in a suitable class according to the classification table given by the system and mark its attributes as the system requires. The classification system of the invention is an execution-type system, applied to an e-commerce platform that has already established a complete classification management scheme based on set classification rules. It performs the action of assigning commodities that users have left unclassified, or have classified improperly, to suitable categories.

The classification method adopted by existing computer classification systems is keyword matching based on a one-dimensional model. The title of the e-commerce commodity to be classified (the title the user sets for the commodity on the platform) is first segmented into words; each word segment is then analyzed semantically, interfering words are removed, and the keyword that represents the commodity's category name is extracted from the title. For example, for the commodity title "high-capacity brand A No. 5 battery", word segmentation and semantic analysis remove the modifiers "high-capacity", "brand A" and "No. 5", and "battery" is extracted as the keyword of the commodity. The keyword is then matched against the target keywords in a classification lexicon, each of which is marked with a category label, and the category marked on the successfully matched target keyword is taken as the classification of the commodity.
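The one-dimensional scheme described above can be sketched as follows. This is only an illustration: the modifier list, the lexicon contents and the function name `classify_flat` are assumptions that do not come from the cited systems, and real semantic analysis is replaced by a simple lookup.

```python
# Hypothetical data for illustration only; not taken from any cited system.
MODIFIERS = {"high-capacity", "brand-a", "no.5", "car"}
CATEGORY_LEXICON = {"battery": "Batteries", "charger": "Chargers"}


def classify_flat(title_tokens):
    """Return the category of the first non-modifier token found in the lexicon."""
    for token in title_tokens:
        word = token.lower()
        if word in MODIFIERS:
            continue                      # drop interfering / modifying words
        if word in CATEGORY_LEXICON:
            return CATEGORY_LEXICON[word]
    return None                           # no keyword matched


print(classify_flat(["High-capacity", "Brand-A", "No.5", "battery"]))
print(classify_flat(["car", "battery"]))
```

Both titles land in the same flat category even though the goods serve different industries.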

The keyword matching classification method based on the one-dimensional classification model has two serious problems.

The first problem is that this method places e-commerce commodities that belong to different categories and serve different industries, but share a category name, into the same category, which confuses the classification. For example, button batteries (electronic product accessories), mobile phone batteries (mobile phone accessories), electric bicycle batteries (electric bicycle accessories), ordinary No. 1 to No. 7 batteries (electric appliance accessories) and automobile batteries (automobile accessories) are all placed in the category "battery". A user who wants to buy an ordinary No. 5 battery then has to search through all of these batteries.

The patent application with publication number CN102915498A (cited document 1) proposes a classification rule that distinguishes different commodities under the same category by their commodity attributes. This effectively avoids the confusion, but the attributes of each commodity must be set manually, so automatic classification by computer cannot be realized.

The second problem is that when the title of an e-commerce commodity is stacked with many noun keywords, keyword extraction errors occur easily and eventually cause classification errors. For example, the commodity title "B brand button cell CR2016 remote controller electronic scale car key body weight scale 5 particle thing allies oneself with corresponds the battery" contains numerous nouns such as "button", "battery", "remote controller", "electronic scale", "car key" and "body weight scale". If the keyword is selected by its frequency of appearance on the e-commerce platform, one of "button", "remote controller", "electronic scale" and "body weight scale" is very likely to be extracted as the keyword, causing a classification error.

Disclosure of Invention

To address the article confusion and classification errors that the existing keyword matching classification method based on a one-dimensional model tends to cause, the invention provides an e-commerce commodity classification method based on a hierarchical combination model. The method can identify relatively accurately the large category to which an e-commerce commodity belongs and the industry in which it is applied, and effectively avoids article confusion and classification errors.

To achieve the above object, the invention provides a first classification method: an e-commerce commodity classification method based on a hierarchical combination model, characterized in that the classification lexicon of each level consists of a plurality of keywords, each upper-level keyword (a keyword contained in an upper-level classification lexicon) corresponds to one lower-level classification lexicon, and the classification of an e-commerce commodity is determined by a hierarchical classification label formed by combining the classification labels of all levels in order;

when a commodity is classified, its word-segmented title (a set of several word segments) is matched step by step against the multi-level classification lexicons, from the top level to the bottom level; the unique lowest-level keyword finally matched is used as the commodity's lowest-level classification label, the upper-level keywords that correspond to it, level by level up to the top, are used as the classification labels of the respective levels, and the labels of all levels are then combined in hierarchical order to form the hierarchical classification label of the commodity.

Taking a three-layer combination model as an example, the primary classification lexicon corresponds to the large class to which a commodity belongs, the secondary classification lexicon to the industry in which the commodity is applied, and the tertiary classification lexicon to the commodity's own category. After a commodity is classified by this method, its hierarchical classification label comprises three levels: the first-level label is the large-class label, the second-level label is the industry label, and the third-level label is the category label. The method thus avoids confusion among commodities that share a name but belong to different categories, and also solves the misclassification caused by stacking nouns in the title.

Further, the rule of the step-by-step matching is as follows:

if the title matches a unique keyword in the upper-level classification lexicon, then at the next level the title is matched against the lower-level classification lexicon corresponding to that upper-level keyword;

if the title matches zero keywords in the upper-level classification lexicon, then at the next level the title is matched against all the lower-level classification lexicons;

if the title matches several keywords in the upper-level classification lexicon, the weight of each matched keyword relative to the title is calculated, the keyword with the maximum weight is selected as the result of matching the title against that upper-level lexicon, and at the next level the title is matched against the lower-level classification lexicon corresponding to that keyword;

if the title matches several keywords in the lowest-level classification lexicon, the weight of each matched keyword relative to the title is calculated, and the keyword with the maximum weight is selected as the lowest-level classification label of the commodity;

if the title matches a unique keyword in the lowest-level classification lexicon, that keyword is taken as the lowest-level classification label of the commodity;

if the title matches zero keywords in the lowest-level classification lexicon, the similarity between the title and every keyword in the corresponding lowest-level classification lexicon is calculated, and the keyword with the highest similarity is taken as the lowest-level classification label of the commodity, provided that this similarity is greater than a set threshold; if the highest similarity is smaller than the set threshold, manual classification is requested and the classification lexicon is expanded.
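The matching rules above might be sketched as below, assuming the multi-level lexicon is held as nested dictionaries in which each keyword maps to its lower-level lexicon. The names `classify_by_levels` and `earlier_is_heavier`, the sample `LEXICON` and the default weight rule are illustrative assumptions; the similarity fallback for zero matches at the lowest level is left out and simply returns `None`.

```python
def earlier_is_heavier(tokens, keyword):
    """Assumed weight rule: a keyword that appears earlier in the title weighs more."""
    return -tokens.index(keyword)


def classify_by_levels(tokens, lexicon, depth, weight=earlier_is_heavier):
    """Match title tokens against a nested keyword lexicon, top level first.

    lexicon maps keyword -> lower-level lexicon ({} at the lowest level).
    Returns the tuple of labels from top to lowest level, or None.
    """
    candidates = [((), lexicon)]             # (labels so far, lexicon to search)
    for level in range(depth):
        matched = [(path + (kw,), sub)
                   for path, lex in candidates
                   for kw, sub in lex.items() if kw in tokens]
        if len(matched) > 1:                 # several hits: keep the heaviest one
            matched = [max(matched, key=lambda m: weight(tokens, m[0][-1]))]
        if matched:
            candidates = matched
        elif level < depth - 1:              # zero hits: try every lower lexicon
            candidates = [(path + (kw,), sub)
                          for path, lex in candidates
                          for kw, sub in lex.items()]
        else:                                # zero hits at the lowest level
            return None                      # similarity check / manual review
    return candidates[0][0]


# Hypothetical three-level lexicon used only for this example.
LEXICON = {
    "accessories": {
        "automobile": {"battery": {}, "tire": {}},
        "phone": {"battery": {}, "case": {}},
    },
}

print(classify_by_levels(["phone", "battery", "accessories"], LEXICON, 3))
print(classify_by_levels(["automobile", "tire"], LEXICON, 3))
```

In the second call no top-level keyword is present, so the upper label is recovered from the parent of the matched lower-level keyword, as the label-combination step requires.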

The weight of a keyword relative to the title is calculated from one or more of the following parameters: the keyword's position in the title, its frequency of occurrence, its TFIDF value, its correlation with the other word segments in the title, and the attribute of the word. For example, with position in the title as the basis, the 1st, 2nd, 3rd, … nth word segments are assigned the weights 10, 9, 8, … x respectively. If keyword A and keyword B, both matched in the classification lexicon of some level, are the 1st and 3rd word segments in the title, then the weight of keyword A relative to the title is 10 and that of keyword B is 8, so keyword A is selected as the commodity's classification label at that level.
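The position-based example can be reproduced with a few lines of code. The helper names `position_weights` and `pick_label` and the `top_weight` parameter are illustrative assumptions; only the 10, 9, 8 scheme comes from the text.

```python
def position_weights(title_tokens, matched_keywords, top_weight=10):
    """Weight each matched keyword by its position: 1st -> 10, 2nd -> 9, 3rd -> 8."""
    return {kw: top_weight - title_tokens.index(kw) for kw in matched_keywords}


def pick_label(title_tokens, matched_keywords):
    """Select the matched keyword with the largest position weight."""
    weights = position_weights(title_tokens, matched_keywords)
    return max(weights, key=weights.get)


title = ["A", "filler", "B"]                  # keyword A is 1st, keyword B is 3rd
print(position_weights(title, ["A", "B"]))    # {'A': 10, 'B': 8}
print(pick_label(title, ["A", "B"]))          # A
```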

The first classification method has simple rules, a small computational load and high accuracy, but it has the following disadvantage. In its multi-level classification lexicon, each classification at each level has only one corresponding keyword; during step-by-step matching, if the title contains no word that matches the keyword of a classification at that level, the commodity cannot be assigned to that classification. When the lexicon still holds few keywords in the early stage, classification therefore fails frequently, and manual intervention and manual lexicon expansion are frequent. (The initial lexicon is built from a batch of sample commodity titles, for example 5000 titles read at random from the platform database of an e-commerce platform; it is not feasible to read all the commodity titles on the platform, perhaps millions or tens of millions, to build it.) The degree of automation improves gradually as the keywords in the lexicon keep expanding.

To address the disadvantage of the first classification method, the invention provides, under the same inventive concept, a second classification method: an e-commerce commodity classification method based on a hierarchical combination model, characterized in that the hierarchical classification lexicon comprises classification nodes of several levels with hierarchical correspondence, each upper-level classification node corresponds to several lower-level classification nodes, and each classification node has a node label. The node information of each classification node (the information held under a classification node is called node information) comprises at least one keyword belonging to that node and the weight of each keyword; the node information of a lowest-level classification node additionally comprises the path weight threshold of the path on which that node lies. The classification of a commodity is determined by a path composed of classification nodes of each level.

When a commodity is classified, its word-segmented title (a set of several word segments) is matched step by step against the hierarchical classification lexicon, from the top level to the bottom level. For each path formed by classification nodes at which keywords were matched, the path weight of the title is calculated; if the result is greater than the path weight threshold of that path, the path is taken as the classification result of the commodity.

Further, the rule of the step-by-step matching is as follows:

If the commodity title matches a keyword at an upper-level classification node, then at the next level it is matched against each lower-level classification node corresponding to that upper-level node.

If the commodity title matches zero keywords at the classification nodes of any level, the commodity title data is saved for manual classification.

Further, the path weight of the title of a commodity to be classified is calculated as follows: for each classification node in the path, the weights of all keywords under that node that the title matches are averaged, giving the title's average weight at that node; the average weights at all classification nodes in the path are then combined by weighted summation to give the title's path weight under that path.
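A minimal sketch of this calculation follows, assuming each node's keywords are stored as a dictionary from keyword to weight. The function name `path_weight`, the per-level coefficients and the sample numbers are assumptions: the text calls for a weighted summation but does not fix the coefficients.

```python
def path_weight(title_tokens, path_nodes, level_coefficients):
    """Weighted sum of per-node average keyword weights along one path.

    path_nodes: list of {keyword: weight} dicts, top level first.
    Returns None when some node on the path matches no keyword.
    """
    tokens = set(title_tokens)
    total = 0.0
    for node_keywords, coeff in zip(path_nodes, level_coefficients):
        hits = [w for kw, w in node_keywords.items() if kw in tokens]
        if not hits:
            return None                    # the path is not a candidate
        total += coeff * sum(hits) / len(hits)
    return total


# Hypothetical node information for a single three-level path.
path = [
    {"accessory": 0.8, "digital": 0.6},    # large-class node
    {"car": 0.9},                          # industry node
    {"battery": 1.0, "12v": 0.5},          # category node
]
print(path_weight(["digital", "car", "battery", "12v"], path, [0.2, 0.3, 0.5]))
```

With these numbers the node averages are 0.6, 0.9 and 0.75, so the path weight is 0.2 × 0.6 + 0.3 × 0.9 + 0.5 × 0.75 = 0.765, up to floating-point rounding.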

The second classification method has relatively complex rules, but it offers a high degree of automation and requires little manual intervention.

Preferably, the hierarchical combination model is a three-layer combination model comprising a three-layer classification lexicon composed of large-class classification nodes corresponding to the large classes to which commodities belong, industry classification nodes corresponding to the industries in which commodities are applied, and category classification nodes corresponding to the commodities' own categories. The node information of the large-class classification nodes is used to match large-class keywords against the titles of the commodities to be classified; the node information of the industry classification nodes is used to match industry keywords; and the node information of the category classification nodes is used to match category keywords.

Further, the classification method specifically comprises the following steps:

S1, establishing a hierarchical combination model according to the classification of the e-commerce platform, and constructing a hierarchical classification lexicon based on the model;

S2, automatically acquiring the data of the commodities to be classified from the commodity database; standardizing the title data in the commodity title field by removing invalid characters and stop words, and then segmenting the standardized title into words;

S3, matching the word-segmented title step by step against the keywords in the node information of the classification nodes of each level in the constructed lexicon; every classification node at which a keyword is matched becomes a candidate classification node of the title at that node's level, and the candidate nodes of all levels are combined according to the hierarchical correspondence to form several candidate paths;

S4, calculating the path weight of the title under each candidate path, comparing it with the path weight threshold stored in the node information of the path's lowest-level classification node, and selecting the candidate paths whose weight exceeds the threshold as the final classification result of the commodity;

the path weight is calculated as follows: for each classification node in the path, the weights of all keywords under that node that the title matches are averaged, giving the title's average weight at that node; the average weights at all classification nodes in the path are then combined by weighted summation to give the title's path weight under that path;

S5, in steps S3 and S4, when a commodity matches no candidate classification node at some level, or when no candidate path weight reaches its threshold so that no classification result is obtained, the commodity title is placed in a classification auxiliary module. This module stores all commodity title data without classification results, for later manual screening and classification of the missed commodities; once a certain amount of manually labeled data has accumulated, it is added as samples to the sample database mentioned in S1.
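Steps S3 to S5 can be sketched as a small tree search. Everything named here is an assumption for illustration: the `Node` class, the `classify` and `candidate_paths` functions, the per-level coefficients and the sample tree. The title is assumed to be already standardized and segmented (S2), and all paths are assumed to have the same depth.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    keywords: dict                          # keyword -> weight
    children: list = field(default_factory=list)
    threshold: float = 0.0                  # used only on lowest-level nodes


def node_score(node, tokens):
    """Average weight of the node's keywords found in the title, or None."""
    hits = [w for kw, w in node.keywords.items() if kw in tokens]
    return sum(hits) / len(hits) if hits else None


def candidate_paths(nodes, tokens, prefix=()):
    """S3: yield every root-to-leaf path whose nodes all match a keyword."""
    for node in nodes:
        score = node_score(node, tokens)
        if score is None:
            continue
        path = prefix + ((node, score),)
        if node.children:
            yield from candidate_paths(node.children, tokens, path)
        else:
            yield path


def classify(roots, title_tokens, coeffs, pending):
    """S4: keep paths whose weight exceeds the leaf threshold; S5: park misses."""
    tokens = set(title_tokens)
    results = []
    for path in candidate_paths(roots, tokens):
        weight = sum(c * s for c, (_, s) in zip(coeffs, path))
        if weight > path[-1][0].threshold:
            results.append(("/".join(n.label for n, _ in path), weight))
    if not results:
        pending.append(title_tokens)        # kept for manual classification
    return results


# Hypothetical lexicon tree with two paths.
car_leaf = Node("automobile batteries", {"battery": 1.0}, threshold=0.6)
phone_leaf = Node("phone batteries", {"battery": 1.0}, threshold=0.6)
root = Node("digital accessories", {"digital": 0.7, "accessory": 0.5},
            children=[Node("automobile", {"car": 0.9}, children=[car_leaf]),
                      Node("mobile phone", {"phone": 0.9}, children=[phone_leaf])])
pending = []
print(classify([root], ["digital", "car", "battery"], (0.2, 0.3, 0.5), pending))
```

The `pending` list stands in for the classification auxiliary module; a real system would persist those titles for later manual labeling.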

Further, step S1 includes the following sub-steps:

S1-1, hierarchical division: a hierarchical combination model is established according to the actual product classification of the e-commerce platform. Each upper-level classification node corresponds to several lower-level classification nodes, and each lower-level classification node has exactly one corresponding upper-level node. Each classification node has its own node label; classification nodes of all levels that correspond to one another combine to form a path, and the classification result of a commodity is finally represented by one or more paths;

S1-2, obtaining samples: for the established hierarchical combination model, the title data of a certain number of commodities is acquired from the commodity database for each path, and the node labels of all classification nodes in the path are marked on every piece of title data acquired for that path; the commodity title data comprises the commodity title and the commodity ID. Samples can be acquired with crawler technology. For example, if one path of the hierarchical division is 'digital accessories - automobile - automobile batteries', 'automobile batteries' can be searched on the e-commerce platform, the commodity titles found are captured, and each is marked with the three node labels 'digital accessories', 'automobile' and 'automobile batteries'; the commodities of every category classification node are searched in this way to obtain samples for all paths. The title and commodity ID of each commodity acquired for a path, together with the node labels of the classification nodes of all levels in that path, form one piece of sample data, and the sample data is written into a sample database record by record. Taking as an example a three-layer classification structure in which the primary classification node is the large-class node, the secondary node is the industry node and the tertiary node is the category node, one piece of sample data comprises five fields: commodity ID, commodity title, large-class label, industry label and category label;

S1-3, keyword extraction: first, the commodity title field of each piece of sample data in the sample database is standardized by removing invalid characters and stop words. The standardized title is then segmented into words, giving a word segment set that contains all the word segments of one title; for each word segment the set records two pieces of information, its content and its position (which word segment of the title it is). A title may contain two or more word segments with the same content at different positions, and these are stored separately in the set. Finally, the word segment sets of all commodity titles that carry the same node label are merged to give the word segment set of that node label (segments with the same content, or even the same content and position, may occur and are stored separately). For every keyword in the word segment set of a node label (word segments with the same content count as one keyword), the word frequency is counted and the TFIDF value is calculated, and the keyword's positions in all commodity titles under the node label are averaged to give its position information under that label. This yields, for each node label, the keywords after merging identical word segment contents, together with each keyword's word frequency, TFIDF value and position information;

S1-4, keyword weight calculation: from the keywords, word frequencies, TFIDF values and position information obtained in the previous step for all node labels of each level, the weight of each keyword under each node label is calculated; the n keywords with the highest weights are selected and stored, together with their weights, in the node information of the classification node corresponding to that node label in the hierarchical classification lexicon;

the keyword weight is calculated as a weighted sum of the keyword's word frequency, TFIDF value and position information;

S1-5, path weight threshold calculation: after the previous step, the hierarchical classification lexicon holds three items of node information for every classification node: label, keywords and keyword weights. Tracing each lowest-level classification node back up level by level gives a unique path, and one further item must be stored in the lowest-level node, namely the weight threshold of that path. To obtain it, every word-segmented commodity title in the sample data that belongs to the same path (the three labels of each piece of sample data, combined in top-to-bottom order, form a path) is matched against the keywords of each classification node of that path in the lexicon, and the weights of the matched keywords are extracted. A title may match several keywords at one classification node; the mean of their weights is taken as the title's weight at that node. The title's weights at all classification nodes of the path are then combined by weighted summation to give the title's weight under the path. Once the path weights of all sample titles under each path are known, the minimum path weight among the titles of a path is selected as that path's weight threshold and stored in the node information of the path's lowest-level classification node. This completes the construction of the hierarchical classification lexicon.
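Sub-steps S1-3 to S1-5 might be sketched as follows. The text does not define the TFIDF variant, the position score or the summation coefficients, so the smoothed inverse document frequency over node labels, the score `1 / (1 + average position)` and all coefficient values below are assumptions, as are the function names `keyword_weights` and `path_thresholds`.

```python
import math
from collections import Counter, defaultdict


def keyword_weights(samples, level, top_n=3, coeffs=(0.4, 0.4, 0.2)):
    """S1-3 and S1-4: return {node_label: {keyword: weight}} for one level.

    samples: list of (tokens, labels); labels is the path tuple, top level first.
    """
    w_tf, w_tfidf, w_pos = coeffs
    counts = defaultdict(Counter)                       # label -> keyword -> count
    positions = defaultdict(lambda: defaultdict(list))  # label -> keyword -> positions
    for tokens, labels in samples:
        for pos, tok in enumerate(tokens):
            counts[labels[level]][tok] += 1
            positions[labels[level]][tok].append(pos)
    n_labels = len(counts)
    result = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        weights = {}
        for kw, freq in counter.items():
            tf = freq / total
            doc_freq = sum(1 for other in counts.values() if kw in other)
            tfidf = tf * (math.log((1 + n_labels) / (1 + doc_freq)) + 1)
            avg_pos = sum(positions[label][kw]) / len(positions[label][kw])
            pos_score = 1 / (1 + avg_pos)               # assumed: earlier scores higher
            weights[kw] = w_tf * tf + w_tfidf * tfidf + w_pos * pos_score
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        result[label] = dict(top)
    return result


def path_thresholds(samples, lexicon_by_level, level_coeffs):
    """S1-5: the minimum sample path weight becomes the threshold of each path."""
    thresholds = {}
    for tokens, labels in samples:
        weight = 0.0
        for level, label in enumerate(labels):
            node = lexicon_by_level[level][label]
            hits = [w for kw, w in node.items() if kw in tokens]
            if hits:
                weight += level_coeffs[level] * sum(hits) / len(hits)
        thresholds[labels] = min(weight, thresholds.get(labels, weight))
    return thresholds
```

A sample whose path weight equals the minimum would not pass a strict "greater than" comparison at classification time, so an implementation may prefer to lower the stored threshold slightly; the text does not say how this boundary case is handled.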

The second classification method of the invention has two major characteristics that distinguish it from existing classification methods.

The first is that the method does not depend on any information other than the title: the classification of a commodity is analyzed and judged solely from the text contained in its title, and no auxiliary information (for example, the commodity attributes mentioned in cited document 1) needs to be configured for the classification process. The cooperation required from platform users is thereby reduced to a minimum: a user only fills in the title when adding a commodity and does no other work for the sake of classification.

The text information contained in a commodity title is limited. Making full use of that limited information to classify relatively accurately and efficiently is the second major characteristic of the method. This characteristic effectively reduces the amount of manual intervention; it distinguishes the method from existing classification methods and is also its greatest difference from the first classification method of the invention.

When a platform user sets a title for a commodity, the user puts into the title various terms by which the commodity might be searched, so that buyers can find it easily. Although these terms are not the names of the upper-level classifications (for example the large-class name "home life"), a comprehensive analysis of them can often reveal which upper-level classification the commodity belongs to. In other words, when a person reads a commodity title, the many words stacked in it (the title length allowed by an e-commerce platform can reach 32 characters) may not include the name of the top-level class, yet the person can identify that class by analyzing the words together. This shows that a term in a commodity title is useful for identifying the commodity's upper-level classification even if it is not the name of that classification.

Furthermore, a person who reads a title can classify it directly because, besides considering all the text in the title, the person also draws on experience. For example, the person knows which industries a given noun stands for, and, when one word precedes another in the title, knows from common sense that the earlier word more probably represents a higher-level classification and the later word a lower-level one.

That is, when a person classifies a commodity hierarchically, an empirical analysis is performed in addition to the text analysis.

Therefore, on the one hand, the second classification method of the invention introduces the two concepts "classification node" and "path". Classification nodes (no longer single keywords) correspond to the classification levels, and a path composed of the classification nodes of the corresponding levels (no longer a combination of per-level keywords) expresses the classification result of a commodity. A large number of keywords is stored in each classification node for word matching. The keywords come from sample commodity titles; a keyword may be the name of the classification itself, another noun, an adjective and so on: any word that helps identify the class of a commodity is placed in the classification node corresponding to that class. During step-by-step matching, a level is no longer decided by matching a single keyword; all keywords in a classification node are matched, and the title is considered to match the node as long as it matches any one of them. The text information in the title is thus fully used. Compared with the first method and given the same number of samples, the probability of classification failure is greatly reduced, which effectively reduces the early-stage manual intervention needed to run the classification process (manual lexicon expansion after a classification failure).

On the other hand, the invention introduces the two concepts "keyword weight" and "path weight", and uses weight analysis to play the role that personal experience plays in manual classification. In fact, the experience a person applies during manual classification can itself be derived from commodity titles. First, a person can tell from a word which industry a commodity belongs to because the person has often seen that word in that industry; a computer can obtain the same experience by counting how frequently keywords occur in the commodity titles of each existing industry. Second, a person judges from the order of the terms which term more probably represents which level of classification; a computer can do the same by applying set rules. None of these analyses requires anything beyond the commodity title.

The second method of the invention therefore adds a weight analysis step, built on these computer-implementable analyses, to do the work of the empirical analysis in manual classification. The method thus uses as much of the information contained in the title as possible and raises the probability of successful classification, further reducing the early-stage manual intervention needed to run the classification process.

Based on the second classification method, the invention further provides an e-commerce commodity classification system based on a hierarchical combination model, characterized in that the system comprises a lexicon construction module, a data processing module, a classification matching module and a classification auxiliary module.

The data processing module comprises a standardization module and a word segmentation module. The standardization module removes invalid characters and stop words from the titles of the commodities to be classified. The word segmentation module performs word segmentation on the standardized commodity title field to obtain a number of words and marks the position of each word.

The lexicon construction module is used to build the hierarchical classification lexicon and comprises a hierarchy division module, a sample acquisition module, a keyword extraction module, a keyword weight calculation module and a path weight calculation module.

Hierarchy division module: establishes the hierarchical combination model according to the classification scheme of the e-commerce platform. Each upper-level classification node corresponds to several lower-level classification nodes, each lower-level classification node has exactly one upper-level classification node, each classification node has its own node label, and the classification nodes of all levels that correspond to one another combine to form a path.

Sample acquisition module: acquires the title data of a certain number of commodities for each path from the e-commerce platform, according to the hierarchical combination model established by the hierarchy division module, and generates a sample database. In the sample database, each sample record contains the commodity ID of one commodity, its commodity title, and the node label of the classification node at each level of the path to which the commodity belongs.

Keyword extraction module: extracts keywords from the sample data and computes the word frequency, TFIDF value and position information of each keyword.

Keyword weight calculation module: calculates the weight of the keywords extracted by the keyword extraction module and stores the keywords that satisfy the weight condition, together with their weights, in the corresponding classification nodes of the hierarchical combination model.

Path weight calculation module: calculates the weight of each path in the hierarchical combination model and stores it in the lowest-level classification node of the path.

Classification matching module: matches the titles of the commodities to be classified against the hierarchical classification lexicon and takes the paths that satisfy the matching conditions as the final classification of the commodities.

Classification auxiliary module: processes the commodity titles that fail to match.

Further, the sample acquisition module acquires a sufficient quantity of commodity data, labels it as labeled samples, and writes the labeled samples into a DataFrame so that they become sample data containing the two fields commodity ID and commodity title; it then performs visual analysis on the sample data to check for abnormal data, checks the proportion of each kind of sample, and adjusts the samples accordingly.

Further, the keyword extraction module segments the commodity title field of the prepared labeled samples, then counts the keyword frequencies under each classification node of each level, and stores the keywords and keyword frequencies of each classification node in the node information of the hierarchical classification lexicon; it also counts the keyword frequencies of all commodities under each path and stores the keyword frequencies of each path in the path keyword information field of the hierarchical classification lexicon.

Further, the keyword weight calculation module performs a weighted summation of the keyword frequency, the keyword TFIDF value and the keyword position information of each keyword under each node label of each level to obtain the weights of all keywords of the classification node corresponding to that node label. It then selects the several highest-weight keywords of the classification node, giving the high-weight keywords of the node and the weight value of each, and stores them in the node information of that classification node in the lexicon.

Further, the path weight calculation rule is as follows: the weights of all keywords matched by the commodity title under each classification node of the path are averaged to obtain the average weight of the commodity title at each classification node of the path, and the average weights of all classification nodes of the path are then weighted and summed to obtain the path weight of the commodity title under that path.
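The rule above can be illustrated with a minimal Python sketch. The function name `path_weight`, the example keywords and weights, and the per-level coefficients `layer_weights` are assumptions made for illustration; the description does not state which coefficients are used in the weighted summation, nor how a node without any matched keyword is handled (here the path is simply rejected).

```python
from statistics import mean

def path_weight(title_words, path_nodes, layer_weights):
    """Path weight of one segmented title along one path.

    title_words   -- set of words obtained from the segmented title
    path_nodes    -- one dict per level, mapping keyword -> keyword weight
    layer_weights -- one coefficient per level (assumed; not given in the text)
    """
    total = 0.0
    for node_keywords, coeff in zip(path_nodes, layer_weights):
        matched = [w for kw, w in node_keywords.items() if kw in title_words]
        if not matched:
            return None  # assumed handling: no match at this node, no path weight
        total += coeff * mean(matched)  # average weight at this node, then weighted sum
    return total

# Hypothetical node information for one path (top level first).
path = [
    {"car": 0.5, "vehicle": 0.25},  # major category node
    {"car": 0.75, "sedan": 0.5},    # industry node
    {"battery": 1.0, "12v": 0.5},   # category node
]
weight = path_weight({"car", "battery", "12v"}, path, [0.25, 0.25, 0.5])
```

In this example the title matches two keywords at the category node, so their weights are averaged before the level coefficient is applied.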

Further, the classification matching module matches the segmented commodity title data level by level against the keywords in the node information of the classification nodes of each level of the constructed hierarchical classification lexicon, and takes each classification node in which a keyword is matched as a candidate classification node of the commodity title at that level. The candidate classification nodes obtained for the commodity title at all levels are then combined according to the correspondence between levels to form a number of candidate paths.

As one option, the classification auxiliary module stores the commodity titles that fail to match, so that they can be used as samples when the hierarchical classification lexicon is updated with manual intervention.

As another option, the classification auxiliary module performs similarity matching between a commodity title that fails to match and the titles of all classified commodities, and takes the classification of the classified commodities whose similarity exceeds a threshold as the classification result.
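The similarity-based option can be sketched as follows. The description does not name a similarity measure, so the character-level ratio from Python's standard `difflib.SequenceMatcher` is used here purely as a stand-in; the function name `classify_by_similarity`, the threshold value 0.8 and the choice to return only the single best match are likewise assumptions.

```python
from difflib import SequenceMatcher

def classify_by_similarity(title, classified, threshold=0.8):
    """Return the path of the most similar classified title, or None.

    classified -- list of (title, path) pairs for already classified commodities
    """
    best_path, best_score = None, 0.0
    for known_title, path in classified:
        # ratio() is a stand-in similarity in [0, 1]; any other measure could be used.
        score = SequenceMatcher(None, title, known_title).ratio()
        if score > best_score:
            best_path, best_score = path, score
    return best_path if best_score > threshold else None

classified = [
    ("12v car battery", ("transportation", "automobile", "car battery")),
    ("cr2032 button battery", ("home life", "home appliances", "button battery")),
]
result = classify_by_similarity("12v car battery", classified)
```

A title that resembles no classified title stays unclassified (`None`) and can still be stored for manual review.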

Further, the hierarchical combination model has a three-level structure whose levels are, in order, major category, industry and category. The hierarchical classification lexicon accordingly contains three levels of classification node information: major category node information, industry node information and category node information. The major category node information is used to match the title of a commodity to be classified against major category keywords, and the node information of any major category whose keyword is matched is saved. The industry node information is used to match titles that matched a major category keyword against industry keywords, and the node information of any industry whose keyword is matched is saved. The category node information is used to match titles that matched an industry keyword against category keywords, and the node information of any category whose keyword is matched is saved. Combining the matched classification nodes of each level yields, for each commodity title, a number of paths together with the keyword information of each classification node on each path. Finally, the path weight of the commodity title under each path is calculated and compared with the path weight threshold of that path in the hierarchical classification lexicon, and the paths whose weight exceeds the threshold are taken as the classification result.

The beneficial effects of the invention are as follows. The classification method based on the hierarchical combination model combines e-commerce with natural language processing (NLP) and, compared with traditional NLP classification methods, addresses the problems of the prior art in the following respects: 1. a large amount of labeled data is not needed, and accurate results can be obtained by analyzing only a small number of samples; 2. a unified classification standard is designed across industries, which facilitates the analysis and statistics of e-commerce data; 3. through the hierarchical combination model, the category of a commodity can be analyzed from several dimensions of its title, which reduces confusion; 4. unlike existing machine learning and deep learning classification models, the model is not a black box, and its intermediate design leaves considerable room for optimization and adjustment, whereas a traditional model can only be optimized by tuning parameters.

In addition, the invention has the following advantages: 1. the system is weakly coupled to the database and can be used with any database that needs commodity classification; only a Redis module is needed to transfer data and exchange messages, and the system does not have to run on the same server as the database; 2. the system can be optimized iteratively and continuously: the hierarchical classification lexicon, the keyword weights and the weight calculation model can all be updated, so that the accuracy gradually approaches 100% as data accumulates; 3. memory usage is small: when no training task is received only a small amount of memory is needed, and if the amount of data in the Redis module is very large, the data can be processed in batches; 4. after the whole classification process is finished, the original fields are unaffected and the original data is unchanged; only the new classification fields are added; 5. provided there is enough memory, tens of thousands or even hundreds of thousands of commodity records can be processed at a time.

Drawings

FIG. 1 is a schematic structural diagram of the hierarchical combination model according to the present invention.

FIG. 2 is a schematic diagram of information contained in each node in the hierarchical combination model according to the present invention.

Detailed Description

The technical solution proposed by the present invention is further explained below with reference to the accompanying drawings.

The e-commerce commodity classification system based on the hierarchical combination model is built in a Python environment. The classification system consists mainly of four functional modules: a hierarchical classification lexicon construction module, a data processing module, a classification matching module and a classification auxiliary module. The four modules carry out four process steps respectively: construction of the hierarchical classification lexicon, data processing, classification matching and auxiliary classification. Together these four steps form the complete process by which the classification system (hereinafter "the system" or "the classification system") classifies e-commerce commodities.

Before automatic classification, the classification system first needs to construct a hierarchical classification lexicon through the lexicon construction module. Constructing the lexicon comprises five steps: hierarchy division, sample acquisition, keyword extraction, keyword weight calculation and path weight threshold calculation.

Step one, hierarchy division: according to the actual category division of the platform, three levels are defined here from top to bottom, namely major category, industry and category, and a hierarchical combination model is generated (its structure is shown in FIG. 1). In the hierarchical combination model, each upper-level classification node corresponds to several lower-level classification nodes (the node structure is shown in FIG. 2), each lower-level classification node has exactly one upper-level classification node, and each classification node has its own node label. Following each top-level classification node down level by level yields all combinations of classification nodes across the three levels; each such combination is called a path, and the classification result of a commodity is finally expressed by one or more paths. For example:

Major category names: home life, electronic communication, business office, outdoor sports, transportation, industrial supplies, and the like;

Industry names: mobile phones, cameras, bicycles, automobiles, home appliances, and the like;

Category names: button batteries, car batteries, mobile phone batteries, and the like.

In this three-level classification structure, each level has a number of classification nodes, and each classification node below the top level is a child of one classification node of the level above.

Here "home life - home appliances - button battery" is one path, "transportation - automobile - car battery" is another, and "electronic communication - mobile phone - mobile phone battery" is a third. That is, a button battery category node sits under the home appliances industry node under the home life major category node; a car battery category node sits under the automobile industry node under the transportation major category node; and a mobile phone battery category node sits under the mobile phone industry node under the electronic communication major category node. All three are batteries, but they belong to three categories on different paths.
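As a minimal illustration of this structure, the three example paths can be represented in Python by a child-to-parent mapping, since each lower-level node has exactly one upper-level node. The names `PARENT` and `path_of` are hypothetical and are not part of the described system.

```python
# child label -> parent label; each lower-level node has exactly one parent
PARENT = {
    "home appliances": "home life",
    "automobile": "transportation",
    "mobile phone": "electronic communication",
    "button battery": "home appliances",
    "car battery": "automobile",
    "mobile phone battery": "mobile phone",
}

def path_of(leaf):
    """Walk from a lowest-level node up to the top level to recover its unique path."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return tuple(reversed(path))

# Lowest-level nodes are the labels that are nobody's parent.
leaves = [label for label in PARENT if label not in PARENT.values()]
paths = [path_of(leaf) for leaf in leaves]
```

Because every node has a single parent, each lowest-level node determines exactly one path, which is why the three battery categories remain distinct.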

Step two, sample acquisition: according to the result of the hierarchy division, a sufficient amount of commodity title data (commodity titles and commodity IDs) is acquired for each path, and each item is marked with the node label of every level to serve as a sample. The samples can be acquired with crawler technology. For example, for the path "transportation - automobile - car battery", one can search for "car battery" on an e-commerce platform, collect the commodity titles returned by the search, and mark them with the three node labels "transportation", "automobile" and "car battery"; searching for the commodities of every category node in this way yields the samples of all paths. The acquired commodity titles and commodity IDs of all paths, together with the node labels marked on each commodity title, are written into a sample database record by record as sample data. Each sample record contains five fields: commodity ID, commodity title, major category label, industry label and category label.

Step three, keyword extraction: first, the commodity title field of each sample record in the sample database is standardized by removing invalid characters and stop words. The standardized commodity titles are then segmented, giving for each title a word set made up of all the words in that title. The word set records both the content of each word and its position in the title; two words with the same content in one title are stored as two separate entries because their positions differ. Finally, the word sets of all commodity titles in the sample data that carry the same major category node label are merged to obtain the word set of that major category node label. Entries with the same content and the same position may occur in the merged set and are not collapsed. Word frequency statistics are computed and a TFIDF value is calculated for every keyword (words with the same content count as one keyword) in the merged word set of each major category node label, giving the word frequency and TFIDF value of each keyword under that label. The positions of a keyword in all commodity titles under the label are averaged to give the position information of the keyword under that node label. This yields, for each major category node label, the keywords obtained by merging identical words, together with their word frequency, TFIDF value and position information. The same operation is then carried out on the commodity titles sharing the same industry node label, and finally on the commodity titles sharing the same category node label.
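The statistics of this step can be sketched in Python for a single level as shown below. The description does not give the TFIDF formula, so the variant used here is an assumption: the merged word set of each node label is treated as one document, term frequency is the share of the word within that label, and the inverse document frequency is the logarithm of the number of labels divided by the number of labels containing the word. The function name `keyword_stats` and the sample titles are hypothetical.

```python
import math
from collections import defaultdict

def keyword_stats(samples):
    """samples: list of (node_label, [word, ...]) with words in title order.

    Returns {label: {word: (frequency, tfidf, mean_position)}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    positions = defaultdict(lambda: defaultdict(list))
    for label, words in samples:
        for pos, word in enumerate(words):
            counts[label][word] += 1             # repeated words are counted separately
            positions[label][word].append(pos)   # position of the word in its title

    n_labels = len(counts)
    stats = {}
    for label, freq in counts.items():
        total = sum(freq.values())
        stats[label] = {}
        for word, f in freq.items():
            labels_with_word = sum(1 for c in counts.values() if word in c)
            # Assumed TF-IDF variant: one "document" per node label.
            tfidf = (f / total) * math.log(n_labels / labels_with_word)
            mean_pos = sum(positions[label][word]) / f
            stats[label][word] = (f, tfidf, mean_pos)
    return stats

samples = [
    ("car battery", ["12v", "car", "battery"]),
    ("car battery", ["car", "battery", "charger"]),
    ("button battery", ["cr2032", "button", "battery"]),
]
stats = keyword_stats(samples)
```

Under this variant a word such as "battery" that appears under every label gets a TFIDF value of zero, which reflects that it does not help distinguish one node from another at that level.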

Step four, keyword weight calculation: the weight of the keywords of each classification node is calculated from the keywords, keyword frequencies, TFIDF values and keyword position information obtained in the previous step for all node labels of each level. This gives the keywords of each classification node and their weights; the n highest-weight keywords and their weights are then selected and stored in the node information of the lexicon.

The specific calculation method is as follows: for each keyword under each node label of each level, the keyword frequency, the keyword TFIDF value and the keyword position information are weighted and summed to obtain the weights of all keywords of the classification node corresponding to that node label. The 16 highest-weight keywords (assuming a title is at most 32 words long) are selected from all keywords of the classification node, giving the high-weight keywords of the node and the weight value of each, which are stored in the node information of that node in the lexicon.
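A minimal Python sketch of this weighted summation and top-n selection follows. The coefficients in `coeffs` are assumptions, as is the premise that the three inputs have already been scaled to a common range (the description specifies neither); the function name `top_keywords` and the sample values are hypothetical.

```python
def top_keywords(stats, coeffs=(0.5, 0.25, 0.25), n=16):
    """stats: {word: (frequency, tfidf, position_score)}, each assumed scaled to [0, 1].

    Returns the n highest-weight keywords as {word: weight}.
    """
    a, b, c = coeffs
    # Weighted sum of the three statistics gives the keyword weight.
    weights = {w: a * f + b * t + c * p for w, (f, t, p) in stats.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])

# Hypothetical statistics for one classification node.
node_stats = {
    "battery": (1.0, 0.5, 0.5),
    "car": (0.5, 1.0, 0.5),
    "new": (0.25, 0.0, 0.0),
}
node_keywords = top_keywords(node_stats, n=2)
```

With n set to 2 in this example, the low-weight word "new" is dropped and only "battery" and "car" are stored in the node information.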

Step five, path weight threshold calculation: after the previous step, the hierarchical classification lexicon holds three items of node information for every classification node: its label, its keywords and the keyword weights. Tracing each lowest-level classification node back up through the levels gives a unique path, and one further item must be stored in the lowest-level classification node, namely the weight threshold of that path. To obtain it, each segmented commodity title in the sample data that belongs to a given path (the three labels of each sample record, combined in order from upper to lower level, form its path) is matched against the keywords of each classification node of that path in the hierarchical classification lexicon, and the weights of the matched keywords are extracted. A commodity title may match several keywords in a classification node of the path; the mean of their weights is taken as the weight of the commodity title at that classification node. The weights of the commodity at all classification nodes of the path are then weighted and summed to give the weight of the commodity under the path. In this way the path weight of every sample commodity title under its path is obtained. The minimum of these path weights is chosen as the path weight threshold and stored in the node information of the lowest-level classification node of the path in the hierarchical classification lexicon. This completes the construction of the hierarchical classification lexicon.
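The threshold calculation can be sketched in Python as follows. The per-level coefficients `layer_weights` are assumed, and a sample title that matches no keyword at a node is given a score of zero at that node, which is also an assumption since the description does not cover that case. The function name `path_weight_threshold` and the sample data are hypothetical.

```python
from statistics import mean

def path_weight_threshold(sample_titles, path_nodes, layer_weights):
    """Smallest path weight among the sample titles labelled with this path.

    sample_titles -- list of word lists (segmented sample titles of one path)
    path_nodes    -- one {keyword: weight} dict per level of the path
    layer_weights -- assumed per-level coefficients for the weighted sum
    """
    weights = []
    for words in sample_titles:
        node_scores = []
        for keywords in path_nodes:
            matched = [keywords[w] for w in set(words) if w in keywords]
            # Assumed handling: no matched keyword at a node contributes 0 there.
            node_scores.append(mean(matched) if matched else 0.0)
        weights.append(sum(c * s for c, s in zip(layer_weights, node_scores)))
    return min(weights)

path_nodes = [{"car": 0.5}, {"car": 0.75}, {"battery": 1.0, "12v": 0.5}]
titles = [["12v", "car", "battery"], ["car", "battery"]]
threshold = path_weight_threshold(titles, path_nodes, [0.25, 0.25, 0.5])
```

Taking the minimum means that every sample title of the path would itself reach the stored threshold, so the threshold is only as strict as the weakest labelled sample.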

The e-commerce commodity data is stored in a platform database, and the database and the classification system communicate and exchange data through a Redis module. When commodity data in the database (comprising commodity IDs and commodity titles) needs to be classified, the commodity IDs and commodity titles to be classified are written into the Redis module as hashes; each hash holds one record, stored in dictionary form. Once the data has been stored, a notification is sent to the classification system, which then starts to read the data from the Redis module. After reading, the classification system structures each commodity record, classifies it, and marks it with major category, industry and category labels. When classification is finished, the classification system writes the labeled data back to the Redis module and notifies the database side. On receiving the message that the classification task is finished, the database side reads the completed commodity data from the Redis module, parses it and stores it in the database, which completes the whole classification process.

Specifically, the classification system reads the data from the Redis module, and the data processing module extracts the fields from the data and structures them, as follows: (1) each record contains a commodity ID and a commodity title; each record is read in dictionary form and appended to a list, forming a commodity data list; (2) the commodity data list is converted into a pandas DataFrame; when the DataFrame object is created, the list is automatically parsed into the corresponding fields and data, giving a DataFrame whose columns are the commodity ID and the commodity title; (3) for each record in the commodity title column of the DataFrame, regular expressions from the re module are used to remove invalid characters, stop words and the like, and jieba (a Chinese word segmentation tool) is then used to segment the commodity title. This concludes the work of the data processing module.
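The cleaning and segmentation in item (3) can be sketched with the standard re module alone. This is a simplified stand-in rather than the described implementation: whitespace splitting replaces jieba so that the sketch runs without third-party packages (a segmenter such as `jieba.lcut` could be passed as `tokenize` for Chinese titles), stop words are removed after segmentation rather than before, and the stop-word list, the regular expression and the function name `preprocess` are all hypothetical.

```python
import re

STOP_WORDS = {"new", "hot", "free", "shipping"}  # hypothetical stop-word list

def preprocess(title, tokenize=str.split):
    """Strip invalid characters, segment, drop stop words, and record word positions."""
    # Keep word characters (letters, digits, CJK) and whitespace; replace the rest.
    cleaned = re.sub(r"[^\w\s]", " ", title).lower()
    words = [w for w in tokenize(cleaned) if w not in STOP_WORDS]
    # Each word is paired with its position in the cleaned title.
    return list(enumerate(words))

tokens = preprocess("NEW!! 12V Car Battery *free shipping*")
```

Applied to a DataFrame, a function of this kind would be run once per value of the commodity title column.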

Next, the classification matching module calculates the path weights of each segmented commodity title and compares them with the path weight thresholds. Each segmented commodity title produced by the data processing module is matched level by level against the keywords in the node information of the classification nodes of each level of the constructed hierarchical classification lexicon. If a keyword in the node information of a classification node at the upper level is matched, the node label of that possible classification node is obtained for the title, together with all matched keywords and their weights, and the node becomes a candidate classification node of the title at that level; a level may have 0 to n candidate classification nodes. The same operation is then performed on all next-level classification nodes under each candidate classification node of the previous level, giving 0 to n next-level candidate classification nodes for each of them. Candidate classification nodes of the title are thus obtained at every level, and 0 to n candidate paths are formed from the parent-child relationships between the nodes. Once all candidate paths of all commodity titles have been obtained, the weight of each candidate path is calculated for each title by the path weight calculation method of step five of the lexicon construction and compared with the path weight threshold stored in the node information of the lowest-level classification node of that path in the hierarchical classification lexicon; candidate paths whose weight exceeds the threshold are selected as the final classification result of the commodity.
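The level-by-level search for candidate paths can be sketched in Python as a recursive walk over a nested lexicon. The nested-dictionary layout, the function name `candidate_paths` and the example keywords and weights are assumptions for illustration; the sketch covers only the matching stage, not the subsequent comparison with path weight thresholds.

```python
def candidate_paths(title_words, level_nodes, prefix=()):
    """Level-by-level matching of a segmented title against a nested lexicon.

    level_nodes: {label: {"keywords": {word: weight}, "children": {...}}}
    Returns every full path whose nodes each share at least one keyword with the title.
    """
    paths = []
    for label, info in level_nodes.items():
        if not info["keywords"].keys() & title_words:
            continue  # no keyword matched: this node is not a candidate
        path = prefix + (label,)
        if info["children"]:
            # Only the children of a candidate node are examined at the next level.
            paths += candidate_paths(title_words, info["children"], path)
        else:
            paths.append(path)
    return paths

# Hypothetical lexicon with two top-level nodes.
LEXICON = {
    "transportation": {"keywords": {"car": 0.5}, "children": {
        "automobile": {"keywords": {"car": 0.75}, "children": {
            "car battery": {"keywords": {"battery": 1.0}, "children": {}},
            "car tire": {"keywords": {"tire": 1.0}, "children": {}},
        }},
    }},
    "home life": {"keywords": {"home": 0.5}, "children": {
        "home appliances": {"keywords": {"battery": 0.25}, "children": {
            "button battery": {"keywords": {"battery": 1.0}, "children": {}},
        }},
    }},
}
found = candidate_paths({"12v", "car", "battery"}, LEXICON)
```

A title containing both "car" and "home" would yield two candidate paths in this example, which is the situation the path weight comparison is meant to resolve.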

After the above steps, some commodities have no classification result, either because they matched no candidate classification node at some level or because none of their candidate path weights reached the threshold; the titles of these commodities are passed to the classification auxiliary module. The classification auxiliary module stores all commodity title data without a classification result so that commodities that should have been classified can later be picked out manually; once a certain amount of manually labeled data has accumulated, it can be added as new samples to the sample database described in step two of the lexicon construction.

After the major category, industry and category of all commodity title data have been predicted, the classification results (hierarchical classification labels) are added to the DataFrame that stores the commodity IDs and commodity titles. The DataFrame is then converted back into a list of dictionaries with its to_dict(orient='records') method and written back to the Redis module, and the database side is notified that classification is complete.

Finally, the database side receives the classification completion message, reads the data from the Redis module and writes it back to the database.

The present invention is not limited to the above embodiments. Those skilled in the art may, in light of this disclosure, implement the invention in other embodiments or make simple changes or modifications to its design structure and idea, all of which fall within the scope of protection of the present invention.

Claims

1. A classification method of E-commerce commodities based on a hierarchical combination model is characterized in that the hierarchical classification lexicon comprises a plurality of hierarchical classification nodes with hierarchical corresponding relations, each upper classification node corresponds to a plurality of lower classification nodes, each classification node is provided with a node label, node information of each classification node comprises at least one keyword belonging to the classification node and the weight of each keyword, and node information of the lowest classification node also comprises a path weight threshold of a path where the lowest classification node is located; the classification of the commodities is determined by a path consisting of classification nodes at all levels;

when the commodities to be classified are classified, the commodity titles are matched with the hierarchical classification word bank step by step from the top to the bottom, the path weight of the commodity titles in each path formed by the classification nodes matched with the keywords is calculated, and if the result is greater than the path weight threshold value of the path, the path is used as the classification result of the commodity;

the hierarchical classification word library is constructed by the following method:

s1-2, obtaining a sample: aiming at the established hierarchical combination model, acquiring the title data of a certain number of commodities for each path from a commodity database, and marking the node labels of all classification nodes in the path as the title data of all the commodities acquired by the path, wherein the commodity title data comprises a commodity title and a commodity ID; taking the titles and commodity IDs of the commodities acquired for each path and the node labels of all levels of classification nodes of the paths to which the commodities belong marked by the title data of each commodity as sample data, and writing the sample data into a sample database one by one, wherein each sample data comprises the commodity ID of one commodity, the commodity title and the node labels of all levels of classification nodes in the path to which the commodity belongs;

s1-3, keyword extraction: firstly, standardizing the commodity title field of each piece of sample data in a sample database, and removing invalid characters and stop words; then, performing word segmentation processing on the standardized commodity title to obtain a word segmentation set which takes all the words contained in one title as the title, wherein the word segmentation set contains two information of the content and the position of each word; finally, merging the word sets of the commodity titles with the same node labels in the sample data to obtain a word set of each node label, carrying out word frequency statistics on all keywords in the word set of each node label and calculating a TFIDF value to obtain the word frequency and the TFIDF value of each keyword under the node label, averaging the positions of the keywords in all the commodity titles under the node label to serve as the position information of the keywords under the node label, and thus obtaining the keywords, the keyword word frequency, the TFIDF value and the keyword position information after all the same word contents under the node label are merged;

s1-5, calculating a path weight threshold value: matching each commodity title after word segmentation with the same path in sample data with a keyword of each classification node in the path in a hierarchical classification word stock, extracting the weight of the matched keyword, calculating the weight average value of the keywords as the weight value of the commodity title at the classification node, then carrying out weighted summation on the weights of the commodity at all the classification nodes to obtain the weight of the commodity under the path, thus obtaining the weight of all the commodity titles under the path in the sample data, selecting the minimum value of the path weight of the commodity title under the path as a path weight threshold value, and storing the minimum value into the node information of the lowest classification node of the path in the hierarchical classification word stock, so that the hierarchical classification word stock is constructed.

2. The e-commerce commodity classification method according to claim 1, wherein the rules of stage-by-stage matching are as follows:

if the commodity title matches a keyword of an upper-level classification node, then, when keyword matching is performed at the lower level, the commodity title is matched against each lower-level classification node corresponding to that matched upper-level classification node;

3. The e-commerce commodity classification method according to claim 1, wherein the path weight calculation rule of the title of the commodity to be classified is as follows:

and averaging the weights of all the keywords matched with the commodity title under each classification node in the path to obtain the average weight of the commodity title at each classification node in the path, and then carrying out weighted summation on the average weights of all the classification nodes in the path to obtain the path weight of the commodity title under the path.

4. The e-commerce commodity classification method according to claim 1, specifically comprising the steps of: