CN103294798A - Automatic merchandise classifying method on the basis of binary word segmentation and support vector machine - Google Patents

Info

Publication number
CN103294798A
CN103294798A CN2013102013228A CN201310201322A CN103294798A CN 103294798 A CN103294798 A CN 103294798A CN 2013102013228 A CN2013102013228 A CN 2013102013228A CN 201310201322 A CN201310201322 A CN 201310201322A CN 103294798 A CN103294798 A CN 103294798A
Authority
CN
China
Prior art keywords
commodity
word
classification
binary
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102013228A
Other languages
Chinese (zh)
Other versions
CN103294798B (en)
Inventor
许大伦
毛颖
张立群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lele Kaihang (Beijing) Education Technology Co., Ltd.
Original Assignee
BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310201322.8A
Publication of CN103294798A
Application granted
Publication of CN103294798B
Active
Anticipated expiration

Abstract

The invention discloses an automatic merchandise classification method based on binary word segmentation and a support vector machine. The method mainly includes: performing binary word segmentation on all merchandise titles in a training set to construct a feature word library; constructing a set of merchandise categories, representing each merchandise title as a feature vector according to the feature word library, generating training data from the feature vectors and the categories to which the merchandise belongs, and performing parameter optimization on the training data with a sequential dual method to obtain optimal classification vectors; and computing the inner products of the optimal classification vectors with the feature vector representing the title of the merchandise to be classified, and selecting the category corresponding to the maximum inner product as the category to which the merchandise belongs. The method addresses two problems in the prior art: a product feature information base is hard to construct, and the way the feature space is constructed makes automatic merchandise classification slow to train and unsatisfactory in effect.

Description

Automatic commodity classification method based on binary word segmentation and support vector machine
Technical field
The present invention relates to the field of data mining, and in particular to an automatic commodity classification method based on binary word segmentation and a support vector machine (Support Vector Machine, SVM, a learning-based classification algorithm).
Background technology
Data mining generally refers to the process of automatically searching large amounts of data for hidden information that exhibits particular relationships. Classification is an important step in data mining.
With the rapid development of electronic information technology, data mining has spread into every field. In e-commerce especially, automatic commodity classification is essential for efficiently managing massive amounts of commodity information. Many automatic commodity classification methods exist at present, such as decision tree methods based on logical rules, naive Bayes or Bayesian network methods based on statistical correlation, neural network methods based on the perceptron, k-nearest-neighbor methods based on instance learning, and support vector machine methods based on vector spaces. According to the literature, the classification accuracy of these common methods is about 80%.
In the prior art, the support vector machine method is widely used because it classifies quickly and produces accurate results.
However, the effectiveness of this method in practice depends mainly on how the feature space is constructed. If the feature space is too small, the data are not linearly separable and a non-linear kernel function must be used, which leads to long training times and unsatisfactory results.
Meanwhile, the Chinese title of a commodity contains several kinds of feature information (such as manufacturer brand, commodity name, specification and model, and price), and their correlation with the commodity category varies. In theory, treating them differently would help improve classification accuracy. However, because the amount of information is huge, building and maintaining such a product feature information base is very costly and computationally expensive, and is poorly operable in practice.
Therefore, how to solve the problems that a product feature information base is difficult to build in the prior art, and that the way the feature space is constructed makes automatic commodity classification slow to train and unsatisfactory in effect, has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic commodity classification method based on binary word segmentation and a support vector machine, so as to solve the problems that a product feature information base is difficult to build in the prior art and that the way the feature space is constructed makes automatic commodity classification slow to train and unsatisfactory in effect.
To solve the above technical problem, the present invention provides an automatic commodity classification method based on binary word segmentation and a support vector machine, characterized in that it comprises:
performing binary word segmentation on all commodity titles in a training set to construct a feature word library;
constructing a set of commodity categories, representing each commodity title as a feature vector according to the feature word library, generating training data from the feature vector and the category to which the commodity belongs, and performing parameter optimization on the training data with a sequential dual method to obtain optimal classification vectors;
computing the inner products of the optimal classification vectors with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the maximum inner product as the category to which the commodity belongs.
Preferably, performing binary word segmentation on the commodity titles to construct the feature word library further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature word library.
Preferably, the training set further comprises all commodity titles on a given e-commerce website; the feature word library further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
Preferably, representing a commodity title as a feature vector according to the feature word library further comprises: performing binary word segmentation on any commodity title in the training set and representing the occurrence counts of the resulting feature words, taken together, as an n-dimensional vector.
Preferably, computing the inner products of the optimal classification vectors with the feature vector representing the title of the commodity to be classified further comprises: performing binary word segmentation on the title of the commodity to be classified, representing the occurrence counts of the resulting feature words, taken together, as an n-dimensional vector, and computing the inner product of this n-dimensional vector with each optimal classification vector.
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine of the present invention achieves the following effects:
1) The invention performs binary word segmentation on commodity titles, which makes the feature information base much easier to build.
2) The invention uses the feature words to represent each commodity title as a feature vector in the feature space, which greatly improves the distinguishability of commodities and thereby effectively solves the problem that the way the feature space is constructed makes automatic commodity classification slow to train and unsatisfactory in effect.
Description of drawings
The accompanying drawing described here is provided for further understanding of the invention and forms part of this application. The illustrative embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawing:
Fig. 1 is a schematic flow chart of the automatic commodity classification method based on binary word segmentation and a support vector machine according to the embodiment of the invention.
Embodiment
Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will understand that hardware manufacturers may refer to the same component by different names. This specification and the claims do not distinguish components by differences in name, but by differences in function. The term "comprising" as used throughout the specification and claims is open-ended and should be interpreted as "including but not limited to". "Roughly" means that, within an acceptable error range, those skilled in the art can solve the technical problem and substantially achieve the technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.
The present invention is described in further detail below with reference to the accompanying drawing, which is not to be taken as limiting the invention.
Fig. 1 shows the flow of the automatic commodity classification method based on binary word segmentation and a support vector machine according to the embodiment of the invention.
Step 101: binary word segmentation is performed on all commodity titles in the training set to construct the feature word library.
Here, the training set may also be called the commodity title set, and contains all commodity titles on a given e-commerce website; the feature word library may also be called the feature information base, and contains the feature words, obtained through binary word segmentation, that reflect commodity information.
Further, performing binary word segmentation on the commodity titles to construct the feature information base is specifically: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature word library.
Further, step 101 is specifically:
First, suppose a commodity title is L, of the form $C_1 C_2 C_3 \dots C_{k-1} C_k$, where each $C_i$ is a Chinese character or an English word and $k$ is the length of the title;
Next, binary word segmentation is performed on title L to obtain the word set $\{C_1C_2, C_2C_3, \dots, C_{k-1}C_k\}$; in this set, each $C_iC_{i+1}$ is regarded as one word, denoted W;
Next, all commodity titles in the training set are traversed and the number of occurrences $\mathrm{Count}(W)$ of each word W is counted;
Then, a threshold $C_T$ is set; if $\mathrm{Count}(W) \ge C_T$ (that is, the number of occurrences of word W is not less than the set threshold $C_T$), W is a feature word;
All feature words W thus obtained constitute the feature word library $\{W_1, W_2, \dots, W_n\}$.
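The procedure of step 101 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function names, the regular expression used to split a title into units $C_i$ (one CJK character, or one run of ASCII letters and digits), and the example titles are all assumptions introduced here.

```python
import re
from collections import Counter

# Assumption: a unit C_i is one CJK character or one run of ASCII letters/digits.
_UNIT = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def units(title):
    """Split a title into its units C_1 ... C_k (other characters are skipped)."""
    return _UNIT.findall(title)


def bigrams(title):
    """Binary word segmentation: every adjacent pair (C_i, C_{i+1}) is one word W."""
    c = units(title)
    return [(c[i], c[i + 1]) for i in range(len(c) - 1)]


def build_feature_library(titles, threshold):
    """Count every word W over all titles and keep those with Count(W) >= threshold."""
    counts = Counter()
    for title in titles:
        counts.update(bigrams(title))
    # Sorting gives each feature word a stable index W_1 ... W_n.
    return sorted(w for w, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical training titles, for illustration only.
    titles = ["红色连衣裙", "红色运动鞋", "Nike 运动鞋"]
    for word in build_feature_library(titles, threshold=2):
        print("".join(word))
```

Words are kept as pairs rather than concatenated strings so that two English units cannot be confused with one longer unit; the threshold plays the role of $C_T$.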
Step 102: a set of commodity categories is constructed; each commodity title is represented as a feature vector according to the feature word library; training data are generated from the feature vector and the category to which the commodity belongs; and parameter optimization is performed on the training data with a sequential dual method to obtain the optimal classification vectors.
Further, representing a commodity title as a feature vector according to the feature word library is specifically: performing binary word segmentation on any commodity title $L_i$ in the training set and representing the occurrence counts of the resulting feature words W, taken together, as an n-dimensional vector.
Further, step 102 is specifically:
All commodity categories are numbered (specific categories may be, for example, clothes, trousers, shoes, food, or daily necessities). Let $m$ be the total number of categories; the category set can then be expressed as $\{Y_1, Y_2, \dots, Y_m\}$;
Any commodity title $L_i$ in the training set is expressed as an n-dimensional vector $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,n})$, where $x_{i,j}$ is the number of times feature word $W_j$ is obtained after binary word segmentation of $L_i$;
The category $Y_i$ to which the commodity belongs is looked up, $Y_i \in \{1, 2, \dots, m\}$, giving the training data $\{X_i, Y_i\}$;
Sequential dual method optimization is performed on the training data $\{X_i, Y_i\}$ to obtain the optimal classification vectors $V_k$, where $V_k = (V_{k,1}, V_{k,2}, \dots, V_{k,n})$, $k = 1, 2, \dots, m$.
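The vector representation and the shape of the training output in step 102 can be sketched as follows. The patent does not spell out the sequential dual method, so the trainer below is a plainly different stand-in (a multiclass perceptron) used only to show what goes in and what comes out: training pairs $(X_i, Y_i)$ in, one weight vector $V_k$ per category out. All names are hypothetical, titles are assumed to consist of single-character units, and indices are 0-based rather than 1-based.

```python
from collections import Counter


def bigrams(title):
    """Adjacent character pairs; assumes every non-space character is one unit."""
    chars = [ch for ch in title if not ch.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def to_vector(title, library):
    """Represent a title as a sparse vector {j: x_j}, where x_j counts feature word W_j."""
    index = {w: j for j, w in enumerate(library)}
    counts = Counter(bigrams(title))
    return {index[w]: n for w, n in counts.items() if w in index}


def train_perceptron(data, n, m, epochs=10):
    """Stand-in trainer (multiclass perceptron), NOT the sequential dual method.

    data: list of (X, y) with X a sparse dict and y a category index in 0..m-1.
    Returns m weight vectors V_k, each of length n.
    """
    V = [[0.0] * n for _ in range(m)]
    for _ in range(epochs):
        for X, y in data:
            scores = [sum(V[k][j] * x for j, x in X.items()) for k in range(m)]
            pred = max(range(m), key=scores.__getitem__)
            if pred != y:
                # Move the true category towards X and the wrong one away from it.
                for j, x in X.items():
                    V[y][j] += x
                    V[pred][j] -= x
    return V


if __name__ == "__main__":
    # Hypothetical feature word library and labelled titles.
    library = ["红色", "运动", "动鞋", "衣裙"]
    data = [(to_vector("红色连衣裙", library), 0),
            (to_vector("红色运动鞋", library), 1)]
    print(train_perceptron(data, n=len(library), m=2))
```

Storing each $X_i$ as a dictionary of non-zero entries keeps memory proportional to the title length rather than to $n$, which matters when the feature word library is large.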
Step 103: the inner products of the optimal classification vectors with the feature vector representing the title of the commodity to be classified are computed, and the category corresponding to the maximum inner product is selected as the category to which the commodity belongs.
Further, binary word segmentation is performed on the title L of the commodity to be classified, the occurrence counts of the resulting feature words W, taken together, are represented as an n-dimensional vector X, the inner product of this n-dimensional vector X with each optimal classification vector is computed, and the category with the largest inner product is taken as the category to which the commodity belongs.
Further, step 103 is specifically:
The title L of the commodity to be classified is expressed as an n-dimensional vector $X = (x_1, x_2, \dots, x_n)$, where $x_i$ is the number of times feature word $W_i$ is obtained after binary word segmentation of L;
The inner product of X with every optimal classification vector is computed:
$$S_k = \sum_{i=1}^{n} V_{k,i}\, x_i, \quad k = 1, 2, \dots, m$$
The category with the largest inner product is taken as the predicted category; that is, if
$$S_{k^*} = \max\{S_1, S_2, \dots, S_m\}$$
then the commodity belongs to category $Y_{k^*}$.
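The decision rule of step 103 can be illustrated with a short Python sketch. The weight vectors and the input vector below are made-up numbers, the function names are hypothetical, and indices are 0-based; the sketch only shows the inner products $S_k$ and the selection of the maximum.

```python
def inner_product(V_k, X):
    """S_k = sum_i V_{k,i} * x_i, with X stored sparsely as {i: x_i}."""
    return sum(V_k[i] * x for i, x in X.items())


def classify(V, X):
    """Return (k_star, scores): the index of the largest inner product and all S_k."""
    scores = [inner_product(V_k, X) for V_k in V]
    k_star = max(range(len(scores)), key=scores.__getitem__)
    return k_star, scores


if __name__ == "__main__":
    # Two illustrative classification vectors over n = 3 feature words.
    V = [[1.0, 0.0, -1.0],
         [-0.5, 2.0, 0.5]]
    X = {0: 2, 2: 1}  # feature word 0 occurs twice, feature word 2 once
    k_star, scores = classify(V, X)
    print(k_star, scores)
```

Because the classifier is linear, prediction costs one sparse inner product per category, so it scales with the number of non-zero entries in X rather than with $n$.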
The above classification method performs binary word segmentation on commodity titles, discards rare words whose frequency of occurrence is below a certain threshold, and constructs the feature word library, which contains about 70,000 feature words. Each commodity title is represented, according to the counts of the feature words it contains, as a sparse vector in a high-dimensional feature space. This feature extraction and representation method is simple to operate and also makes commodities of different categories well distinguishable. Training the support vector machine with a linear kernel function gave good classification results: using all commodities of Jingdong (JD.com), with half for training and half for testing, the accuracy was 94%.
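The half-and-half evaluation protocol described above can be sketched as follows. The source does not say how the two halves were chosen, so the seeded random split here is an assumption, as are the function names; the sketch does not reproduce the reported accuracy and uses no real data.

```python
import random


def split_half(samples, seed=0):
    """Shuffle a copy of the samples and cut it into two halves (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def accuracy(predict, test_set):
    """Fraction of (X, y) pairs in test_set for which predict(X) == y."""
    if not test_set:
        return 0.0
    correct = sum(1 for X, y in test_set if predict(X) == y)
    return correct / len(test_set)


if __name__ == "__main__":
    # Placeholder samples: X is an integer and the label is its parity.
    samples = [(i, i % 2) for i in range(10)]
    train, test = split_half(samples)
    print(len(train), len(test), accuracy(lambda X: X % 2, test))
```

In a real run, `predict` would wrap the trained classification vectors, and the test half must not be used when building the feature word library or training.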
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine of the present invention achieves the following effects:
1) The invention performs binary word segmentation on commodity titles, which makes the feature information base much easier to build.
2) The invention uses the feature words to represent each commodity title as a feature vector in the feature space, which greatly improves the distinguishability of commodities and thereby effectively solves the problem that the way the feature space is constructed makes automatic commodity classification slow to train and unsatisfactory in effect.
The above description illustrates and describes several preferred embodiments of the invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be changed, within the scope of the inventive concept described herein, through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims (5)

1. An automatic commodity classification method based on binary word segmentation and a support vector machine, characterized in that it comprises:
performing binary word segmentation on all commodity titles in a training set to construct a feature word library;
constructing a set of commodity categories, representing each commodity title as a feature vector according to the feature word library, generating training data from the feature vector and the category to which the commodity belongs, and performing parameter optimization on the training data with a sequential dual method to obtain optimal classification vectors;
computing the inner products of the optimal classification vectors with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the maximum inner product as the category to which the commodity belongs.
2. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that performing binary word segmentation on the commodity titles to construct the feature word library further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature word library.
3. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 2, characterized in that the training set further comprises all commodity titles on a given e-commerce website; and the feature word library further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
4. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that representing a commodity title as a feature vector according to the feature word library further comprises: performing binary word segmentation on any commodity title in the training set and representing the occurrence counts of the resulting feature words, taken together, as an n-dimensional vector.
5. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that computing the inner products of the optimal classification vectors with the feature vector representing the title of the commodity to be classified further comprises: performing binary word segmentation on the title of the commodity to be classified, representing the occurrence counts of the resulting feature words, taken together, as an n-dimensional vector, and computing the inner product of this n-dimensional vector with each optimal classification vector.
CN201310201322.8A 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine Active CN103294798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310201322.8A CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310201322.8A CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Publications (2)

Publication Number Publication Date
CN103294798A (en) 2013-09-11
CN103294798B (en) 2016-08-31

Family

ID=49095660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310201322.8A Active CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Country Status (1)

Country Link
CN (1) CN103294798B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605815A (en) * 2013-12-11 2014-02-26 焦点科技股份有限公司 Automatic commodity information classifying and recommending method applicable to B2B (Business to Business) e-commerce platform
CN103778205A (en) * 2014-01-13 2014-05-07 北京奇虎科技有限公司 Commodity classifying method and system based on mutual information
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104268134A (en) * 2014-09-28 2015-01-07 苏州大学 Subjective and objective classifier building method and system
CN110245800A (en) * 2019-06-19 2019-09-17 南京大学金陵学院 A method of based on superior vector spatial model goods made to order information class indication
CN110334306A (en) * 2019-06-21 2019-10-15 无线生活(北京)信息技术有限公司 Label processing method and device
WO2019205319A1 (en) * 2018-04-25 2019-10-31 平安科技(深圳)有限公司 Commodity information format processing method and apparatus, and computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102193936A (en) * 2010-03-09 2011-09-21 阿里巴巴集团控股有限公司 Data classification method and device
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605815A (en) * 2013-12-11 2014-02-26 焦点科技股份有限公司 Automatic commodity information classifying and recommending method applicable to B2B (Business to Business) e-commerce platform
CN103605815B (en) * 2013-12-11 2016-08-31 焦点科技股份有限公司 A kind of merchandise news being applicable to B2B E-commerce platform is classified recommendation method automatically
CN103778205A (en) * 2014-01-13 2014-05-07 北京奇虎科技有限公司 Commodity classifying method and system based on mutual information
CN103778205B (en) * 2014-01-13 2018-07-06 北京奇虎科技有限公司 A kind of commodity classification method and system based on mutual information
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104268134A (en) * 2014-09-28 2015-01-07 苏州大学 Subjective and objective classifier building method and system
WO2019205319A1 (en) * 2018-04-25 2019-10-31 平安科技(深圳)有限公司 Commodity information format processing method and apparatus, and computer device and storage medium
CN110245800A (en) * 2019-06-19 2019-09-17 南京大学金陵学院 A method of based on superior vector spatial model goods made to order information class indication
CN110334306A (en) * 2019-06-21 2019-10-15 无线生活(北京)信息技术有限公司 Label processing method and device

Also Published As

Publication number Publication date
CN103294798B (en) 2016-08-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190925

Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1

Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.

Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing

Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.