CN103294798A - Automatic merchandise classifying method on the basis of binary word segmentation and support vector machine - Google Patents
Automatic merchandise classifying method on the basis of binary word segmentation and support vector machine
- Publication number
- CN103294798A (application CN201310201322.8A)
- Authority
- CN
- China
- Prior art keywords
- commodity
- word
- classification
- binary
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses an automatic merchandise classifying method on the basis of binary word segmentation and a support vector machine. The method mainly includes: subjecting all merchandise titles in a training set to binary word segmentation processing to construct a feature word library; constructing merchandise classification sets, expressing the merchandise titles as specific vectors according to the feature word library, generating training data by the aid of the specific vectors and the merchandise classification sets, and performing parameter optimization on the training data by a sequential dual method to obtain optimal classification vectors; calculating inner products of the optimal classification vectors and the specific vectors expressed by titles of merchandises to be classified, and selecting the classification corresponding to the maximum inner product as classification which the merchandises belong to. The automatic merchandise classifying method solves the problems that a product feature information base is hard to construct, and an automatic merchandise classifying method is long in training time and unsatisfactory in effect due to a feature space construction in the prior art.
Description
Technical field
The present invention relates to the field of data mining, and specifically to an automatic commodity classification method based on binary word segmentation and a support vector machine (SVM, a type of automatically learning classification algorithm).
Background technology
Data mining generally refers to the automated process of searching large amounts of data for hidden information that exhibits particular relationships. Classification is an important step in data mining.
With the rapid development of electronic information technology, data mining has penetrated every field. In e-commerce in particular, automatic commodity classification is essential for efficiently managing massive amounts of commodity information. Many automatic commodity classification methods exist at present, such as decision tree methods based on logical rules, naive Bayes or Bayesian network methods based on statistical correlation, neural network methods based on the perceptron, the k-nearest-neighbor method based on instance-based learning, and support vector machine methods based on vector spaces. According to the literature, the classification accuracy of these common methods is about 80%.
In the prior art, the support vector machine method is widely used because it classifies quickly and gives accurate results.
In practice, however, its effectiveness depends mainly on how the feature space is constructed. If the feature space is too small, the data are not linearly separable and a nonlinear kernel function must be used, which leads to long training times and unsatisfactory results.
Meanwhile, the Chinese title of a commodity contains several kinds of feature information (such as manufacturer brand, product name, specification and model, and price) whose correlation with the commodity's class varies. In theory, treating these kinds of information differently would help improve classification accuracy. Because the amount of information is huge, however, building and maintaining such a product feature information base is very costly and computationally expensive, and is impractical.
Therefore, how to overcome the difficulty of building a product feature information base in the prior art, and the long training time and unsatisfactory results that the construction of the feature space causes in automatic commodity classification methods, has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic commodity classification method based on binary word segmentation and a support vector machine, so as to overcome the difficulty of building a product feature information base in the prior art and the long training time and unsatisfactory results caused by the construction of the feature space.
To solve the above technical problem, the invention provides an automatic commodity classification method based on binary word segmentation and a support vector machine, characterized by comprising:
performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
constructing a set of commodity classes, expressing each commodity title as a specific vector according to the feature dictionary, generating training data from this specific vector and the class to which the commodity belongs, and performing parameter optimization on the training data with a sequential dual method to obtain optimal classification vectors;
computing the inner products of the optimal classification vectors with the specific vector representing the title of a commodity to be classified, and selecting the class corresponding to the largest inner product as the class of the commodity.
Preferably, performing binary word segmentation on the commodity titles to construct the feature dictionary further comprises: segmenting all commodity titles in the training set into binary words, counting word frequencies, and selecting the higher-frequency words to construct the feature dictionary.
Preferably, the training set comprises all commodity titles on a given e-commerce website, and the feature dictionary comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
Preferably, expressing a commodity title as a specific vector according to the feature dictionary further comprises: segmenting any commodity title in the training set into binary words and combining the occurrence counts of the resulting feature words into an n-dimensional vector.
Preferably, computing the inner products of the optimal classification vectors with the specific vector representing the title of the commodity to be classified further comprises: segmenting the title of the commodity to be classified into binary words, combining the occurrence counts of the resulting feature words into an n-dimensional vector, and computing the inner product of this n-dimensional vector with each optimal classification vector.
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine according to the invention achieves the following effects:
1) The invention applies binary word segmentation to commodity titles, which makes the feature information base much easier to build.
2) The invention uses feature words to express each commodity title as a specific vector in the feature space, which greatly improves the separability of commodities and thereby effectively solves the problem of long training time and unsatisfactory results caused by the construction of the feature space.
Description of drawings
The accompanying drawing described here provides a further understanding of the invention and forms part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawing:
Fig. 1 is a schematic flow diagram of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
Embodiment
Certain terms are used in the specification and claims to refer to specific components. Those skilled in the art will understand that hardware manufacturers may refer to the same component by different names. This specification and the claims distinguish components by differences in function, not by differences in name. "Comprising", as used throughout the specification and claims, is an open-ended term and should be interpreted as "comprising but not limited to". "Roughly" means that, within an acceptable error range, those skilled in the art can solve the technical problem and substantially achieve the technical effect. In addition, the term "couple" here covers any direct or indirect means of electrical coupling. Therefore, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.
The invention is described in further detail below with reference to the accompanying drawing, which does not limit the invention.
Fig. 1 shows the flow of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
The training set may also be called the commodity title set; it contains all commodity titles on a given e-commerce website. The feature dictionary may also be called the feature information base; it contains the feature words, obtained through binary word segmentation, that reflect commodity information.
Further, performing binary word segmentation on the commodity titles to construct the feature information base is specifically: segmenting all commodity titles in the training set into binary words, counting word frequencies, and selecting the higher-frequency words to construct the feature dictionary.
Further, step 101 is specifically as follows:
First, let a commodity title be $L$, composed of $C_1 C_2 C_3 \ldots C_{k-1} C_k$, where each $C_i$ is one Chinese character or one English word and $k$ is the length of the title;
Next, apply binary word segmentation to title $L$ to obtain the word set $\{C_1C_2, C_2C_3, \ldots, C_{k-1}C_k\}$; each $C_iC_{i+1}$ in this set is treated as one word, denoted $W$;
Next, traverse all commodity titles in the training set and count the number of occurrences $\mathrm{Count}(W)$ of each word $W$;
Then, set a threshold $C_T$; if $\mathrm{Count}(W) \ge C_T$ (that is, word $W$ occurs at least $C_T$ times), $W$ is a feature word;
All feature words $W$ obtained in this way form the feature dictionary $\{W_1, W_2, \ldots, W_n\}$.
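The feature-dictionary construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the tokenizer `units`, the regular expression it uses to split a title into Chinese characters and English words, and the function names are all assumptions introduced here. Each two-unit word is kept as a tuple so that English words of different lengths cannot collide after concatenation.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

Bigram = Tuple[str, str]

# Assumed tokenizer: one unit C_i per CJK character or per run of ASCII letters/digits.
_UNIT = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def units(title: str) -> List[str]:
    """Split a title into the units C_1 ... C_k."""
    return _UNIT.findall(title)


def bigrams(title: str) -> List[Bigram]:
    """Binary word segmentation: every adjacent pair (C_i, C_{i+1})."""
    c = units(title)
    return list(zip(c, c[1:]))


def build_feature_dictionary(titles: Iterable[str], threshold: int) -> List[Bigram]:
    """Keep every word W with Count(W) >= threshold, in a deterministic order."""
    counts = Counter(w for title in titles for w in bigrams(title))
    return sorted(w for w, n in counts.items() if n >= threshold)
```

For example, `build_feature_dictionary(["红色连衣裙", "红色T恤", "蓝色连衣裙"], threshold=2)` keeps only the two-unit words that occur in at least two places across the titles.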
Further, expressing a commodity title as a specific vector according to the feature dictionary is specifically: any commodity title $L_i$ in the training set is segmented into binary words, and the occurrence counts of the resulting feature words $W$ are combined into an n-dimensional vector.
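As a rough sketch of this vector representation, the function below maps a title to a sparse count vector. The names `to_count_vector` and `feature_index` are hypothetical, and the title is assumed to be already split into its units (Chinese characters or English words); a dictionary from feature index to count is used because most of the n entries are zero.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Bigram = Tuple[str, str]


def to_count_vector(title_units: Sequence[str],
                    feature_index: Dict[Bigram, int]) -> Dict[int, int]:
    """Sparse vector {j: x_j}, where x_j counts feature word W_j in the title.

    feature_index maps each feature word to its position j in the
    feature dictionary; two-unit words outside the dictionary are ignored.
    """
    counts = Counter(zip(title_units, title_units[1:]))
    return {feature_index[w]: n for w, n in counts.items() if w in feature_index}
```

A dense n-dimensional vector, if needed, can be recovered with `[vec.get(j, 0) for j in range(n)]`.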
Further, step 102 is specifically as follows:
Number all commodity classes (concrete classes may be, for example, clothes, trousers, shoes, food, or daily necessities); letting $m$ be the total number of classes, the class set can be expressed as $\{Y_1, Y_2, \ldots, Y_m\}$;
Express any commodity title $L_i$ in the training set as an n-dimensional vector $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$, where $x_{i,j}$ is the number of times feature word $W_j$ is obtained when $L_i$ is segmented into binary words;
Look up the class $Y_i$ to which the commodity belongs, $Y_i \in \{1, 2, \ldots, m\}$, to obtain the training data $\{X_i, Y_i\}$;
Optimize over the training data $\{X_i, Y_i\}$ with the sequential dual method to obtain the optimal classification vectors $V_k$, where $V_k = (V_{k,1}, V_{k,2}, \ldots, V_{k,n})$ and $k = 1, 2, \ldots, m$.
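The text names a sequential dual method but does not spell out its update rule, so the sketch below substitutes a related, well-known technique: dual coordinate descent for a linear SVM with hinge loss, trained one-vs-rest to give one vector per class. That substitution, the absence of a bias term, the parameter `C`, the epoch count, and all function names are assumptions made for illustration; classes are numbered from 0 here.

```python
import random
from typing import Dict, List, Sequence

SparseVec = Dict[int, float]


def dot(w: Sequence[float], x: SparseVec) -> float:
    """Inner product of a dense weight vector with a sparse vector."""
    return sum(w[j] * v for j, v in x.items())


def train_binary(xs: Sequence[SparseVec], ys: Sequence[int], n: int,
                 C: float = 1.0, epochs: int = 50, seed: int = 0) -> List[float]:
    """Dual coordinate descent for a hinge-loss linear SVM; ys are +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * n
    alpha = [0.0] * len(xs)                            # dual variables, 0 <= alpha_i <= C
    q = [sum(v * v for v in x.values()) for x in xs]   # squared norms x_i . x_i
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            if q[i] == 0.0:
                continue                               # no feature words in this title
            g = ys[i] * dot(w, xs[i]) - 1.0            # dual gradient for alpha_i
            new_alpha = min(max(alpha[i] - g / q[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j, v in xs[i].items():
                    w[j] += delta * ys[i] * v          # keep w = sum_i alpha_i y_i x_i
                alpha[i] = new_alpha
    return w


def train_one_vs_rest(xs: Sequence[SparseVec], labels: Sequence[int],
                      n: int, m: int, **kw) -> List[List[float]]:
    """One classification vector V_k per class k = 0 .. m-1."""
    return [train_binary(xs, [1 if y == k else -1 for y in labels], n, **kw)
            for k in range(m)]
```

Updating one dual variable at a time keeps each step proportional to the number of nonzero entries in a single title vector, which suits sparse, high-dimensional count vectors.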
Further, the title L of the commodity to be classified is segmented into binary words, the occurrence counts of the resulting feature words W are combined into an n-dimensional vector X, the inner product of X with each optimal classification vector is computed, and the class with the largest inner product is taken as the class of the commodity.
Further, step 103 is specifically as follows:
Express the title $L$ of the commodity to be classified as an n-dimensional vector $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the number of times feature word $W_i$ is obtained when $L$ is segmented into binary words;
Compute the inner product of $X$ with every optimal classification vector: $V_k \cdot X = \sum_{j=1}^{n} V_{k,j}\, x_j$ for each class $k$;
Take the class with the largest inner product as the predicted class; that is, if $V_k \cdot X = \max_{l} V_l \cdot X$, then the commodity belongs to class $Y_k$.
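The prediction rule amounts to an argmax over inner products. A minimal sketch, assuming the title vector is sparse (a mapping from feature index to count) and the classification vectors are dense lists; the function name `classify`, the 0-based class index it returns, and the tie-breaking toward the lowest index are choices made here, not stated in the text.

```python
from typing import Dict, Sequence


def classify(x: Dict[int, float], class_vectors: Sequence[Sequence[float]]) -> int:
    """Return the index k whose vector V_k has the largest inner product with x."""
    scores = [sum(v_k[j] * x_j for j, x_j in x.items()) for v_k in class_vectors]
    # max() returns the first maximal index, so ties go to the lowest k.
    return max(range(len(scores)), key=scores.__getitem__)
```

Because only the nonzero entries of the title vector are visited, the cost per class is proportional to the number of feature words in the title, not to n.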
The classification method above applies binary word segmentation to commodity titles, discards rare words whose frequency falls below a threshold, and constructs the feature dictionary, which contains about 70,000 feature words. Each commodity title is represented, according to the counts of the feature words it contains, as a sparse vector in a high-dimensional feature space. This way of extracting and representing commodity features is simple to carry out and also makes commodities of different classes well separable. With a linear kernel function, the support vector machine was trained and gave good classification results: using all commodities on Jingdong, with half for training and half for testing, the accuracy was 94%.
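The half-and-half evaluation can be sketched as follows. How the two halves were chosen is not stated, so the seeded random shuffle is an assumption, as are the helper names `split_half` and `accuracy`.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_half(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the items and split it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def accuracy(predict: Callable[[T], int],
             samples: Sequence[T], labels: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for s, y in zip(samples, labels) if predict(s) == y)
    return correct / len(labels)
```

Here `predict` would be any function that maps a title vector to a class index, trained on the first half and scored on the second.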
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine according to the invention achieves the following effects:
1) The invention applies binary word segmentation to commodity titles, which makes the feature information base much easier to build.
2) The invention uses feature words to express each commodity title as a specific vector in the feature space, which greatly improves the separability of commodities and thereby effectively solves the problem of long training time and unsatisfactory results caused by the construction of the feature space.
The description above illustrates and describes some preferred embodiments of the invention. As stated earlier, however, the invention is not limited to the forms disclosed here, and these embodiments should not be regarded as excluding others. The invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described here, using the above teachings or the skill or knowledge of the relevant field. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention all fall within the scope of protection of the appended claims.
Claims (5)
1. An automatic commodity classification method based on binary word segmentation and a support vector machine, characterized by comprising:
performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
constructing a set of commodity classes, expressing each commodity title as a specific vector according to the feature dictionary, generating training data from this specific vector and the class to which the commodity belongs, and performing parameter optimization on the training data with a sequential dual method to obtain optimal classification vectors;
computing the inner products of the optimal classification vectors with the specific vector representing the title of a commodity to be classified, and selecting the class corresponding to the largest inner product as the class of the commodity.
2. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that performing binary word segmentation on the commodity titles to construct the feature dictionary further comprises: segmenting all commodity titles in the training set into binary words, counting word frequencies, and selecting the higher-frequency words to construct the feature dictionary.
3. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 2, characterized in that the training set comprises all commodity titles on a given e-commerce website, and the feature dictionary comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
4. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that expressing a commodity title as a specific vector according to the feature dictionary further comprises: segmenting any commodity title in the training set into binary words and combining the occurrence counts of the resulting feature words into an n-dimensional vector.
5. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that computing the inner products of the optimal classification vectors with the specific vector representing the title of the commodity to be classified further comprises: segmenting the title of the commodity to be classified into binary words, combining the occurrence counts of the resulting feature words into an n-dimensional vector, and computing the inner product of this n-dimensional vector with each optimal classification vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310201322.8A CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310201322.8A CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103294798A (en) | 2013-09-11 |
CN103294798B (en) | 2016-08-31 |
Family
ID=49095660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310201322.8A Active CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103294798B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605815A (en) * | 2013-12-11 | 2014-02-26 | 焦点科技股份有限公司 | Automatic commodity information classifying and recommending method applicable to B2B (Business to Business) e-commerce platform |
CN103778205A (en) * | 2014-01-13 | 2014-05-07 | 北京奇虎科技有限公司 | Commodity classifying method and system based on mutual information |
CN104063428A (en) * | 2014-06-09 | 2014-09-24 | 国家计算机网络与信息安全管理中心 | Method for detecting unexpected hot topics in Chinese microblogs |
CN104268134A (en) * | 2014-09-28 | 2015-01-07 | 苏州大学 | Subjective and objective classifier building method and system |
CN110245800A (en) * | 2019-06-19 | 2019-09-17 | 南京大学金陵学院 | A method of based on superior vector spatial model goods made to order information class indication |
CN110334306A (en) * | 2019-06-21 | 2019-10-15 | 无线生活(北京)信息技术有限公司 | Label processing method and device |
WO2019205319A1 (en) * | 2018-04-25 | 2019-10-31 | 平安科技(深圳)有限公司 | Commodity information format processing method and apparatus, and computer device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN102193936A (en) * | 2010-03-09 | 2011-09-21 | 阿里巴巴集团控股有限公司 | Data classification method and device |
CN102289522A (en) * | 2011-09-19 | 2011-12-21 | 北京金和软件股份有限公司 | Method of intelligently classifying texts |
-
2013
- 2013-05-27 CN CN201310201322.8A patent/CN103294798B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193936A (en) * | 2010-03-09 | 2011-09-21 | 阿里巴巴集团控股有限公司 | Data classification method and device |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN102289522A (en) * | 2011-09-19 | 2011-12-21 | 北京金和软件股份有限公司 | Method of intelligently classifying texts |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605815A (en) * | 2013-12-11 | 2014-02-26 | 焦点科技股份有限公司 | Automatic commodity information classifying and recommending method applicable to B2B (Business to Business) e-commerce platform |
CN103605815B (en) * | 2013-12-11 | 2016-08-31 | 焦点科技股份有限公司 | A kind of merchandise news being applicable to B2B E-commerce platform is classified recommendation method automatically |
CN103778205A (en) * | 2014-01-13 | 2014-05-07 | 北京奇虎科技有限公司 | Commodity classifying method and system based on mutual information |
CN103778205B (en) * | 2014-01-13 | 2018-07-06 | 北京奇虎科技有限公司 | A kind of commodity classification method and system based on mutual information |
CN104063428A (en) * | 2014-06-09 | 2014-09-24 | 国家计算机网络与信息安全管理中心 | Method for detecting unexpected hot topics in Chinese microblogs |
CN104268134A (en) * | 2014-09-28 | 2015-01-07 | 苏州大学 | Subjective and objective classifier building method and system |
WO2019205319A1 (en) * | 2018-04-25 | 2019-10-31 | 平安科技(深圳)有限公司 | Commodity information format processing method and apparatus, and computer device and storage medium |
CN110245800A (en) * | 2019-06-19 | 2019-09-17 | 南京大学金陵学院 | A method of based on superior vector spatial model goods made to order information class indication |
CN110334306A (en) * | 2019-06-21 | 2019-10-15 | 无线生活(北京)信息技术有限公司 | Label processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN103294798B (en) | 2016-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rathi et al. | Sentiment analysis of tweets using machine learning approach | |
CN103294798A (en) | Automatic merchandise classifying method on the basis of binary word segmentation and support vector machine | |
Kong et al. | Fake news detection using deep learning | |
US11100283B2 (en) | Method for detecting deceptive e-commerce reviews based on sentiment-topic joint probability | |
Gokulakrishnan et al. | Opinion mining and sentiment analysis on a twitter data stream | |
Shahana et al. | Evaluation of features on sentimental analysis | |
AU2021269302B2 (en) | System and method for coupled detection of syntax and semantics for natural language understanding and generation | |
CN107862046B (en) | A kind of tax commodity code classification method and system based on short text similarity | |
CN111897970A (en) | Text comparison method, device and equipment based on knowledge graph and storage medium | |
Patra et al. | A survey report on text classification with different term weighing methods and comparison between classification algorithms | |
Bayot et al. | Multilingual author profiling using word embedding averages and svms | |
US20180293294A1 (en) | Similar Term Aggregation Method and Apparatus | |
KR20160121382A (en) | Text mining system and tool | |
CN103778205A (en) | Commodity classifying method and system based on mutual information | |
KR20150037924A (en) | Information classification based on product recognition | |
US11157540B2 (en) | Search space reduction for knowledge graph querying and interactions | |
CN109408802A (en) | A kind of method, system and storage medium promoting sentence vector semanteme | |
CN106681985A (en) | Establishment system of multi-field dictionaries based on theme automatic matching | |
US9256669B2 (en) | Stochastic document clustering using rare features | |
CN112579729A (en) | Training method and device for document quality evaluation model, electronic equipment and medium | |
Rani et al. | Study and comparision of vectorization techniques used in text classification | |
US20120076416A1 (en) | Determining correlations between slow stream and fast stream information | |
CN106204053A (en) | The misplaced recognition methods of categories of information and device | |
CN110874408B (en) | Model training method, text recognition device and computing equipment | |
Diwakar et al. | Proposed machine learning classifier algorithm for sentiment analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2019-09-25; Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1; Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.; Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing; Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.