CN103646343A - Text based commodity classification treatment method and system - Google Patents
- Publication number
- CN103646343A
- Authority
- CN
- China
- Prior art keywords
- classification
- commodity
- data
- commodity data
- sorter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text-based commodity classification processing method and system. The method comprises: extracting commodity data in text format from a commodity database; grouping the commodity data and performing feature extraction; constructing a classifier from a training file according to a probability model; and classifying the commodity data. Because a machine learning approach is adopted, the method and system classify commodity data automatically and at high speed, and are suitable for commodity classification on high-volume e-commerce websites.
Description
Technical field
The present invention relates to the field of Internet e-commerce, and in particular to a computer-implemented text-based commodity classification processing method and a system therefor.
Background art
The Internet is developing rapidly, and more and more commodities are sold through e-commerce. Dividing the commodities on sale into categories quickly and accurately is therefore becoming increasingly important. Commodity classification is the process of selecting suitable classification criteria for commodities, for the convenience of consumers, and systematically dividing them level by level into major classes, medium classes and minor classes, down to variety, style, specification and so on. Accurate classification helps an e-commerce enterprise manage its commodities in a more orderly way. For an e-commerce website specifically, commodity classification is the process of assigning newly listed products to one of the existing categories.
Many current e-commerce websites still rely on web editors or the sellers themselves to classify commodities manually. This approach has the following drawbacks: 1. Classifying a large number of commodities consumes too much labor. 2. As the website grows, slow manual category selection prevents many commodities from being uploaded to the website in time, so trading opportunities are missed. 3. Different people understand the features of a commodity differently, so manual classification produces inconsistent results.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a text-based commodity classification processing method and a system built on that method. It further proposes classifying the existing commodity data of an e-commerce website with a word segmentation system and the naive Bayes classification algorithm, and then optimizing the classification system with the help of manual proofreading. Because the method and system use machine learning, classification is fast and suitable for commodity classification on high-volume e-commerce websites.
The present invention adopts the following technical solution: extract commodity data in text format from a commodity database, group the commodity data and perform feature extraction, construct a classifier from a training file according to a probability model, and classify the commodity data.
Preferably, the commodity classification data comprises: category information and commodity information.
The category information comprises: category ID, category name, and parent category ID.
The commodity information comprises: commodity ID, commodity description, and the ID of the category to which the commodity belongs.
Preferably, the method further comprises the following step: before the commodity data is grouped, performing a data check on the commodity data.
Preferably, performing feature extraction on the commodity data comprises: performing machine word segmentation on the commodity description to form a to-be-classified table containing the valid words.
Preferably, grouping the commodity data comprises randomly dividing the commodity data, in proportion, into a training file and a test file.
Preferably, the probability model is the naive Bayes transform, calculated as follows:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}$$

where C denotes the category set and F_i denotes a term in the commodity description; P(C | F_1, ..., F_n) is the posterior probability; P(F_1, ..., F_n | C) is the likelihood function; and P(F_1, ..., F_n) is the evidence. Under the conditional independence assumption, that is, treating the F_i as conditionally independent given the category, the following probability model is obtained:

$$P(C = c \mid F_1 = f_1, \ldots, F_n = f_n) \propto P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)$$

Here P(C = c) is the ratio of the number of commodities in category c to the total number of commodities in the training file, and P(F_i = f_i | C = c) is the ratio of the number of occurrences of keyword F_i in category c to the total number of keywords in category c in the training file. The most probable category is the one for which the classifier's probability value is largest.
Preferably, the method further comprises performing error correction and updating on the commodity classification results and periodically retraining the classifier.
The invention provides a text-based commodity classification processing system, characterized by comprising: a data extraction module for extracting commodity data in text format; a grouping module for grouping the commodity data; a feature extraction module for performing feature extraction on the commodity data; and a classifier that uses a training file to classify the commodity data according to a probability model.
Preferably, the commodity classification data comprises: category information and commodity information.
The category information comprises: category ID, category name, and parent category ID.
The commodity information comprises: commodity ID, commodity description, and the ID of the category to which the commodity belongs.
Preferably, the above commodity classification processing system further comprises a data preprocessing module that performs a data check on the commodity data before the commodity data is grouped.
Preferably, in the above commodity classification processing system, performing feature extraction on the commodity data comprises: performing machine word segmentation on the commodity description to form a to-be-classified table containing the valid words.
Preferably, in the above commodity classification processing system, grouping the commodity data comprises randomly dividing the commodity data, in proportion, into a training file and a test file.
Preferably, the above commodity classification processing system further comprises performing error correction and updating on the commodity classification results and periodically retraining the classifier.
Preferably, the above commodity classification processing system further comprises a classification interface module that provides an interface for other classification applications to call.
The present invention realizes a computer-implemented commodity classification processing method and a system built on that method. Compared with the prior art, the technical solution of the present invention does not require extensive manual participation. It constructs the classifier from a probability model, makes full use of machine learning, and can classify commodity data quickly and accurately.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 shows a flow chart of the commodity classification process according to a preferred embodiment of the present invention.
Fig. 2 shows a block diagram of the commodity classification processing system according to a preferred embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
The technical solution rests on the following basis: an e-commerce website has itself accumulated a large number of commodities and already has an organizational structure of commodity categories. With reference to Fig. 1, a specific embodiment of the invention is as follows:
1.1 Commodity information collection
Category information and commodity information are extracted from the commodity database and written to text files, so that later programs can read the data directly from local files, which speeds up execution. The category information contains three columns, <category ID, category name, parent category ID>, where a parent category ID of 0 denotes a root category. For example, a category information entry may take the following form:
<014006002,Bras,014006>
A tree structure can be built directly in memory from this file.
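As a minimal sketch of how such a tree could be built, the following Python reads entries in the `<category ID,category name,parent category ID>` form shown above. The function names and the two ancestor rows ("Clothing", "Underwear") are hypothetical and only illustrate the idea; only the "Bras" row comes from the text. The sketch also assumes category names contain no commas.

```python
from collections import defaultdict

def parse_category_line(line):
    """Parse '<id,name,parent_id>' into a (id, name, parent_id) tuple."""
    cat_id, name, parent_id = line.strip().strip("<>").split(",")
    return cat_id, name, parent_id

def build_tree(lines):
    """Return (names, children): id -> name, and parent id -> list of child ids."""
    names = {}
    children = defaultdict(list)
    for line in lines:
        cat_id, name, parent_id = parse_category_line(line)
        names[cat_id] = name
        children[parent_id].append(cat_id)
    return names, children

# The first two rows are invented ancestors; "0" marks a root category.
sample = [
    "<014,Clothing,0>",
    "<014006,Underwear,014>",
    "<014006002,Bras,014006>",
]
names, children = build_tree(sample)
```

Here `children["0"]` lists the root categories, and walking `children` from there visits the whole hierarchy.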
The commodity information contains three columns, <commodity ID, commodity description, ID of the category to which the commodity belongs>. For example, a commodity information entry may take the following form:
<136202000,Car nail clippers Cartoon nail clipper ABS+steel Materials OPP packing,018009005>
1.2 Raw information preprocessing
The information collected in 1.1 needs to be checked. For the category information, orphan categories whose parent category cannot be found and categories with incomplete data are discarded, to ensure that the resulting category information is complete. The commodity information is checked in the same way, and commodity entries with incomplete data are discarded.
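The check described above might look roughly like the following Python sketch. It assumes rows have already been parsed into 3-tuples of strings; the function names and sample rows are hypothetical. Removing an orphan can leave its own children without a parent, so the sketch repeats the pass until nothing changes, which is one possible reading of "complete category information".

```python
def is_complete(row):
    """A row is complete if it has exactly three non-empty fields."""
    return len(row) == 3 and all(field.strip() for field in row)

def clean_categories(rows):
    """rows: list of (cat_id, name, parent_id). Drop incomplete rows and orphans."""
    kept = {row[0]: row for row in rows if is_complete(row)}
    changed = True
    while changed:  # repeat: dropping an orphan may orphan its descendants
        changed = False
        for cat_id, (_, _, parent_id) in list(kept.items()):
            if parent_id != "0" and parent_id not in kept:
                del kept[cat_id]
                changed = True
    return list(kept.values())

def clean_products(rows):
    """rows: list of (product_id, description, cat_id). Drop incomplete rows."""
    return [row for row in rows if is_complete(row)]

# Hypothetical sample data: one orphan chain and one row with an empty name.
categories = [
    ("014", "Clothing", "0"),
    ("014006", "Underwear", "014"),
    ("999001", "Lost", "999"),
    ("999001001", "LostChild", "999001"),
    ("015", "", "0"),
]
cleaned = clean_categories(categories)
```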
1.3 Commodity information grouping
The commodity data processed in step 1.2 is randomly divided into two groups in proportion: 90% of the commodity data is written to the training file product.train, and 10% is written to the test file product.test. The proportions can be set as required.
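A random proportional split of this kind can be sketched in a few lines of Python. The function name, the `train_ratio` parameter and the fixed seed are assumptions made for illustration; writing the two groups out to product.train and product.test is left out.

```python
import random

def split_dataset(rows, train_ratio=0.9, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) by train_ratio."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))            # stand-in for 100 commodity records
train_rows, test_rows = split_dataset(rows)
```

Seeding the generator keeps the evaluation repeatable between runs; a production system might instead reshuffle on every retraining.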
1.4 Feature extraction
Feature extraction is performed on the commodity information from 1.3: the commodity description undergoes machine word segmentation, which can incorporate the industry dictionaries accumulated by the website, and invalid tokens such as stop words are filtered out. Word segmentation mainly cuts continuous text into words that can express meaning. For example, after word segmentation and filtering, the commodity description "Car nail clippers Cartoon nail clipper ABS+steel Materials OPP packing" yields the following words: "Car, nail clipper, Cartoon, ABS+steel, OPP". The result is a to-be-classified table containing each valid keyword.
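The segmentation and filtering step could be approximated as follows for space-delimited text. This is only a sketch: the stop-word list and the two-word phrase dictionary are hypothetical stand-ins for the website's industry dictionary, chosen so that the example description above produces the word list given in the text. A real word segmentation system, especially for Chinese text, would be considerably more involved.

```python
STOP_WORDS = {"materials", "packing"}             # hypothetical stop-word list
PHRASES = {                                       # hypothetical industry dictionary
    ("nail", "clipper"): "nail clipper",
    ("nail", "clippers"): "nail clipper",         # plural mapped to one keyword
}

def extract_features(description):
    """Split on whitespace, merge dictionary phrases, drop stop words and repeats."""
    words = description.split()
    features = []
    i = 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in PHRASES:
            token, step = PHRASES[pair], 2        # consume both words of the phrase
        else:
            token, step = words[i], 1
        if token.lower() not in STOP_WORDS and token not in features:
            features.append(token)
        i += step
    return features

keywords = extract_features(
    "Car nail clippers Cartoon nail clipper ABS+steel Materials OPP packing"
)
```

Dropping repeated keywords mirrors the example output in the text; a variant that keeps repeats would feed term frequencies to the classifier instead.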
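1.5 Naive Bayes classification
The Bayes formula and its transform are as follows:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}$$

$$P(C = c \mid F_1 = f_1, \ldots, F_n = f_n) \propto P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)$$

where C denotes the category set and F_i denotes a term in the commodity description. The second expression follows from the first under the assumption that the F_i are conditionally independent given the category.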
1.6 Classifier training
With the product.train file generated in 1.3 as the training sample, a classifier can be generated by applying the naive Bayes formula of 1.5. The main work is to compute the frequency of each category in the training sample and to estimate the conditional probability of each feature attribute given each category. The input is the feature attributes and the training sample, and the output is the classifier.
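The training step can be sketched as plain counting. In the Python below, `prior` is the share of training commodities in a category and `conditional` is the share of a category's keywords that equal a given word, following the definitions given for the probability model. Two things are assumptions that the text does not state: the add-one smoothing parameter `alpha`, added so that a word unseen in a category does not force the product to zero, and the use of log probabilities to avoid numeric underflow. The sample keywords are invented; the two category IDs are taken from the example entries.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: iterable of (keywords, category). Returns the counts as a model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for keywords, category in samples:
        class_counts[category] += 1
        word_counts[category].update(keywords)
        vocab.update(keywords)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}

def prior(model, category):
    """P(C = c): commodities in this category over all commodities."""
    return model["class_counts"][category] / sum(model["class_counts"].values())

def conditional(model, word, category, alpha=1.0):
    """P(F_i = f_i | C = c), with optional add-alpha smoothing (an assumption)."""
    counts = model["word_counts"][category]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(model["vocab"]))

def rank(model, keywords, k=5):
    """Return the k categories with the highest log posterior score."""
    scores = {}
    for category in model["class_counts"]:
        score = math.log(prior(model, category))
        score += sum(math.log(conditional(model, w, category)) for w in keywords)
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

model = train([
    (["nail clipper", "Car"], "018009005"),
    (["nail clipper", "steel"], "018009005"),
    (["Bras", "cotton"], "014006002"),
])
best = rank(model, ["nail clipper"])
```

Calling `conditional(model, word, category, alpha=0)` gives the unsmoothed ratio exactly as described in the text.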
1.7 Classifier evaluation
With the product.test file generated in 1.3 as the test sample, and assuming that the category each commodity is currently in is the correct one, the above classifier is applied to verify the data. In tests on commodity data on the order of millions of items and about three thousand leaf categories, taking the five most probable categories, an accuracy of more than 96% can usually be reached.
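A top-5 accuracy measurement of the kind described could be computed as below. The function takes any callable that returns categories ordered from most to least probable; the stub predictor and the sample labels are hypothetical and exist only to make the sketch runnable. They say nothing about the accuracy figure reported in the text.

```python
def top_k_accuracy(predict_ranked, test_samples, k=5):
    """Share of samples whose recorded category is among the top k predictions."""
    hits = 0
    total = 0
    for keywords, true_category in test_samples:
        total += 1
        if true_category in predict_ranked(keywords)[:k]:
            hits += 1
    return hits / total if total else 0.0

def stub_predictor(keywords):
    """Hypothetical classifier that always returns the same ranking."""
    return ["a", "b", "c", "d", "e", "f"]

test_samples = [(["x"], "a"), (["y"], "f"), (["z"], "e"), (["w"], "q")]
accuracy = top_k_accuracy(stub_predictor, test_samples)
```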
1.8 Manual error correction and periodic training
As the website operates over time, some commodities will inevitably end up misclassified. A department responsible for product inspection is needed so that misclassified commodities can be moved to the correct category at any time. In addition, the category structure may change at any time. When category information changes, the correspondence between old and new categories needs to be recorded, so that when the classifier outputs an old category it can be converted to the new one in time. The classifier itself must also be retrained periodically, depending on how quickly the website's product data changes; in general, once a month.
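One simple way to record the correspondence between old and new categories is a lookup table that is applied to every classifier output. The sketch below assumes a plain dictionary from old ID to new ID and follows chains of renames; the function name and the target IDs in the sample are hypothetical.

```python
def resolve_category(category_id, migrations):
    """Follow old -> new mappings until a category with no further mapping is reached."""
    seen = set()
    while category_id in migrations and category_id not in seen:
        seen.add(category_id)       # guard against accidental cycles in the table
        category_id = migrations[category_id]
    return category_id

# Hypothetical history: one category was moved twice.
migrations = {
    "018009005": "018010001",
    "018010001": "018010002",
}
current = resolve_category("018009005", migrations)
```

Once the classifier has been retrained on data that uses the new categories, the corresponding table entries are no longer needed for its output.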
1.9 Using the classification model in production
The classification model generated above is added to the web application and loaded into memory, and an interface is provided for other applications that need online classification to call.
The present invention also provides a text-based commodity classification processing system which, as shown in Fig. 2, comprises: a data extraction module for extracting commodity data in text format; a grouping module for grouping the commodity data; a feature extraction module for performing feature extraction on the commodity data; and a classifier that uses a training file to classify the commodity data according to a probability model.
The above commodity classification processing system also comprises a data preprocessing module that performs a data check on the commodity data before it is grouped, and a classification interface module that provides an interface for other classification applications to call.
The above discloses only preferred embodiments of the present invention, but the scope of protection of the invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.
Claims (21)
1. A text-based commodity classification processing method, characterized by:
extracting commodity data in text format from a commodity database, grouping the commodity data and performing feature extraction, constructing a classifier from a training file according to a probability model, and classifying the commodity data.
2. The method according to claim 1, wherein said commodity data in text format comprises: category information and commodity information.
3. The method according to claim 2, wherein said category information comprises a category ID, a category name and a parent category ID.
4. The method according to claim 2, wherein said commodity information comprises a commodity ID, a commodity description and the ID of the category to which the commodity belongs.
5. The method according to claim 1, further comprising the following step: before the commodity data is grouped, preprocessing the commodity data.
6. The method according to claim 4, wherein performing feature extraction on the commodity data comprises: performing machine word segmentation on the commodity description to form a to-be-classified table containing the valid words.
7. The method according to claim 1, wherein grouping the commodity data comprises randomly dividing the commodity data, in proportion, into a training file and a test file.
8. The method according to claim 1, wherein the probability model is the naive Bayes transform.
9. The method according to claim 8, wherein the naive Bayes transform is calculated as:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}$$

where C denotes the category set and F_i denotes a term in the commodity description; P(C | F_1, ..., F_n) is the posterior probability; P(F_1, ..., F_n | C) is the likelihood function; and P(F_1, ..., F_n) is the evidence; and wherein, under the conditional independence assumption that the F_i are conditionally independent, the following probability model is obtained:

$$P(C = c \mid F_1 = f_1, \ldots, F_n = f_n) \propto P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)$$
10. The method according to claim 1, further comprising performing error correction and updating on the commodity classification results and periodically retraining the classifier.
11. A text-based commodity classification processing system, characterized by comprising:
a data extraction module for extracting commodity data in text format;
a grouping module for grouping the commodity data;
a feature extraction module for performing feature extraction on the commodity data;
a classifier that uses a training file to classify the commodity data according to a probability model.
12. The system according to claim 11, wherein said commodity data in text format comprises: category information and commodity information.
13. The system according to claim 12, wherein said category information comprises a category ID, a category name and a parent category ID.
14. The system according to claim 12, wherein said commodity information comprises a commodity ID, a commodity description and the ID of the category to which the commodity belongs.
15. The system according to claim 11, further comprising a data preprocessing module that performs a data check on the commodity data before the commodity data is grouped.
16. The system according to claim 14, wherein performing feature extraction on the commodity data comprises: performing machine word segmentation on the commodity description to form a to-be-classified table containing the valid words.
17. The system according to claim 1, wherein grouping the commodity data comprises randomly dividing the commodity data, in proportion, into a training file and a test file.
18. The system according to claim 1, wherein the probability model is the naive Bayes transform.
19. The system according to claim 18, wherein the naive Bayes transform is calculated as:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}$$

where C denotes the category set and F_i denotes a term in the commodity description; P(C | F_1, ..., F_n) is the posterior probability; P(F_1, ..., F_n | C) is the likelihood function; and P(F_1, ..., F_n) is the evidence; and wherein, under the conditional independence assumption that the F_i are conditionally independent, the following probability model is obtained:

$$P(C = c \mid F_1 = f_1, \ldots, F_n = f_n) \propto P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)$$
20. The system according to claim 11, further comprising performing error correction and updating on the commodity classification results and periodically retraining the classifier.
21. The system according to claim 11, further comprising a classification interface module that provides an interface for other classification applications to call.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310701215.1A CN103646343A (en) | 2013-12-18 | 2013-12-18 | Text based commodity classification treatment method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310701215.1A CN103646343A (en) | 2013-12-18 | 2013-12-18 | Text based commodity classification treatment method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103646343A true CN103646343A (en) | 2014-03-19 |
Family
ID=50251553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310701215.1A Pending CN103646343A (en) | 2013-12-18 | 2013-12-18 | Text based commodity classification treatment method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103646343A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361010A (en) * | 2014-10-11 | 2015-02-18 | 北京中搜网络技术股份有限公司 | Automatic classification method for correcting news classification |
CN104616198A (en) * | 2015-02-12 | 2015-05-13 | 哈尔滨工业大学 | P2P (peer-to-peer) network lending risk prediction system based on text analysis |
CN105205081A (en) * | 2014-06-27 | 2015-12-30 | 华为技术有限公司 | Article recommendation method and device |
CN105677677A (en) * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Information classification and device |
CN105956083A (en) * | 2016-04-29 | 2016-09-21 | 广州优视网络科技有限公司 | Application software classification system, application software classification method and server |
CN106021350A (en) * | 2016-05-10 | 2016-10-12 | 湖北工程学院 | An artwork collection and management method and an artwork collection and management system |
CN106650783A (en) * | 2015-10-30 | 2017-05-10 | 李静涛 | Method, device and system for mobile terminal data classifying, generating and matching |
CN107704892A (en) * | 2017-11-07 | 2018-02-16 | 宁波爱信诺航天信息有限公司 | A kind of commodity code sorting technique and system based on Bayesian model |
CN108564443A (en) * | 2018-04-13 | 2018-09-21 | 广东星外星文化传播有限公司 | Commodity ranking method and device |
CN109509014A (en) * | 2018-09-06 | 2019-03-22 | 微梦创科网络科技(中国)有限公司 | A kind of put-on method and device of media information |
CN109598517A (en) * | 2017-09-29 | 2019-04-09 | 阿里巴巴集团控股有限公司 | Commodity clearance processing, the processing of object and its class prediction method and apparatus |
CN109766440A (en) * | 2018-12-17 | 2019-05-17 | 航天信息股份有限公司 | A kind of method and system for for the determining default categories information of object text description |
CN109858027A (en) * | 2019-01-22 | 2019-06-07 | 北京万诚信用评价有限公司 | One tool method for identifying and classifying of internet four product of electric business merchandise news |
CN110009796A (en) * | 2019-04-11 | 2019-07-12 | 北京邮电大学 | Invoice category recognition methods, device, electronic equipment and readable storage medium storing program for executing |
CN110135463A (en) * | 2019-04-18 | 2019-08-16 | 微梦创科网络科技(中国)有限公司 | A kind of commodity method for pushing and device |
CN111274504A (en) * | 2020-01-20 | 2020-06-12 | 浙江中国轻纺城网络有限公司 | Commodity classification method, device and equipment for e-commerce platform |
CN111353838A (en) * | 2018-12-21 | 2020-06-30 | 北京京东尚科信息技术有限公司 | Method and device for automatically checking commodity category |
-
2013
- 2013-12-18 CN CN201310701215.1A patent/CN103646343A/en active Pending
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105205081B (en) * | 2014-06-27 | 2019-11-05 | 华为技术有限公司 | Item recommendation method and device |
CN105205081A (en) * | 2014-06-27 | 2015-12-30 | 华为技术有限公司 | Article recommendation method and device |
CN104361010A (en) * | 2014-10-11 | 2015-02-18 | 北京中搜网络技术股份有限公司 | Automatic classification method for correcting news classification |
CN105677677A (en) * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Information classification and device |
CN104616198A (en) * | 2015-02-12 | 2015-05-13 | 哈尔滨工业大学 | P2P (peer-to-peer) network lending risk prediction system based on text analysis |
CN104616198B (en) * | 2015-02-12 | 2018-01-26 | 哈尔滨工业大学 | A kind of P2P network loan Risk Forecast Systems based on text analyzing |
CN106650783A (en) * | 2015-10-30 | 2017-05-10 | 李静涛 | Method, device and system for mobile terminal data classifying, generating and matching |
CN105956083A (en) * | 2016-04-29 | 2016-09-21 | 广州优视网络科技有限公司 | Application software classification system, application software classification method and server |
CN106021350A (en) * | 2016-05-10 | 2016-10-12 | 湖北工程学院 | An artwork collection and management method and an artwork collection and management system |
CN109598517A (en) * | 2017-09-29 | 2019-04-09 | 阿里巴巴集团控股有限公司 | Commodity clearance processing, the processing of object and its class prediction method and apparatus |
CN107704892A (en) * | 2017-11-07 | 2018-02-16 | 宁波爱信诺航天信息有限公司 | A kind of commodity code sorting technique and system based on Bayesian model |
CN108564443A (en) * | 2018-04-13 | 2018-09-21 | 广东星外星文化传播有限公司 | Commodity ranking method and device |
CN109509014A (en) * | 2018-09-06 | 2019-03-22 | 微梦创科网络科技(中国)有限公司 | A kind of put-on method and device of media information |
CN109509014B (en) * | 2018-09-06 | 2021-07-27 | 微梦创科网络科技(中国)有限公司 | Media information delivery method and device |
CN109766440A (en) * | 2018-12-17 | 2019-05-17 | 航天信息股份有限公司 | A kind of method and system for for the determining default categories information of object text description |
CN109766440B (en) * | 2018-12-17 | 2023-09-01 | 航天信息股份有限公司 | Method and system for determining default classification information for object text description |
CN111353838A (en) * | 2018-12-21 | 2020-06-30 | 北京京东尚科信息技术有限公司 | Method and device for automatically checking commodity category |
CN109858027A (en) * | 2019-01-22 | 2019-06-07 | 北京万诚信用评价有限公司 | One tool method for identifying and classifying of internet four product of electric business merchandise news |
CN110009796A (en) * | 2019-04-11 | 2019-07-12 | 北京邮电大学 | Invoice category recognition methods, device, electronic equipment and readable storage medium storing program for executing |
CN110009796B (en) * | 2019-04-11 | 2020-12-04 | 北京邮电大学 | Invoice category identification method and device, electronic equipment and readable storage medium |
CN110135463A (en) * | 2019-04-18 | 2019-08-16 | 微梦创科网络科技(中国)有限公司 | A kind of commodity method for pushing and device |
CN111274504A (en) * | 2020-01-20 | 2020-06-12 | 浙江中国轻纺城网络有限公司 | Commodity classification method, device and equipment for e-commerce platform |
CN111274504B (en) * | 2020-01-20 | 2023-09-26 | 浙江中国轻纺城网络有限公司 | Commodity classification method, device and equipment of e-commerce platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103646343A (en) | Text based commodity classification treatment method and system | |
CN103605815B (en) | A kind of merchandise news being applicable to B2B E-commerce platform is classified recommendation method automatically | |
CN104408093A (en) | News event element extracting method and device | |
Nagamma et al. | An improved sentiment analysis of online movie reviews based on clustering for box-office prediction | |
CN110377696A (en) | A kind of commodity future news the analysis of public opinion method and system | |
CN103914478A (en) | Webpage training method and system and webpage prediction method and system | |
CN103310343A (en) | Commodity information issuing method and device | |
CN103778214A (en) | Commodity property clustering method based on user comments | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
US9104761B2 (en) | Document analysis device, document analysis method, and computer readable recording medium | |
CN111414520B (en) | Intelligent mining system for sensitive information in public opinion information | |
CN101706812B (en) | Method and device for searching documents | |
US9639818B2 (en) | Creation of event types for news mining for enterprise resource planning | |
CN104965931A (en) | Big data based public opinion analysis method | |
CN106557558A (en) | A kind of data analysing method and device | |
CN103984705A (en) | Search result displaying method, device and system | |
CN104834651A (en) | Method and apparatus for providing answers to frequently asked questions | |
CN105138577A (en) | Big data based event evolution analysis method | |
CN109710725A (en) | A kind of Chinese table column label restoration methods and system based on text classification | |
CN101853282A (en) | System and method for extracting information of cross-site shopping mode of user | |
CN103577472A (en) | Method and system for obtaining and presuming personal information as well as method and system for classifying and retrieving commodities | |
CN104133913B (en) | A kind of city retail shop information bank automatic build system being polymerized with search based on video analysis and method | |
CN105824915A (en) | Method and system for generating commenting digest of online shopped product | |
US9443214B2 (en) | News mining for enterprise resource planning | |
CN102799666B (en) | Method for automatically categorizing texts of network news based on frequent term set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20140319 |