CN107590156A - A multi-class text classification method based on cyclic expansion of the training set - Google Patents
- Publication number
- CN107590156A (application number CN201610535646.9)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- training set
- multi-class
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to the technical field of multi-class text classification systems, and in particular to a machine-learning-based method for classifying application programs. The training set is cyclically expanded in order to improve the results of multi-class text classification performed by computer. The technical solution includes: a method of multi-class text classification that considers both the title and the body of a text; a method of building keyword and stop-word dictionaries; and a method of manually adding rules to improve the classification of small categories. Given the same number of correct classification results (correct results labeled manually), cyclically expanding the training set can effectively raise the accuracy of computer-based multi-class text classification. At the same time, the classification efficiency of this method is far higher than that of manual classification.
Description
Technical field
The present invention relates to the technical field of multi-class text classification systems, and in particular to a machine-learning-based method for classifying application programs.
Background art
At present there are two approaches to multi-class text classification. The first is manual classification. Although its accuracy is relatively high, it requires a great deal of time and labor cost, and its efficiency falls far short of algorithmic classification. When a large volume of classification work is needed, a manual system is not suitable. Moreover, the accuracy of manual classification is closely tied to human subjective factors: a person cannot apply a single uniform standard of judgment throughout all working hours. The second approach is classification by computer. A computer needs enough correct classification results to serve as a training set; the richer the training set, the better the computer's classification results. The correctly classified data originally available often falls short of the quantity the computer needs, so improving the computer's classification results requires a large amount of further manual classification work.
Summary of the invention
Purpose of the invention
To enlarge the training set for multi-class text classification by cyclically expanding it, and thereby improve the results of multi-class text classification performed by computer.
Technical solution of the invention
A multi-class text classification method, comprising:
obtaining the title information of the texts to be classified and collating the text bodies; establishing a keyword-matching rule model based on common knowledge; manually labeling a small number of samples for training, building one short-text classification model from the text bodies and another short-text classification model from the text titles; verifying the accuracy of each text classification model by cross-validation; classifying the application programs to be classified with the rule model and the two short-text classification models, and adding the samples on which all three models agree to the training set; retraining the text classification models and verifying their accuracy by cross-validation; classifying the remaining application programs to be classified with the new text classification models and the rule model; repeating this process until the accuracy of the text classification models exceeds a preset threshold; and classifying the texts to be classified with the final version of the text classification models.
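The loop described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the word-count "model", the function names, the `evaluate` callback and the `max_rounds` guard are all assumptions introduced here, and a real system would plug in one of the classifiers named in the next paragraph plus a cross-validated accuracy estimate.

```python
from collections import Counter, defaultdict

def train_keyword_model(samples):
    """Toy stand-in for a short-text classifier: per-class word counts."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Return the class whose training words overlap most with the text, or None."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None

def rule_predict(rules, title):
    """Ordered keyword rules applied to the title; the first match wins."""
    for keyword, label in rules:
        if keyword in title.lower():
            return label
    return None

def expand_training_set(labeled, unlabeled, rules,
                        evaluate=None, threshold=0.9, max_rounds=5):
    """labeled: (title, body, label) tuples; unlabeled: (title, body) tuples.

    evaluate is a hypothetical callback returning an accuracy estimate for
    the current training set; the loop stops once it exceeds threshold.
    """
    labeled, remaining = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        if evaluate is not None and evaluate(labeled) > threshold:
            break
        title_model = train_keyword_model([(t, y) for t, _, y in labeled])
        body_model = train_keyword_model([(b, y) for _, b, y in labeled])
        agreed, still_open = [], []
        for title, body in remaining:
            votes = {rule_predict(rules, title),
                     predict(title_model, title),
                     predict(body_model, body)}
            # Keep a sample only when all three models give the same label.
            if len(votes) == 1 and None not in votes:
                agreed.append((title, body, votes.pop()))
            else:
                still_open.append((title, body))
        if not agreed:  # nothing new to learn from; stop early
            break
        labeled.extend(agreed)
        remaining = still_open
    return labeled, remaining
```

Requiring unanimous agreement trades coverage for label quality: fewer samples are added per round, but each added sample is less likely to poison the training set.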
Preferably, the short-text classification model is based on at least one of the following: a support vector machine classifier, a random forest classifier, a logistic regression classifier, an AdaBoost classifier, or a combined model formed from multiple single models, each built separately on a different kind of text description information.
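The patent does not say how the single models are combined. One common choice, shown here purely as an assumption, is a strict majority vote over the single models' predictions; `combine_by_vote` and the lambda models in the usage note are hypothetical names.

```python
from collections import Counter

def combine_by_vote(models):
    """Build a combined classifier from single-model predict functions."""
    def predict(sample):
        votes = Counter(model(sample) for model in models)
        label, count = votes.most_common(1)[0]
        # Require a strict majority; otherwise abstain by returning None.
        return label if count * 2 > len(models) else None
    return predict
```

For example, `combine_by_vote([lambda s: "game", lambda s: "game", lambda s: "tool"])("any text")` yields `"game"`, while two models that disagree yield `None`.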
A device for classifying application programs, comprising:
a rule model establishing unit, for establishing a keyword-matching rule model based on common knowledge;
a short-text classification model establishing unit, for manually labeling a small number of samples for training, building short-text classification models from the text titles and from the text bodies respectively, and verifying the accuracy of the text classification models by cross-validation;
a training set cyclic expansion module, for classifying the texts to be classified with the rule model and the short-text classification models, adding the samples on which all three models agree to the training set, retraining the text classification models, verifying their accuracy by cross-validation, classifying the remaining application programs to be classified with the new text classification models and the rule model, and repeating this process until the accuracy of the text classification models exceeds a preset threshold;
a text classification module, for classifying the application programs to be classified with the final version of the text classification models.
The specific flow is shown in Fig. 1 of the accompanying drawings, the training flow chart of the multi-class text classification model.
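The cross-validation step used by the units above can be sketched as a plain k-fold accuracy estimate. The function name, the interleaved fold assignment and the `train_fn`/`predict_fn` callbacks are assumptions for illustration; the patent does not specify the number of folds or how they are drawn.

```python
def k_fold_accuracy(samples, train_fn, predict_fn, k=5):
    """Mean held-out accuracy over k folds.

    samples: list of (text, label); train_fn(train) returns a model;
    predict_fn(model, text) returns a label. Assumes samples is non-empty.
    """
    folds = [samples[i::k] for i in range(k)]  # interleaved split
    accuracies = []
    for i, held_out in enumerate(folds):
        if not held_out:
            continue
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, text) == label
                      for text, label in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)
```

The returned value is what would be compared against the preset threshold to decide whether another expansion round is needed.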
Effect of the invention
Given the same number of correct classification results (correct results labeled manually), cyclically expanding the training set can effectively raise the accuracy of computer-based multi-class text classification. At the same time, the classification efficiency of this method is far higher than that of manual classification.
Brief description of the drawings
Fig. 1 is the training flow chart of the multi-class text classification model; Fig. 2 is the model training flow chart.
Embodiment
Background of the mobile app classification example
With technological change and application innovation in the information industry, by the end of 2015 smartphones accounted for more than 70% of all mobile phone sales in China; the total number of applications in the Apple App Store had exceeded 1.5 million, application downloads had exceeded 100 billion, and 850 applications were downloaded every second. Millions of applications are in use by mobile phone users, and many new applications come into use every day.
On the one hand, the major mobile app markets need to manage these millions of applications and classify them so that users can download and use them conveniently. On the other hand, the major telecom operators obtain the names of the applications their users run from traffic data, and to provide more precise and finer-grained services they also need to classify the applications they collect.
Classifying the collected application information makes it possible to analyze user preferences precisely and to mine users' application usage records in depth, and in turn to build user profiles and uncover user interests and preferences.
At present the commonly used method is manual classification, which consumes a great deal of human resources. Suppose each worker can classify one application with a complete name and description every 10 seconds; classifying 1,000,000 applications then takes about 2,778 hours, equivalent to about 116 calendar days, or about 347 working days at 8 working hours per day. When the names of the collected applications are non-standard and their descriptions incomplete, workers spend still more time and effort.
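The workload estimate above can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs (1,000,000 applications, 10 seconds each, 8 working hours per day) are the figures stated in the text.

```python
apps = 1_000_000
seconds_per_app = 10

hours = apps * seconds_per_app / 3600  # total labeling time in hours
calendar_days = hours / 24             # working around the clock
working_days = hours / 8               # at 8 working hours per day

print(round(hours), round(calendar_days), round(working_days))
# prints: 2778 116 347
```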
To classify millions of applications quickly, a computer can apply multi-class text classification: the application classification problem is converted into a multi-class text classification problem. Specifically, the description of an application is treated as a piece of text; text processing is applied to it, its keywords are extracted and converted into word vectors, and a common classification model (such as logistic regression, a decision tree, or a support vector machine) then performs the multi-class classification.
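As a rough sketch of the "description to word vector" step, the following builds a bag-of-words count vector. It assumes whitespace-separable Latin-script text; Chinese descriptions, as in the patent's setting, would first need a word segmenter, which is not shown. All function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed vector index."""
    vocab = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(vocab)}

def to_term_vector(text, vocabulary):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector
```

The resulting fixed-length vectors are the kind of input that logistic regression, decision trees, or support vector machines expect.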
Compared with manual classification, existing computer-based methods for classifying applications are markedly faster, but owing to defects in the existing classification models themselves and the immaturity of Chinese language processing technology, they classify applications poorly and with low accuracy.
Implementation procedure
1. App list
Obtain directly the list of apps actually installed on users' phones.
Clean the app list: remove erroneous data, merge equivalent entries, and perform a very preliminary classification of app types.
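The cleaning step might look like the sketch below. The patent does not define what counts as erroneous or equivalent data, so the choices here (dropping non-string and empty names, collapsing whitespace, merging names that differ only by case) are assumptions.

```python
def clean_app_list(raw_names):
    """Drop invalid entries and merge names that differ only by case or spacing."""
    seen, cleaned = set(), []
    for name in raw_names:
        if not isinstance(name, str):
            continue  # erroneous record, e.g. a missing value
        normalized = " ".join(name.split())  # trim and collapse whitespace
        key = normalized.casefold()
        if not key or key in seen:
            continue  # empty name or duplicate of an earlier entry
        seen.add(key)
        cleaned.append(normalized)
    return cleaned
```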
2. Crawler
Find an app website whose classification is reasonably accurate and crawl its app descriptions and app category data to serve as the model training set.
Taking the cleaned app list as the target, crawl the description information of all apps.
Crawling the description information has two main steps: first, crawl from mainstream app websites (360, Tencent Yingyongbao, Baidu Mobile Assistant); for information that cannot be crawled from the app websites, fill the gaps through Baidu search.
Clean the crawled data to improve the quality of the app descriptions.
3. Classification rules
(1)Standards for establishing classification rules:
(2)Rules are strong rules that can classify an app directly from its name.
(3)The rules are ordered, and the order can be adjusted according to business needs and results.
Rules must be accurate.
Rules are established on app names rather than descriptions because app descriptions vary widely, so rules on them would be uncontrollable and inaccurate.
Hence the strategy: classify only with the strongest rules, which must be guaranteed accurate.
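A name-based rule model with ordered rules can be sketched as follows. The patterns and category names in `RULES` are invented examples, not rules from the patent; the point is only that rules are tried in order and the first match decides.

```python
import re

# Hypothetical rules; earlier entries take priority over later ones.
RULES = [
    (re.compile(r"bank|pay", re.I), "finance"),
    (re.compile(r"game|chess", re.I), "game"),
]

def classify_by_name(app_name, rules=RULES):
    """Return the category of the first rule matching the app name, else None."""
    for pattern, category in rules:
        if pattern.search(app_name):
            return category
    return None  # no strong rule applies; leave the app to the other models
```

Because the list is ordered, a name such as "Pay Chess" is classified as "finance"; reordering the list changes that outcome, which is how rule priority can be tuned to business needs.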
4. Classification model
In classification problems, the quality of the data set almost determines the upper bound of the classification model; algorithms and feature engineering only bring the results closer to that bound. The training set taken from mainstream app websites is of extremely poor quality, containing obvious misclassifications and duplicate classifications.
We therefore spent a great deal of time and effort on processing the training set and on feature engineering, so that the model's results steadily improved.
(1)Process of building the training set:
Select part of the crawled data set as the training set.
Adjust the training data set to remove poor-quality samples.
Manually label some categories, and use manual rules to adjust and enlarge the training set.
Recycle the app classification results: the portion of the first-round classification results that is entirely correct is used as training data in later rounds, to improve the precision of the classification results.
Existing problems: the training set is small and extremely unbalanced, and the existing training set is of poor quality (the initial training set is itself inaccurate).
Solution: for categories with few training samples, a great deal of rule work is needed, so that good results can be achieved even when the training set is very small.
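One simple way to find the categories that need extra rule work is to count the training samples per category and flag those below a cutoff. The function name and the `min_samples` default are assumptions; the patent gives no numeric threshold.

```python
from collections import Counter

def small_classes(labels, min_samples=50):
    """Return the categories with fewer than min_samples training samples."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_samples)
```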
(2)Feature engineering:
For text, the most important feature is the words the text is composed of. A good deal of work is therefore needed to make the features more accurate, including building a dictionary suited to this domain. In the end we built a stop-word dictionary and a keyword dictionary: the keyword dictionary is the main basis on which we classify, and the stop-word dictionary marks the words we consider meaningless.
Each keyword is added to the stop-word dictionary or the user dictionary according to its rule, and the correspondence between keyword combinations and categories serves as the app classification standard.
(3)Dictionaries
Stop-word dictionary: used to remove stop words. It consists of generic words plus stop words.
It needs to be checked and updated repeatedly during word segmentation.
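Stop-word filtering after segmentation can be sketched as below. Both word sets are placeholders made up for illustration; the patent's actual dictionary contents are not given.

```python
# Placeholder dictionaries; a real system would load curated word lists.
GENERIC_STOP_WORDS = {"the", "a", "and", "for"}
DOMAIN_STOP_WORDS = {"app", "download", "free"}

def remove_stop_words(tokens, stop_words=GENERIC_STOP_WORDS | DOMAIN_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word dictionary."""
    return [token for token in tokens if token.lower() not in stop_words]
```

Keeping the dictionaries as plain sets makes the repeated check-and-update cycle cheap: words found to be uninformative during segmentation can simply be added to the set.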
Keyword dictionary:
Combine several text-feature extraction algorithms (TF-IDF and TextRank applied together) with the domain characteristics of the business to extract and screen keywords.
Add category label words (the names of the categories).
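The TF-IDF half of the keyword extraction can be sketched in plain Python as follows. This covers TF-IDF only; TextRank and the business-specific screening are omitted. The function name, the unsmoothed IDF formula and the alphabetical tie-break are assumptions.

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_n=3):
    """documents: list of token lists. Returns the top_n keywords per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    result = []
    for doc in documents:
        tf = Counter(doc)
        scores = {word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
                  for word, count in tf.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        result.append(ranked[:top_n])
    return result
```

Words that occur in every document get a score of zero, so they sink to the bottom of the ranking; such words are natural candidates for the stop-word dictionary.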
In this embodiment of the invention, description information for applications is obtained from mainstream app markets and search engines, and manual rules are combined with short-text classification models to expand the training set automatically and cyclically. This greatly reduces the cost of manual labeling, realizes automatic classification of applications across the whole network, and lays a foundation for mining the preferences of telecom operators' users.
The specific flow is shown in Fig. 2 of the accompanying drawings, the model training flow chart.
Claims (4)
1. the method for cyclic extension training set.
2. considering scheme using the title and content of text messages of text message carries out the polytypic method of text.
3. build keyword, the method for stop words dictionary.
4. the method for manually adding rule improves the effect of small category classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610535646.9A CN107590156A (en) | 2016-07-09 | 2016-07-09 | A multi-class text classification method based on cyclic expansion of the training set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107590156A true CN107590156A (en) | 2018-01-16 |
Family
ID=61045783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610535646.9A Pending CN107590156A (en) | 2016-07-09 | 2016-07-09 | A multi-class text classification method based on cyclic expansion of the training set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590156A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180116 |