CN107590156A

CN107590156A - A text multi-classification method based on cyclic expansion of the training set

Info

Publication number: CN107590156A
Application number: CN201610535646.9A
Authority: CN
Inventors: 李雪鹏; 田昊枢; 毛智愚; 欧高炎
Original assignee: Beijing Letter To Princeton Technology Co Ltd
Current assignee: Beijing Letter To Princeton Technology Co Ltd
Priority date: 2016-07-09
Filing date: 2016-07-09
Publication date: 2018-01-16

Abstract

The present invention relates to the technical field of text multi-classification systems, and in particular to a machine-learning-based method for classifying application programs. By cyclically expanding the training set, it improves the effect of computer-based text multi-classification. The technical scheme includes: a method of text multi-classification that jointly considers the title and the content of a text; a method of building keyword and stop-word dictionaries; and a method of manually adding rules to improve the classification of small categories. With the same number of correct classification labels (correct labels annotated manually), cyclically expanding the training set effectively raises the accuracy of computer-based text multi-classification. Moreover, the classification efficiency of this method is far higher than that of manual classification.

Description

A text multi-classification method based on cyclic expansion of the training set

Technical field

The present invention relates to the technical field of text multi-classification systems, and in particular to a machine-learning-based method for classifying application programs.

Background art

At present there are two approaches to text multi-classification. The first is manual classification. Although its accuracy is relatively high, it requires a great deal of time and labor, and its efficiency falls far short of algorithmic classification, so it is unsuitable when a large volume of classification work is required. In addition, the accuracy of manual classification depends heavily on subjective human factors: a person cannot apply one uniform judgment standard throughout all working hours. The second approach is classification by computer. A computer needs enough correct classification labels as a training set, and the richer the training set, the better the classification result. However, the originally available correctly classified data is often too small for the computer's needs, so improving the computer's classification again requires a large amount of manual classification work.

Summary of the invention

Purpose of the invention

To enlarge the training set for text multi-classification by cyclically expanding it, and thereby improve the effect of computer-based text multi-classification.

Technical solution of the invention

A text multi-classification method, comprising:

obtaining the title information of each text to be classified and collating the body text itself; establishing a keyword-matching rule model based on common knowledge; manually labeling a small number of samples for training, building one short-text classification model from the body text and another from the title information, and verifying the accuracy of each text classification model by cross-validation; classifying the application programs to be classified with the rule model and the two short-text classification models, and adding the samples on which all three models agree to the training set; retraining the text classification models and verifying their accuracy by cross-validation; classifying the remaining application programs to be classified with the new text classification models and the rule model; repeating the above process until the accuracy of the text classification models exceeds a preset threshold; and classifying the texts to be classified with the final version of the text classification models.

Preferably, the short-text classification model is based on at least one of the following models: a support vector machine classification model, a random forest classification model, a logistic regression classification model, an AdaBoost classification model, or a combined model obtained by combining multiple single models, each built on a different kind of text description information.
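The cyclic expansion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: a tiny Naive Bayes classifier (`train_nb`) stands in for the support vector machine, random forest, logistic regression, or AdaBoost models named above; whitespace tokenization stands in for Chinese word segmentation; and the names `expand_training_set`, `cross_val_accuracy`, `max_rounds`, and the fields `title`, `description`, and `label` are hypothetical. Taking the lower of the two cross-validated accuracies as the accuracy to compare with the threshold, and stopping when no new sample is agreed on, are also assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Train a tiny multinomial Naive Bayes model on (text, label) pairs.

    Stand-in (assumption) for the SVM / random forest / logistic
    regression / AdaBoost models named in the text.
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)

    def predict(text):
        best_label, best_score = None, -math.inf
        for label, n in label_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(n / len(samples))
            for tok in text.lower().split():
                # Laplace smoothing so unseen words do not zero the score
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def cross_val_accuracy(samples, field, k=3):
    """k-fold cross-validated accuracy of a model trained on one text field."""
    correct = 0
    for fold in range(k):
        train = [(s[field], s["label"]) for i, s in enumerate(samples) if i % k != fold]
        held_out = [s for i, s in enumerate(samples) if i % k == fold]
        model = train_nb(train)
        correct += sum(model(s[field]) == s["label"] for s in held_out)
    return correct / len(samples) if samples else 0.0


def expand_training_set(labeled, unlabeled, rule_model, threshold=0.9, max_rounds=10):
    """Grow the training set with samples on which all three models agree."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):  # max_rounds is a safety cap (assumption)
        title_model = train_nb([(s["title"], s["label"]) for s in labeled])
        desc_model = train_nb([(s["description"], s["label"]) for s in labeled])
        accuracy = min(cross_val_accuracy(labeled, "title"),
                       cross_val_accuracy(labeled, "description"))
        if accuracy > threshold:
            break
        agreed, remaining = [], []
        for item in unlabeled:
            votes = {rule_model(item["title"]),
                     title_model(item["title"]),
                     desc_model(item["description"])}
            if len(votes) == 1 and None not in votes:
                agreed.append({**item, "label": votes.pop()})
            else:
                remaining.append(item)
        if not agreed:  # nothing new to learn from (assumption: stop here)
            break
        labeled.extend(agreed)
        unlabeled = remaining
    return labeled, unlabeled
```

Requiring all three models to agree keeps the automatically added labels conservative; samples on which the models disagree, or for which the rule model has no answer, stay in the unlabeled pool for a later round.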

A device for classifying application programs, comprising:

a rule model establishing unit, for establishing a keyword-matching rule model based on common knowledge;

a short-text classification model establishing unit, for manually labeling a small number of samples for training, building short-text classification models from the text title information and from the body text respectively, and verifying the accuracy of the text classification models by cross-validation;

a training set cyclic expansion module, for classifying the texts to be classified with the rule model and the short-text classification models, adding the samples on which all three models agree to the training set, retraining the text classification models, verifying their accuracy by cross-validation, classifying the remaining application programs to be classified with the new text classification models and the rule model, and repeating the above process until the accuracy of the text classification models exceeds a preset threshold;

a text classification module, for classifying the application programs to be classified with the final version of the text classification models. The specific flow is shown in Fig. 1 of the accompanying drawings, the text multi-classification model training flowchart.

Effect of the invention

With the same number of correct classification labels (correct labels annotated manually), cyclically expanding the training set effectively raises the accuracy of computer-based text multi-classification. Moreover, the classification efficiency of this method is far higher than that of manual classification.

Brief description of the drawings

Fig. 1 is the text multi-classification model training flowchart; Fig. 2 is the model training flowchart.

Embodiment

Background of the mobile app classification example

With technological change and application innovation in the information industry, by the end of 2015 smartphones accounted for more than 70% of all mobile phone sales in China; the total number of applications in Apple's App Store had exceeded 1,500,000, application downloads had exceeded 100 billion, and 850 applications were being downloaded every second. Millions of applications are used by mobile phone users, and many new applications come into use every day.

On the one hand, the major mobile application markets need to manage these millions of applications and classify them so that users can download and use them conveniently. On the other hand, the major telecom operators obtain the names of the applications their users run from traffic data and, in order to provide more precise and finer-grained services, also need to classify the applications they collect.

Classifying the collected application information makes it possible to analyze user preferences precisely, mine the records of users' application usage in depth, and thereby build user profiles and discover user interests and preferences.

At present the commonly used classification method is manual classification, which is very labor-intensive. Assuming each staff member can classify one application with a complete name and description every 10 seconds, classifying 1,000,000 applications takes 2,778 hours, equivalent to 116 calendar days; at 8 working hours per day, 347 working days are needed. When the collected application names are non-standard and the descriptions incomplete, staff spend even more time and effort.
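The labor estimate above can be checked with a few lines of Python. The variable names are illustrative; the inputs (1,000,000 applications, 10 seconds each, 8 working hours per day) are the figures stated in the text.

```python
apps = 1_000_000        # applications to classify
seconds_per_app = 10    # time one staff member needs per application

total_hours = apps * seconds_per_app / 3600  # seconds -> hours
calendar_days = total_hours / 24             # working around the clock
working_days = total_hours / 8               # 8 working hours per day

print(round(total_hours), round(calendar_days), round(working_days))
# prints: 2778 116 347
```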

To classify millions of applications quickly by computer, the application classification problem is converted into a text multi-classification problem. Specifically, the description of an application is treated as a piece of text; text-processing methods are applied to it to extract keywords, which are converted into a word vector; a common classification model (such as logistic regression, a decision tree, or a support vector machine) then performs the multi-class classification.
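The conversion from a description to a word vector can be illustrated with a bag-of-words count over a fixed keyword vocabulary. This is a sketch under assumptions: the function names `build_vocabulary` and `to_vector` are hypothetical, the example keywords are invented, and whitespace splitting stands in for the Chinese word segmentation a real system would need.

```python
from collections import Counter


def build_vocabulary(keywords):
    """Map each distinct keyword to a fixed vector position."""
    return {word: index for index, word in enumerate(sorted(set(keywords)))}


def to_vector(description, vocabulary):
    """Count how often each vocabulary keyword occurs in the description."""
    counts = Counter(description.lower().split())  # assumption: whitespace tokens
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


vocabulary = build_vocabulary(["game", "bank", "music"])
print(to_vector("Puzzle game with music and more music", vocabulary))
# prints: [0, 1, 2]  (positions: bank, game, music)
```

Vectors of this form are what a classification model such as logistic regression, a decision tree, or a support vector machine would take as input.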

The advantage of existing computer-based application classification methods is that they are markedly faster than manual classification. However, owing to the shortcomings of existing classification models and the immaturity of Chinese language processing technology, their classification of applications is poor and their accuracy is low.

Implementation procedure

1. App list

Directly obtain the list of apps actually installed on users' mobile phones.

Clean the app list: remove erroneous data, merge identical entries, and carry out an initial rough classification of app types.
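The cleaning step might look like the following sketch. The text does not say which errors are removed or how identical entries are recognized, so the choices here are assumptions: non-string and empty names count as erroneous data, and names that differ only in case or spacing count as the same app. The function name `clean_app_list` and the sample names are illustrative.

```python
def clean_app_list(raw_names):
    """Drop invalid entries and merge names that differ only in case or spacing."""
    seen = set()
    cleaned = []
    for name in raw_names:
        if not isinstance(name, str):
            continue  # assumption: non-string entries are erroneous data
        normalized = " ".join(name.split())  # trim and collapse whitespace
        key = normalized.casefold()
        if not key or key in seen:
            continue  # empty name, or a duplicate of an earlier entry
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


print(clean_app_list(["WeChat", " wechat ", "", None, "Alipay"]))
# prints: ['WeChat', 'Alipay']
```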

2. Crawler

Find an application website whose app classification is reasonably accurate, and crawl its app descriptions and app category data as the model training set.

Taking the cleaned app list as the target, crawl the description information of all apps.

Crawling the description information has two main steps: first, crawl from the mainstream application websites (360, Tencent Yingyongbao, Baidu Mobile Assistant); then, for information that cannot be crawled from the application websites, fill the gaps through Baidu search.

Clean the crawled data to improve the quality of the app descriptions.

3. Classification rules

Criteria for establishing classification rules:

(1) Rules are strong rules on app names that can classify an app directly.

(2) The rules are ordered, and the order can be adjusted according to business needs and results.

(3) Rules must be accurate.

The reason rules are established on app names rather than on descriptions is that app descriptions are highly varied, so rules built on them would be uncontrollable and inaccurate. The strategy is therefore to classify only with the strongest rules, which must be guaranteed accurate.
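An ordered keyword rule model on app names can be sketched as a list that is scanned from top to bottom, with the first matching rule deciding the category. The rule contents, the category names, and the function name `classify_by_rules` are hypothetical; the text does not list the actual rules.

```python
# Ordered (keyword, category) rules; earlier rules take priority.
# Both the keywords and the categories are invented for illustration.
RULES = [
    ("bank", "finance"),
    ("game", "game"),
    ("music", "music"),
]


def classify_by_rules(app_name, rules=RULES):
    """Return the category of the first rule whose keyword occurs in the name."""
    name = app_name.casefold()
    for keyword, category in rules:
        if keyword in name:
            return category
    return None  # no strong rule applies; leave the app to the trained models


print(classify_by_rules("City Bank Game"))  # prints: finance
print(classify_by_rules("Weather"))         # prints: None
```

Returning `None` rather than a guess matches the requirement that rules be accurate: an app that no strong rule covers is simply not classified by the rule model.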

4. Classification model

In a classification problem, the quality of the data set largely determines the upper limit of the classification model; the algorithm and feature engineering merely bring the result closer to that limit. The training set obtained from the mainstream application websites is of extremely poor quality, containing obvious misclassifications and duplicate classifications.

A great deal of time and effort therefore has to be spent on processing the training set and on feature engineering, so that the model performs better and better.

(1) Process of building the training set:

Select part of the crawled data set as the training set.

Adjust the training data set by removing poor-quality samples.

Manually label some categories, and use manual rules to adjust and enlarge the training set.

Recycle the app classification results: the portion of the first-round classification results that is entirely correct is used as training data in later rounds, to improve the precision of the classification results.

Existing problems: the training set is small and extremely unbalanced, and the existing training set is of poor quality (the initial training set is itself inaccurate).

Solution: for categories with few training samples, a lot of rule work is needed, so that good results can be obtained even when the training set is very small.

(2) Feature engineering:

For text, the most important feature is the words that make up the text. A lot of work is therefore needed to make the features more accurate, including building dictionaries suited to this domain. In the end we built a stop-word dictionary and a keyword dictionary: the keyword dictionary is the main basis we use for classification, and the stop-word dictionary marks the words we consider meaningless.

Each keyword is added to the stop-word dictionary or the user dictionary according to the rule that applies to it, and the correspondence between keyword combinations and categories serves as the app classification standard.

(3) Dictionaries

Stop-word dictionary: used to remove stop words. It consists of general words plus stop words.

It needs to be checked and updated repeatedly during word segmentation.
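Stop-word removal with a dictionary built from two parts can be sketched as follows. The word lists are invented placeholders, not the dictionary of the embodiment, and the names `GENERIC_WORDS`, `DOMAIN_STOP_WORDS`, and `remove_stop_words` are hypothetical.

```python
# Placeholder word lists (assumptions); a real dictionary would be much
# larger and revised repeatedly while checking segmentation results.
GENERIC_WORDS = {"the", "a", "and", "for"}
DOMAIN_STOP_WORDS = {"app", "download", "free"}
STOP_WORDS = GENERIC_WORDS | DOMAIN_STOP_WORDS


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word dictionary."""
    return [token for token in tokens if token.casefold() not in stop_words]


print(remove_stop_words("Free music app for the family".split()))
# prints: ['music', 'family']
```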

Keyword dictionary:

Keywords are extracted and screened by combining several text-feature extraction algorithms (TF-IDF and TextRank used together) with the domain characteristics of the business.

Category label words (the words that name each category) are added.
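As an illustration of the TF-IDF half of keyword extraction, the sketch below scores each word of a description by its term frequency times the logarithm of its inverse document frequency and keeps the top-scoring words as keyword candidates. TextRank and the business-specific screening are not covered. The function name `tfidf_keywords`, the unsmoothed IDF formula, and whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF words of each document."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        results.append(ranked[:top_k])
    return results


docs = ["chess game board game", "bank payment game", "bank loan payment"]
print(tfidf_keywords(docs, top_k=1))
# prints: [['board'], ['bank'], ['loan']]
```

Words that appear in many descriptions score low even when frequent, which is why "game" ranks below "board" and "chess" in the first description despite occurring twice.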

In this embodiment of the invention, the description information of applications is obtained from mainstream application markets and search engines, and the training set is expanded automatically and cyclically by combining manual rules with short-text classification models. This greatly reduces the cost of manual labeling, realizes automatic classification of applications across the whole network, and lays a foundation for mining the preferences of telecom operators' users.

The specific flow is shown in Fig. 2 of the accompanying drawings, the model training flowchart.

Claims

1. A method of cyclically expanding a training set.

2. A method of text multi-classification that jointly considers the title and the content of a text.

3. A method of building keyword and stop-word dictionaries.

4. A method of manually adding rules to improve the classification of small categories.