CN107590156A - A method for multi-class text classification based on cyclic expansion of the training set - Google Patents

Publication number
CN107590156A
Authority
CN
China
Prior art keywords
text
classification
training set
multi-class
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610535646.9A
Other languages
Chinese (zh)
Inventor
李雪鹏
田昊枢
毛智愚
欧高炎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Letter To Princeton Technology Co Ltd
Original Assignee
Beijing Letter To Princeton Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-07-09
Filing date
2016-07-09
Publication date
2018-01-16
Application filed by Beijing Letter To Princeton Technology Co Ltd
Priority to CN201610535646.9A
Publication of CN107590156A

Abstract

The present invention relates to the technical field of multi-class text classification systems, and in particular to a machine-learning-based method for classifying application programs. By cyclically expanding the training set, it improves the results of multi-class text classification performed by computer. The technical solution includes: a multi-class text classification method that considers both the title and the body of a text; a method for building keyword and stop-word dictionaries; and the manual addition of rules to improve the classification of small categories. Given the same number of correct classification results (that is, manually labeled results), cyclic expansion of the training set can effectively improve the accuracy of multi-class text classification performed by computer. At the same time, the method classifies far more efficiently than manual classification.

Description

A method for multi-class text classification based on cyclic expansion of the training set
Technical field
The present invention relates to the technical field of multi-class text classification systems, and in particular to a machine-learning-based method for classifying application programs.
Background
At present there are two approaches to multi-class text classification. The first is manual classification. Although its accuracy is relatively high, it requires a great deal of time and labor, and its throughput falls far behind that of algorithmic classification, so it is unsuitable when a large volume of items must be classified. Moreover, the accuracy of manual classification depends closely on human subjective factors: a person cannot apply one uniform standard of judgment throughout all working hours. The second approach is classification by computer. A computer needs enough correct classification results to serve as a training set, and the richer the training set, the better the classification results. The originally available correct classification data, however, is often too small for this purpose, so improving the computer's results again requires a large amount of manual classification work.
Summary of the invention
Purpose of the invention
The training set for multi-class text classification is enlarged by cyclic expansion, in order to improve the results of multi-class text classification performed by computer.
Technical solution of the invention
A method for multi-class text classification, including:
Obtain the title of each text to be classified and clean up the text body. Build a rule model based on keyword matching, drawing on common knowledge. Manually label a small number of samples for training, then build one short-text classification model from the text bodies and another from the text titles. Verify the accuracy of each text classification model by cross-validation. Classify the texts to be classified (here, application programs) with the rule model and the two short-text classification models, and add the samples on which all three models agree to the training set. Retrain the text classification models, verify their accuracy by cross-validation, and use the new text classification models together with the rule model to classify the remaining items. Repeat this process until the accuracy of the text classification models exceeds a preset threshold. Finally, classify the texts to be classified with the final version of the text classification model.
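The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: `TokenVoteClassifier` is a toy stand-in for a real short-text model, leave-one-out accuracy stands in for cross-validation, and every name, sample, and threshold below is hypothetical.

```python
from collections import Counter, defaultdict

class TokenVoteClassifier:
    """Toy stand-in for a short-text model: each token votes for the labels
    it appeared with during training."""
    def fit(self, texts, labels):
        self.votes = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for token in text.lower().split():
                self.votes[token][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for token in text.lower().split():
            tally.update(self.votes.get(token, {}))
        return tally.most_common(1)[0][0] if tally else None

def rule_predict(title, rules):
    """Keyword rule model: the first matching (keyword, label) pair wins."""
    for keyword, label in rules:
        if keyword in title.lower():
            return label
    return None

def loo_accuracy(samples):
    """Leave-one-out accuracy of the body model (simplest cross-validation)."""
    hits = 0
    for i, (_, body, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        model = TokenVoteClassifier().fit([b for _, b, _ in rest],
                                          [y for _, _, y in rest])
        hits += model.predict(body) == label
    return hits / len(samples)

def expand_training_set(labeled, unlabeled, rules, threshold=0.9, max_rounds=10):
    """labeled: (title, body, label) tuples; unlabeled: (title, body) tuples."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        labels = [y for _, _, y in labeled]
        title_model = TokenVoteClassifier().fit([t for t, _, _ in labeled], labels)
        body_model = TokenVoteClassifier().fit([b for _, b, _ in labeled], labels)
        agreed, remaining = [], []
        for title, body in pool:
            votes = {rule_predict(title, rules),
                     title_model.predict(title),
                     body_model.predict(body)}
            # Keep a sample only when all three models give the same label.
            if len(votes) == 1 and None not in votes:
                agreed.append((title, body, votes.pop()))
            else:
                remaining.append((title, body))
        labeled, pool = labeled + agreed, remaining
        # Stop when nothing new was added or the accuracy threshold is passed.
        if not agreed or loo_accuracy(labeled) > threshold:
            break
    return labeled, pool

seed = [("Chess Master", "play chess board game", "game"),
        ("Bank Pay", "transfer money bank account", "finance")]
todo = [("Chess Puzzle", "board game puzzle"),
        ("Bank Wallet", "money wallet account"),
        ("Weather Now", "forecast rain sun")]
rules = [("chess", "game"), ("bank", "finance")]
grown, leftover = expand_training_set(seed, todo, rules)
```

Samples on which the three models disagree, or for which any model has no answer, stay in the pool for a later round or for manual review.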
Preferably, the short-text classification model is based on at least one of the following: a support vector machine classification model, a random forest classification model, a logistic regression classification model, an AdaBoost classification model, or a combined model that merges multiple single models, each built on a different kind of text description information.
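The patent does not say how the single models are merged into a combined model. One common option, shown here purely as an assumption, is majority voting. The three lambda models and the `combine` helper are hypothetical placeholders for trained classifiers.

```python
from collections import Counter

def combine(models, sample):
    """Majority vote over single models; returns None on an empty list or a tie."""
    votes = Counter(model(sample) for model in models)
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority
    return ranked[0][0]

# Hypothetical single models, each looking at a different piece of description.
title_model = lambda s: "game" if "chess" in s["title"].lower() else "tool"
body_model = lambda s: "game" if "play" in s["body"].lower() else "tool"
name_rule = lambda s: "finance" if "bank" in s["title"].lower() else "tool"

sample = {"title": "Chess Master", "body": "Play against friends"}
label = combine([title_model, body_model, name_rule], sample)
tie = combine([title_model, name_rule], sample)
```

Returning `None` on a tie leaves the decision to a later round or a human reviewer rather than guessing.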
A device for classifying application programs, including:
A rule model building unit, for building a rule model based on keyword matching, drawing on common knowledge;
A short-text classification model building unit, for manually labeling a small number of samples for training, building separate short-text classification models from the text titles and from the text bodies, and verifying the accuracy of the text classification models by cross-validation;
A training set cyclic expansion module, for classifying the texts to be classified with the rule model and the short-text classification models, adding the samples on which all three models agree to the training set, retraining the text classification models, verifying their accuracy by cross-validation, classifying the remaining application programs with the new text classification models and the rule model, and repeating this process until the accuracy of the text classification models exceeds a preset threshold;
A text classification module, for classifying the application programs to be classified with the final version of the text classification model. The detailed flow is shown in Fig. 1 of the accompanying drawings, the training flow chart of the multi-class text classification model.
Effect of the invention
Given the same number of correct classification results (that is, manually labeled results), cyclic expansion of the training set can effectively improve the accuracy of multi-class text classification performed by computer. At the same time, the method classifies far more efficiently than manual classification.
Brief description of the drawings
Fig. 1 is the training flow chart of the multi-class text classification model; Fig. 2 is the model training flow chart.
Embodiment
Background of the mobile app classification example
With technological change and application innovation in the information industry, by the end of 2015 smartphones accounted for more than 70% of all mobile phone sales in China. The total number of applications in the Apple App Store had passed 1.5 million, cumulative downloads had passed 100 billion, and 850 applications were being downloaded every second. Millions of applications are in use by mobile phone users, and many new ones come into use every day.
On the one hand, the major mobile app markets need to manage these millions of applications and classify them so that users can download and use them easily. On the other hand, the major telecom operators obtain the names of the applications their users run from network traffic, and in order to provide more precise and fine-grained services they also need to classify the applications they collect.
Classifying the collected application information makes it possible to analyze user preferences precisely, mine the records of users' application usage in depth, build user profiles, and discover users' interests and preferences.
At present the commonly used approach is manual classification, which consumes a great deal of human resources. Suppose each staff member can classify one application with a complete name and description every 10 seconds. Classifying 1,000,000 applications then takes about 2,778 hours, equivalent to about 116 calendar days; at 8 working hours per day, that is about 347 working days. When the collected application names are non-standard and the descriptions incomplete, staff spend even more time and effort.
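The manual-effort estimate above can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs (1,000,000 applications, 10 seconds each, 8 working hours per day) are the ones stated in the text.

```python
apps = 1_000_000
seconds_per_app = 10

hours = apps * seconds_per_app / 3600   # total effort in hours
calendar_days = hours / 24              # if the work ran around the clock
working_days = hours / 8                # at 8 working hours per day
```

Rounded to whole units, these values reproduce the 2,778 hours, 116 days, and 347 working days quoted above.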
So that millions of applications can be classified quickly by computer with a multi-class text classification method, the application classification problem is converted into a multi-class text classification problem. Specifically, the description of each application is treated as a piece of text; text processing is applied to it, its keywords are extracted and converted into word vectors, and a common classification model (such as logistic regression, a decision tree, or a support vector machine) then performs the multi-class classification.
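As a rough illustration of turning a description into a vector that a standard classifier can consume, the sketch below builds a bag-of-words count vector. The patent does not specify the vector representation, so this choice is an assumption. Whitespace tokenization is also an assumption (Chinese text would need a word segmenter), and all function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Assumption: whitespace-separated tokens; Chinese needs word segmentation.
    return text.lower().split()

def build_vocabulary(descriptions):
    """Sorted list of every distinct token seen in the descriptions."""
    return sorted({tok for d in descriptions for tok in tokenize(d)})

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

descriptions = ["Play chess online", "Transfer money online"]
vocabulary = build_vocabulary(descriptions)
vector = to_vector("play chess chess", vocabulary)
```

Each position in `vector` corresponds to one vocabulary word, so descriptions of any length map to vectors of the same size.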
Existing methods for classifying applications by computer are noticeably faster than manual classification. However, because of defects in the existing classification models themselves and the immaturity of Chinese language processing technology, they classify applications poorly and with low accuracy.
Implementation procedure
1. App list
Obtain the list of apps actually installed on users' phones.
Clean the app list: remove erroneous data, merge entries of the same kind, and make a first rough classification of app types.
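A minimal sketch of the list-cleaning step follows. It assumes that erroneous data means empty or whitespace-only names, and that similar entries are names differing only in letter case or surrounding whitespace. Neither definition comes from the patent, and `clean_app_list` is a hypothetical helper.

```python
def clean_app_list(names):
    """Drop empty names and merge case/whitespace variants, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in names:
        name = raw.strip().casefold()
        if not name or name in seen:
            continue
        seen.add(name)
        cleaned.append(name)
    return cleaned

apps = clean_app_list(["WeChat ", "wechat", "", "   ", "Alipay"])
```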
2. Crawler
Identify application websites whose app classification is accurate, and crawl their app descriptions and app category data to serve as the model training set.
Taking the cleaned app list as the target, crawl the description of every app.
Descriptions are crawled in two steps: first from mainstream application websites (360, Tencent Yingyongbao, Baidu Mobile Assistant); then, for apps whose information cannot be found on those websites, through Baidu search.
Clean the crawled data to improve the quality of the app descriptions.
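The two-step lookup (application websites first, search engine as a fallback) could be organized as below. The sketch performs no network access: the sources are hypothetical in-memory stand-ins for real crawlers, and `fetch_description` is an illustrative name.

```python
def fetch_description(app_name, primary_sources, fallback):
    """Try each application-website source in order; fall back to search if all miss."""
    for source in primary_sources:
        text = source(app_name)
        if text:
            return text.strip()
    text = fallback(app_name)
    return text.strip() if text else None

# Hypothetical stand-ins: each maps an app name to a description, or to None.
store_a = {"chess master": " A chess game. "}.get
store_b = {"bank pay": "Mobile banking."}.get
search = {"weather now": "Local forecasts."}.get

found = fetch_description("chess master", [store_a, store_b], search)
via_search = fetch_description("weather now", [store_a, store_b], search)
missing = fetch_description("unknown app", [store_a, store_b], search)
```

Stripping whitespace here is only a token gesture toward the cleaning step; real description cleaning would need more than that.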
3. Classification rules
Criteria for establishing classification rules:
(1) Rules are strong rules that can classify an app directly from its name.
(2) The rules are ordered, and the order can be adjusted according to business needs and results.
(3) Rules must be accurate.
Why rules are built on app names rather than descriptions: app descriptions vary widely, so rules on them would be hard to control and inaccurate. The strategy is therefore to classify only with the strongest rules, which must be guaranteed accurate.
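An ordered, name-based rule model of the kind described above can be sketched as a list of (keyword, category) pairs checked in sequence. The rules and categories shown are invented for illustration, and substring matching is an assumption about what keyword matching means here.

```python
def classify_by_name(app_name, rules):
    """Return the category of the first rule whose keyword occurs in the app name."""
    name = app_name.lower()
    for keyword, category in rules:
        if keyword in name:
            return category
    return None  # no strong rule applies; leave it to the models

rules = [("bank", "finance"), ("game", "games")]

first = classify_by_name("Bank Heist Game", rules)
swapped = classify_by_name("Bank Heist Game", list(reversed(rules)))
unmatched = classify_by_name("Weather Now", rules)
```

The same name yields different categories depending on rule order, which is why the order has to be tuned against business needs and observed results.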
4. Classification model
In a classification problem, the quality of the data set largely determines the upper limit of the classification model; algorithms and feature engineering only bring the results closer to that limit. The training set taken from mainstream application websites is of very poor quality, containing obvious misclassifications and duplicate classifications.
A great deal of time and energy must therefore be spent on processing the training set and on feature engineering so that the model's results keep improving.
(1) Process for preparing the training set:
Select part of the crawled data set as the training set.
Remove poor-quality samples from the training data set.
Manually label some categories, and use manual rules to adjust and enlarge the training set.
Reuse app classification results: take the correctly classified part of the first round of classification results as training data, to improve the precision of the classification results.
Problems: the training set is small and extremely imbalanced, and the existing training set is of poor quality (the initial training set is itself inaccurate).
Solution: categories with few training samples require extensive rule work, so that good results can be obtained even with very little training data.
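To decide which categories need manual rules, one simple approach is to count the training samples per category and flag those below a cutoff. The cutoff value and the helper name `small_classes` are assumptions made for this sketch; the patent gives no specific number.

```python
from collections import Counter

def small_classes(labels, min_samples):
    """Categories with fewer than min_samples training examples, sorted by name."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n < min_samples)

labels = ["game"] * 5 + ["finance"] * 4 + ["weather"]
needs_rules = small_classes(labels, min_samples=3)
```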
(2)Feature Engineering:
For text, the most important feature is the set of words that make up the text, so considerable work is needed to make these features more accurate, including building dictionaries suited to this domain. In the end a stop-word dictionary and a keyword dictionary were built: the keyword dictionary is the main basis for classification, and the stop-word dictionary marks words that carry no meaning.
According to the rules that apply to it, each keyword is added to either the stop-word dictionary or the user dictionary, and the correspondence between keyword combinations and categories serves as the app classification standard.
(3)Dictionary
Stop-word dictionary: used to remove stop words. It consists of generic words plus stop words.
It needs to be checked and updated repeatedly during word segmentation.
Keyword dictionary:
Keywords are extracted and screened by combining several text feature extraction algorithms (TF-IDF and TextRank used together) with the domain characteristics of the business.
Category label words (the names of the various categories) are added.
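The sketch below shows stop-word filtering followed by TF-IDF ranking of candidate keywords. It covers only the TF-IDF half of the combination mentioned above (TextRank is omitted). It uses a plain `log(N / df)` weighting, which is one of several TF-IDF variants, and the stop-word list and sample descriptions are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "for", "and", "to", "your"}  # hypothetical list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def top_keywords(docs, k=2):
    """For each document, return its k highest-scoring tokens by TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {t: (c / len(toks)) * math.log(n_docs / df[t])
                  for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

docs = ["play the chess game", "pay the bank bill", "play the card game"]
keywords = top_keywords(docs, k=1)
```

Tokens that occur in every document score zero, so in practice the ranked candidates would still be screened by hand against the business domain before entering the keyword dictionary.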
In this embodiment of the invention, application descriptions are obtained from mainstream application markets and search engines, and the training set is expanded automatically and cyclically by combining manual rules with short-text classification models. This greatly reduces the cost of manual labeling, realizes a device that automatically classifies applications across the whole network, and lays a foundation for mining the preferences of telecom operators' users.
The detailed flow is shown in Fig. 2 of the accompanying drawings, the model training flow chart.

Claims (4)

1. A method of cyclically expanding a training set.
2. A method of multi-class text classification that considers both the title and the body of a text.
3. A method of building keyword and stop-word dictionaries.
4. A method of manually adding rules to improve the classification of small categories.
CN201610535646.9A 2016-07-09 2016-07-09 A method for multi-class text classification based on cyclic expansion of the training set Pending CN107590156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610535646.9A CN107590156A (en) 2016-07-09 2016-07-09 A method for multi-class text classification based on cyclic expansion of the training set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610535646.9A CN107590156A (en) 2016-07-09 2016-07-09 A method for multi-class text classification based on cyclic expansion of the training set

Publications (1)

Publication Number Publication Date
CN107590156A true CN107590156A (en) 2018-01-16

Family

ID=61045783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610535646.9A Pending CN107590156A (en) 2016-07-09 2016-07-09 A method for multi-class text classification based on cyclic expansion of the training set

Country Status (1)

Country Link
CN (1) CN107590156A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596338A (en) * 2018-05-09 2018-09-28 四川斐讯信息技术有限公司 A method and system for acquiring a neural network training set
CN112185571A (en) * 2020-09-17 2021-01-05 吾征智能技术(北京)有限公司 Disease auxiliary diagnosis system, device and storage medium based on oral acid
CN112749530A (en) * 2021-01-11 2021-05-04 北京光速斑马数据科技有限公司 Text encoding method, device, equipment and computer readable storage medium
CN113691492A (en) * 2021-06-11 2021-11-23 杭州安恒信息安全技术有限公司 Method, system, device and readable storage medium for determining illegal application program

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596338A (en) * 2018-05-09 2018-09-28 四川斐讯信息技术有限公司 A method and system for acquiring a neural network training set
CN112185571A (en) * 2020-09-17 2021-01-05 吾征智能技术(北京)有限公司 Disease auxiliary diagnosis system, device and storage medium based on oral acid
CN112185571B (en) * 2020-09-17 2024-01-16 吾征智能技术(北京)有限公司 Disease auxiliary diagnosis system, equipment and storage medium based on orotic acid
CN112749530A (en) * 2021-01-11 2021-05-04 北京光速斑马数据科技有限公司 Text encoding method, device, equipment and computer readable storage medium
CN112749530B (en) * 2021-01-11 2023-12-19 北京光速斑马数据科技有限公司 Text encoding method, apparatus, device and computer readable storage medium
CN113691492A (en) * 2021-06-11 2021-11-23 杭州安恒信息安全技术有限公司 Method, system, device and readable storage medium for determining illegal application program

Similar Documents

Publication Publication Date Title
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN108985293A (en) An automatic image annotation method and system based on deep learning
CN106960063A (en) An internet information crawling and recommendation system for the investment-promotion field
CN107169001A (en) A text classification model optimization method based on crowdsourced feedback and active learning
CN107590156A (en) A method for multi-class text classification based on cyclic expansion of the training set
CN107220295A (en) A method for retrieving people's-dispute mediation cases and recommending mediation strategies
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN107194617B (en) App software engineer soft skill classification system and method
CN109299271A (en) Training sample generation, text data and public opinion event classification methods and related devices
CN106022708A (en) Method for predicting employee resignation
CN110334214A (en) A method for automatically identifying false litigation in cases
CN107844558A (en) A method and related apparatus for determining classification information
CN108153895A (en) A corpus construction method and system based on open data
CN108491388A (en) Data set acquisition method, classification method, apparatus, device and storage medium
CN107066548B (en) A method for extracting web page links by two-dimensional classification
CN113495959B (en) Financial public opinion identification method and system based on text data
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN103246655A (en) Text categorizing method, device and system
CN107465643A (en) A deep-learning-based network traffic classification method
CN108280164A (en) A short text filtering and classification method based on category-related words
CN106951565A (en) Text classification method and the resulting text classifier
CN106569996A (en) Chinese-microblog-oriented emotional tendency analysis method
CN112347254A (en) News text classification method and device, computer equipment and storage medium
CN111860981A (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN102214227A (en) Automatic public opinion monitoring method based on internet hierarchical structure storage

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180116
