CN106446230A - Method for optimizing word classification in machine learning text - Google Patents
- Publication number
- CN106446230A (application CN201610881132.9A)
- Authority
- CN
- China
- Prior art keywords
- classification
- word
- text
- training
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the fields of data processing and machine-learning classification, and in particular to a method for optimizing word classification in machine-learning text processing. Building on ordinary text classification, a feature selection rule engine based on regular expressions filters out user-defined, semantically related features; the user then defines classification categories in the feature-selected training data, and these features and categories are used to perform classification training with a naive Bayes model. After training, in the application stage, whenever the text whose words are to be classified contains a sentence matching the feature selection rule engine, classification is completed with the trained model. With this method, the model's capacity for word classification is not limited to the word data present in the training samples; the method can be applied to the optimization of machine-learning word classification in text and to functions derived from it.
Description
Technical field
The present invention relates to the fields of data processing and machine-learning classification, and in particular to a method for optimizing word classification in machine-learning text processing.
Background technology
With the rapid development of information technology, the volume of information in modern society is growing explosively. In the era of big data, making good use of massive data and mining the truly valuable information from it has become a focus of public attention. Machine learning plays an increasingly visible role in data mining: for natural-language processing problems such as text classification, machine learning replaces traditional hand-written rule methods with statistical methods, an approach that has proven both effective and more efficient. On the basis of text classification, it is often desirable to go further and classify the individual words and keywords within a text so as to extract the required keyword information; this places higher demands on machine-learning classification.
Summary of the invention
The technical problem solved by the present invention is to provide a method for optimizing word classification in machine-learning text processing, addressing the problem of classifying user-defined keywords that arises in current text classification.

The technical scheme by which the present invention solves the above problem is:

On the basis of text classification, a feature selection rule engine based on regular expressions filters out user-defined, semantically related features. After feature selection, the user defines the classification categories in the training data, and these features and categories are then used to perform classification training with a naive Bayes model. After training is complete, in the application stage, whenever the text whose words are to be classified contains a sentence matching the feature selection rule engine, classification is completed with the trained model.
The concrete steps of the method are:

S1, training-set creation: create training text data that meets the requirements of the actual task, building the training set from the real environment;

S2, data preprocessing: when the text to be classified involves Chinese, the text in the training set must be preprocessed with word segmentation, stop-word removal and similar steps;

S3, in the feature selection rule engine, input user-defined regular expressions as filter conditions; the rule engine filters the training set for matching text according to these regular-expression rules and places the words located at the regular-expression wildcard positions into the segmented-word queue;

S4, according to the feature-vector model generated by the feature selection rule engine, check whether each word in the segmented-word queue satisfies the regular expression associated with each feature, and compute the weights of each word's vector;

S5, from the feature vector of each word and the user-defined classification labels in the training set, use a naive Bayes classifier to compute the class-conditional probabilities and prior probabilities, completing the training of the classification model; after training, evaluate the model on a prepared test set, compare the test results with the ground truth to assess the classification model's performance, and propose possible modifications to optimize the model;

S6, classify the text data that actually requires word classification using the trained classifier.
A single wildcard in a user-defined expression of the feature selection rule engine represents one feature value. When the rule engine examines the input text, any sentence that matches an expression is extracted, and the word or word set at the wildcard position is entered into the classification queue as an object to be classified; the user can define the meaning of the feature value represented by each wildcard.

Before training the classification model, the user processes the training data and first defines the required classification categories; the words in the whole text that match the feature selection rule engine are divided into, for example, three classes A, B and C, and the final classification result of each individual word in each training text is labeled.

During model training, if a word satisfies the first rule, the feature value represented by that rule is recorded as 1, otherwise as 0.

After model training is complete, the test results are analyzed; at this stage, the feature weights are computed by combining factors such as word position and occurrence frequency as weighting indices.
The invention provides a method for optimizing word classification in machine-learning text processing that uses regular expressions to match semantics precisely; it can be applied to the optimization of word classification in text, its derived functions, and related applications within the machine-learning domain.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawing:

Fig. 1 is a schematic diagram of the classification process of the present invention.
Specific embodiment
As shown in Fig. 1, the present invention builds on the traditional machine-learning text classification method. A feature selection rule engine based on regular expressions filters out user-defined, semantically related features; after feature selection, the user defines the classification categories in the training data, and these features and categories are used to perform classification training with a naive Bayes model. After training, in the application stage, whenever the text whose words are to be classified contains a sentence matching the feature selection rule engine, the trained model completes the classification task.
The feature selector is based on regular expressions: each wildcard in a user-defined regular expression represents one feature value. For example, the "." in ".*[xyz]+" can represent a specific feature, with a user-assigned meaning such as "the word matched at this position is a country name" or "the word matched at this position is related to religion". A feature selection rule engine may contain one or more rules, and these rules form the basis of the feature-vector model. The rule engine selects the sentences in the text that match the regular expressions; the word sets located at the corresponding wildcard positions in those sentences are exactly the words to be classified. When Chinese word classification is involved, a Chinese word-segmentation tool and a stop-word removal pass are needed to normalize the words in the classification queue.
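As an illustration of how such a rule engine can work, here is a minimal Python sketch in which each rule is an ordinary regular expression and the "wildcard" is modeled as a capture group. The rule strings and feature names below are hypothetical examples for illustration only, not rules taken from the patent.

```python
import re

# Minimal sketch of a regex-based feature selector. Each rule pairs a
# hypothetical feature name with a pattern whose single capture group
# plays the role of the "wildcard": whatever it matches is a word to
# be classified.
RULES = [
    ("country_context", r"capital of (\w+)"),
    ("religion_context", r"(\w+) is a religion"),
]

def select_candidates(text):
    """Return (word, rule_name) pairs for every sentence that matches
    a rule -- the 'classification queue' of step S3."""
    queue = []
    for sentence in re.split(r"[.!?]\s*", text):
        for name, pattern in RULES:
            for m in re.finditer(pattern, sentence):
                queue.append((m.group(1), name))
    return queue

print(select_candidates("Paris is the capital of France. Buddhism is a religion."))
# → [('France', 'country_context'), ('Buddhism', 'religion_context')]
```

A production selector for Chinese text would run after word segmentation, but the queue-building logic would be the same.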
A feature-vector model is then established according to whether these words satisfy the rules specified in the feature selection rule engine. The dimensionality of the feature-vector model equals the number of features in the rules, expressed as {feature 1, feature 2, ..., feature n}; if a word satisfies the first rule, the feature value represented by that rule is recorded as 1, otherwise as 0. Any word to be classified can therefore be represented by a feature vector of the form {1, 0, 0, 1, ...}. Once the feature vector of every word in the training set has been obtained, classification training proceeds with the naive Bayes model.
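The binary {1, 0, 0, 1, ...} vectors can be sketched as follows; the rule names and the sample queue are hypothetical stand-ins for output of the feature selector.

```python
# Minimal sketch of the feature vectors: one dimension per rule, set to 1
# when the word was extracted by that rule. RULE_NAMES and the sample
# queue below are hypothetical.
RULE_NAMES = ["country_context", "religion_context", "person_context"]

def feature_vector(word, queue):
    """queue: (word, rule_name) pairs produced by the feature selector."""
    matched = {rule for w, rule in queue if w == word}
    return [1 if name in matched else 0 for name in RULE_NAMES]

queue = [("France", "country_context"), ("Buddhism", "religion_context")]
print(feature_vector("France", queue))
# → [1, 0, 0]
```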
Here it is first assumed, per the naive Bayes theorem, that the features are mutually independent. From the prepared training set, the feature vector of each word to be classified is established together with its manually defined class. That is, before training the classification model, the user processes the training data and defines the required categories; for example, the word set matching the feature selection rule engine across the whole text is divided into three classes A, B and C, and the class of every word to be classified in each training text is labeled by hand. Training the naive Bayes model then yields the prior probability of each class and the class-conditional probability of each feature, at which point the training of the model is complete.
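Training then amounts to estimating a class prior and a per-feature class-conditional probability from the labeled vectors. Below is a minimal Bernoulli naive Bayes sketch; the Laplace smoothing and the tiny training set are assumptions added for robustness and illustration, not details given in the text.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: (feature_vector, label) pairs.
    Returns class priors and P(feature_j = 1 | class), Laplace-smoothed."""
    counts = Counter(label for _, label in samples)
    priors = {c: n / len(samples) for c, n in counts.items()}
    dims = len(samples[0][0])
    cond = defaultdict(list)
    for c in counts:
        vecs = [v for v, label in samples if label == c]
        for j in range(dims):
            ones = sum(v[j] for v in vecs)
            cond[c].append((ones + 1) / (len(vecs) + 2))
    return priors, dict(cond)

def classify(vec, priors, cond):
    """Maximum a-posteriori class under the independence assumption."""
    def log_post(c):
        s = math.log(priors[c])
        for j, x in enumerate(vec):
            p = cond[c][j]
            s += math.log(p if x else 1 - p)
        return s
    return max(priors, key=log_post)

# Hypothetical labeled feature vectors for classes A and B:
train = [([1, 0], "A"), ([1, 0], "A"), ([0, 1], "B")]
priors, cond = train_nb(train)
print(classify([1, 0], priors, cond))
# → A
```

Log probabilities are summed instead of multiplying raw probabilities so that long feature vectors do not underflow.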
After model training completes, the model must be evaluated on the test set. The test results are then analyzed so as to assess the model's performance and, to the extent possible, optimize the model. For example, the feature-weight computation may no longer use 0/1 to indicate merely whether a regular expression is satisfied; instead, factors such as word position and occurrence frequency may be combined as weighting indices in the computation.
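The text only names word position and occurrence frequency as candidate factors; one possible way of blending them, entirely hypothetical since the patent specifies no formula, could look like this:

```python
# Hypothetical refinement of the 0/1 weights: blend occurrence frequency
# with a position score (earlier words weigh more). The 50/50 blend is an
# illustrative assumption, not a formula from the patent.
def weighted_feature(matched, count, position, total_words):
    if not matched:
        return 0.0
    frequency = count / total_words              # occurrence frequency
    position_score = 1 - position / total_words  # earlier => closer to 1
    return 0.5 * frequency + 0.5 * position_score

print(weighted_feature(True, count=3, position=10, total_words=100))
```

Any such real-valued weights would require swapping the Bernoulli likelihood for one that handles continuous features (e.g. a Gaussian naive Bayes), which the patent leaves open.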
In an actual classification task, the feature selection rule engine examines the input text; whenever a sentence matches an expression, that sentence is extracted and the word or word set at the wildcard position is entered into the classification queue as an object to be classified. When the words to be classified involve Chinese, a general-purpose Chinese word-segmentation tool serves as the solution. After segmentation, stop words can be removed as needed, the feature vector of each word is built in the manner described above, and the trained classifier model then completes the classification work.
The concrete steps of the scheme described above can be as follows:

S1, training-set creation: create training text data that meets the requirements of the actual task; the training set can be built from the real environment. In the training-set texts, the words to be classified already carry manually assigned classification results. Note that the training set must be created around the one or several regular-expression rules that will be referenced in the feature selection rule engine to generate the corresponding feature items.

S2, data preprocessing: when the text to be classified involves Chinese, the text in the training set must be preprocessed with word segmentation and stop-word removal. Chinese word segmentation can use a currently common tool such as SCWS or Jcseg. Stop-word handling is fairly simple: a standard stop-word list serves as the basis for removing the corresponding words from the text. Note, however, that when a rule in the feature selection rule engine uses a stop word, that stop word is not removed.
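The stop-word exception just noted, that a stop word used inside a rule must survive preprocessing, can be sketched as follows. The stop-word list, rule pattern, and pre-segmented token list are hypothetical; a real pipeline would first segment Chinese text with a tool such as SCWS or Jcseg.

```python
import re

STOPWORDS = {"the", "of", "is", "a"}        # hypothetical stop-word list
RULE_PATTERNS = [r"capital of (\w+)"]       # hypothetical rule using "of"

def remove_stopwords(tokens):
    """Drop stop words, except those that literally occur inside a rule
    pattern: removing them would make the text stop matching the rule."""
    rule_words = set()
    for pattern in RULE_PATTERNS:
        rule_words.update(re.findall(r"[^\W\d_]+", pattern))
    return [t for t in tokens if t not in STOPWORDS or t in rule_words]

print(remove_stopwords(["Paris", "is", "the", "capital", "of", "France"]))
# → ['Paris', 'capital', 'of', 'France']
```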
S3, in the feature selection rule engine, input user-defined regular expressions as filter conditions; the rule engine filters the training set for matching text according to the regular-expression rules and places the words at the regular-expression wildcard positions into the segmented-word queue.

S4, according to the feature-vector model generated by the feature selection rule engine, check whether each word in the segmented-word queue satisfies the regular expression associated with each feature, and compute the weights of each word's vector. Concretely: when the text containing a word satisfies regular expression 1 in the feature selection rule engine, the weight of the feature-vector component corresponding to that expression is set to 1, otherwise to 0. The feature vector of each word is thus expressed in a form like {1, 0, 0, 1, ...}.
S5, from the feature vector of each word and the user-defined classification labels in the training set, use a naive Bayes classifier to compute the class-conditional probabilities and prior probabilities, completing the training of the classification model. After training, evaluate the model on a prepared test set; comparing the test results with the ground truth yields a performance assessment of the classification model, from which possible modifications are proposed to optimize the model.

S6, classify the text data that actually requires word classification using the trained classifier. Note that the text to be classified must contain sentences matching the rules in the feature selection rule engine, and the classes to be assigned must be consistent with the classes defined for the rule engine; otherwise, the rules and classification categories must be redefined and the model retrained before the new classification task can be completed.
The embodiment described above is one example of the present invention, not all of them. Based on this example, any other embodiment obtained by a person of ordinary skill in the art without creative effort falls within the scope of protection of the present invention.
Claims (7)
1. A method for optimizing word classification in machine-learning text processing, characterized in that: on the basis of text classification, a feature selection rule engine based on regular expressions filters out user-defined, semantically related features; after feature selection, the user defines the classification categories in the training data, and these features and categories are used to perform classification training with a naive Bayes model; after training, in the application stage, whenever the text whose words are to be classified contains a sentence matching the feature selection rule engine, classification is completed with the trained model.
2. The method according to claim 1, characterized in that the concrete steps of the method are:

S1, training-set creation: create training text data that meets the requirements of the actual task, building the training set from the real environment;

S2, data preprocessing: when the text to be classified involves Chinese, the text in the training set must be preprocessed with word segmentation and stop-word removal;

S3, in the feature selection rule engine, input user-defined regular expressions as filter conditions; the rule engine filters the training set for matching text according to the regular-expression rules and places the words at the regular-expression wildcard positions into the segmented-word queue;

S4, according to the feature-vector model generated by the feature selection rule engine, check whether each word in the segmented-word queue satisfies the regular expression associated with each feature, and compute the weights of each word's vector;

S5, from the feature vector of each word and the user-defined classification labels in the training set, use a naive Bayes classifier to compute the class-conditional probabilities and prior probabilities, completing the training of the classification model; after training, evaluate the model on a prepared test set, compare the test results with the ground truth to assess the classification model's performance, and propose possible modifications to optimize the model;

S6, classify the text data that actually requires word classification using the trained classifier.
3. The method according to claim 1, characterized in that: a single wildcard in a user-defined expression of the feature selection rule engine represents one feature value; when the rule engine examines the input text, any sentence that matches an expression is extracted, and the word or word set at the wildcard position is entered into the classification queue as an object to be classified; the user can define the meaning of the feature value represented by each wildcard.
4. The method according to claim 2, characterized in that: a single wildcard in a user-defined expression of the feature selection rule engine represents one feature value; when the rule engine examines the input text, any sentence that matches an expression is extracted, and the word or word set at the wildcard position is entered into the classification queue as an object to be classified; the user can define the meaning of the feature value represented by each wildcard.
5. The method according to any one of claims 1 to 4, characterized in that: before training the classification model, the user processes the training data and first defines the required classification categories; the words in the whole text that match the feature selection rule engine are divided into, for example, three classes A, B and C, and the final classification result of each individual word in each training text is labeled.
6. The method according to any one of claims 1 to 4, characterized in that:

during model training, if a word satisfies the first rule, the feature value represented by that rule is recorded as 1, otherwise as 0;

after model training is complete, the test results are analyzed; at this stage, the feature weights are computed by combining factors such as word position and occurrence frequency as weighting indices.
7. The method according to claim 5, characterized in that:

during model training, if a word satisfies the first rule, the feature value represented by that rule is recorded as 1, otherwise as 0;

after model training is complete, the test results are analyzed; at this stage, the feature weights are computed by combining factors such as word position and occurrence frequency as weighting indices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610881132.9A CN106446230A (en) | 2016-10-08 | 2016-10-08 | Method for optimizing word classification in machine learning text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106446230A true CN106446230A (en) | 2017-02-22 |
Family
ID=58172086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610881132.9A Pending CN106446230A (en) | 2016-10-08 | 2016-10-08 | Method for optimizing word classification in machine learning text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446230A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559174A (en) * | 2013-09-30 | 2014-02-05 | 东软集团股份有限公司 | Semantic emotion classification characteristic value extraction method and system |
US20140279761A1 (en) * | 2013-03-15 | 2014-09-18 | Konstantinos (Constantin) F. Aliferis | Document Coding Computer System and Method With Integrated Quality Assurance |
CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368464B (en) * | 2017-07-28 | 2020-07-10 | 深圳数众科技有限公司 | Method and device for acquiring bidding product information |
CN107368464A (en) * | 2017-07-28 | 2017-11-21 | 深圳数众科技有限公司 | A kind of method and device for obtaining bid product information |
CN107679734A (en) * | 2017-09-27 | 2018-02-09 | 成都四方伟业软件股份有限公司 | It is a kind of to be used for the method and system without label data classification prediction |
CN108470116A (en) * | 2018-03-03 | 2018-08-31 | 淄博职业学院 | The personal identification method and device of a kind of computer system and its user |
CN108491390A (en) * | 2018-03-28 | 2018-09-04 | 江苏满运软件科技有限公司 | A kind of main line logistics goods title automatic recognition classification method |
CN108519978A (en) * | 2018-04-10 | 2018-09-11 | 成都信息工程大学 | A kind of Chinese document segmenting method based on Active Learning |
CN109144999A (en) * | 2018-08-02 | 2019-01-04 | 东软集团股份有限公司 | A kind of data positioning method, device and storage medium, program product |
CN109144999B (en) * | 2018-08-02 | 2021-06-08 | 东软集团股份有限公司 | Data positioning method, device, storage medium and program product |
CN109409533A (en) * | 2018-09-28 | 2019-03-01 | 深圳乐信软件技术有限公司 | A kind of generation method of machine learning model, device, equipment and storage medium |
CN109508370A (en) * | 2018-09-28 | 2019-03-22 | 北京百度网讯科技有限公司 | Opinions Extraction method, equipment and storage medium |
CN109409533B (en) * | 2018-09-28 | 2021-07-27 | 深圳乐信软件技术有限公司 | Method, device, equipment and storage medium for generating machine learning model |
CN110968687B (en) * | 2018-09-30 | 2023-06-16 | 北京国双科技有限公司 | Method and device for classifying text |
CN110968687A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Method and device for classifying texts |
CN111428518A (en) * | 2019-01-09 | 2020-07-17 | 科大讯飞股份有限公司 | Low-frequency word translation method and device |
CN111428518B (en) * | 2019-01-09 | 2023-11-21 | 科大讯飞股份有限公司 | Low-frequency word translation method and device |
CN110457566A (en) * | 2019-08-15 | 2019-11-15 | 腾讯科技(武汉)有限公司 | Method, device, electronic equipment and storage medium |
WO2021063089A1 (en) * | 2019-09-30 | 2021-04-08 | 华为技术有限公司 | Rule matching method, rule matching apparatus, storage medium and electronic device |
CN112579733A (en) * | 2019-09-30 | 2021-03-30 | 华为技术有限公司 | Rule matching method, rule matching device, storage medium and electronic equipment |
CN112579733B (en) * | 2019-09-30 | 2023-10-20 | 华为技术有限公司 | Rule matching method, rule matching device, storage medium and electronic equipment |
CN113742479A (en) * | 2020-05-29 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Method and device for screening target text |
CN113536785A (en) * | 2021-06-15 | 2021-10-22 | 合肥讯飞数码科技有限公司 | Text recommendation method, intelligent terminal and computer readable storage medium |
CN117556049A (en) * | 2024-01-10 | 2024-02-13 | 杭州光云科技股份有限公司 | Text classification method of regular expression generated based on large language model |
CN117556049B (en) * | 2024-01-10 | 2024-05-17 | 杭州光云科技股份有限公司 | Text classification method of regular expression generated based on large language model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106446230A (en) | Method for optimizing word classification in machine learning text | |
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
CN108199951A (en) | A kind of rubbish mail filtering method based on more algorithm fusion models | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
CN102622373B (en) | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm | |
CN107944480A (en) | A kind of enterprises ' industry sorting technique | |
CN107193801A (en) | A kind of short text characteristic optimization and sentiment analysis method based on depth belief network | |
CN106021410A (en) | Source code annotation quality evaluation method based on machine learning | |
CN106844424A (en) | A kind of file classification method based on LDA | |
CN107301171A (en) | A kind of text emotion analysis method and system learnt based on sentiment dictionary | |
CN107291723A (en) | The method and apparatus of web page text classification, the method and apparatus of web page text identification | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN106202561A (en) | Digitized contingency management case library construction methods based on the big data of text and device | |
CN105373606A (en) | Unbalanced data sampling method in improved C4.5 decision tree algorithm | |
CN106294783A (en) | A kind of video recommendation method and device | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN101876987A (en) | Overlapped-between-clusters-oriented method for classifying two types of texts | |
KR20120109943A (en) | Emotion classification method for analysis of emotion immanent in sentence | |
CN109670039A (en) | Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering | |
CN106897290B (en) | Method and device for establishing keyword model | |
CN102541838A (en) | Method and equipment for optimizing emotional classifier | |
CN110008309A (en) | A kind of short phrase picking method and device | |
CN103593431A (en) | Internet public opinion analyzing method and device | |
CN108280164A (en) | A kind of short text filtering and sorting technique based on classification related words | |
CN109145108A (en) | Classifier training method, classification method, device and computer equipment is laminated in text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170222 |