CN110263173A - Machine learning method and device for rapidly improving text classification performance - Google Patents

Machine learning method and device for rapidly improving text classification performance

Info

Publication number
CN110263173A
CN110263173A
Authority
CN
China
Prior art keywords
model
text
existing
existing model
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910565455.0A
Other languages
Chinese (zh)
Inventor
李宇峰
石锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201910565455.0A priority Critical patent/CN110263173A/en
Publication of CN110263173A publication Critical patent/CN110263173A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a machine learning method and device for rapidly improving text classification performance. The main technical solution of the present invention is as follows: obtain a group of models relevant to the target task; select the text samples that need to be queried by reusing the existing models, filtering out unnecessary queries, which helps to obtain a more accurate active learning model and saves a large amount of query cost; and update the importance weights of the existing models based on the labeled text samples, so as to better filter unnecessary queries. The present invention is easy to implement and efficient, and achieves a rapid performance improvement of the model at a small query cost.

Description

Machine learning method and device for rapidly improving text classification performance
Technical field
The present invention relates to a machine learning method and device for rapidly improving text classification performance, and belongs to the technical field of machine learning for text classification.
Background technique
With the development of information technology, Internet data and resources have taken on a massive scale. To effectively manage and use this distributed mass of information, content-based information retrieval and data mining have increasingly become fields of wide concern. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text according to its content, under a previously given set of category labels. Text classification is widely applied in fields such as natural language processing and understanding, information organization and management, and content-based information filtering.
Text classification methods based on machine learning, which have gradually matured, focus more on the automatic model mining and dynamic optimization capability of the classifier. In both classification accuracy and flexibility they surpass the earlier text classification approaches based on knowledge engineering and expert systems, and they have become the classical paradigm of research and application in related fields. However, many shortcomings remain. First, training a powerful machine learning model requires a large number of training samples, and collecting a large amount of labeled data is very difficult in many real-world tasks. Second, once a model has been trained, it is hard for it to keep performing well if the environment of the actual task changes, while simply discarding it wastes resources.
Model reuse aims to reduce the training resources required by the target task and has attracted wide attention in recent years. When the labeled samples of the target task are limited, existing model reuse methods can achieve a significant performance improvement. However, previous model reuse methods acquire labeled samples passively, which limits how fast the performance of the machine learning model can improve. This does not meet the demand of many real-world text tasks, which generally expect the performance of the model to improve quickly.
Summary of the invention
Object of the invention: Aiming at the problems and deficiencies in the prior art, the present invention proposes a machine learning method and device for rapidly improving text classification performance, which alleviates the problem of slow performance improvement during the training of machine learning models, effectively reduces the resource overhead of the training process, and improves the utilization efficiency of existing models and labeled samples.
Technical solution: A machine learning method for rapidly improving text classification performance specifically includes:
1) obtaining a target text classification data set, in which part of the text samples carry labels;
2) obtaining a group of models relevant to the target text classification task, the capability of these models being limited;
3) selecting the text samples that need to be queried by reusing the existing models, which helps to obtain a more accurate active learning model and saves a large amount of query cost;
4) updating the importance weights of the existing models based on the principle of classification error minimization, so as to better filter unnecessary queries;
5) using the final model as the machine learning model on the target text data set.
Optionally, the group of existing models relevant to the target text is obtained from models trained on a large number of labeled samples from existing related data sets. Because the data distributions differ, the performance of these models on the target task is often limited.
Optionally, the steps of filtering unnecessary queries using the existing models are as follows (a hedged sketch of the decision in step 3 is given after this list):
1) select the text sample to be queried through active learning, where a query means obtaining the label of the text sample from a domain expert;
2) compute the prediction confidence of the text sample using the existing models;
3) judge whether a query is needed according to the prediction confidence. Specifically, if the prediction confidence is above a specified threshold, the label is provided by the existing models; otherwise, the sample is labeled by the domain expert.
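As referenced above, the following is a minimal Python sketch of the decision in step 3); the function name and the default threshold are assumptions for illustration, not part of the patent text.

```python
def should_query_expert(alpha: float, threshold: float = 0.5) -> bool:
    """Step 3) of the filter: if the prediction confidence alpha of the
    existing models exceeds the specified threshold, take their label and
    skip the query; otherwise send the text sample to the domain expert."""
    return alpha <= threshold
```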
Optionally, updating the importance weights of the existing models based on the principle of classification error minimization means increasing the weights of the existing models that make a larger contribution to the performance improvement while, at the same time, decreasing the weights of the existing models that contribute little to it.
Optionally, the existing model reuse methods include but are not limited to: using Adaptive SVM to reuse an SVM model, i.e., using the weights of the existing model as a regularization term to guide the modeling on the training text samples of the target task; using STRUT and SER to reuse a random forest model, i.e., using the structural information of the decision trees and the distribution information of the text data to guide the modeling on the training text samples of the target task; and using deep learning fine-tuning to reuse a deep learning model, i.e., freezing part of the convolutional layers of the existing model and training the remaining convolutional layers and fully connected layers, for example with linear logistic regression, to complete the model reuse.
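As one illustration of the fine-tuning variant above, the following PyTorch-style sketch freezes the first convolutional layers of an existing text model and exposes the remaining parameters for training on the target task. The layer type nn.Conv1d and the helper name are assumptions; the patent does not prescribe a framework.

```python
import torch.nn as nn

def freeze_conv_layers(model: nn.Module, n_frozen: int):
    """Freeze the first n_frozen convolutional layers of an existing model;
    return the parameters left trainable (remaining conv + fully connected)."""
    conv_layers = [m for m in model.modules() if isinstance(m, nn.Conv1d)]
    for layer in conv_layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False          # reuse: keep these weights fixed
    return [p for p in model.parameters() if p.requires_grad]
```

The returned trainable parameters would then be passed to an optimizer and trained on the labeled text samples of the target task.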
A machine learning device for realizing a rapid improvement of text classification performance, the device including:
1) an acquiring unit for obtaining a target text data set, in which part of the sample data carries labels;
2) a first selecting unit for selecting a group of existing models relevant to the target text data set;
3) a first determination unit for determining that the performance of the existing models meets the requirements;
4) a second selecting unit for selecting the initial set of text samples to be queried;
5) an assignment unit that selects the text samples that need to be queried by reusing the existing models, filtering out unnecessary queries and saving query cost;
6) a second determination unit that updates the importance weights of the existing models to better filter unnecessary queries;
7) a third determination unit that, according to the machine learning algorithm, trains the final machine learning model on the target text data set.
Through the above technical solution, the present invention provides a machine learning method and device for rapidly improving text classification performance. By reusing existing models on the target text data set, it selects the text samples that most need to be queried and filters out a large number of unnecessary queries, which effectively improves the efficiency of active learning and achieves a rapid performance improvement on the target task at a small query cost.
Detailed description of the invention
Fig. 1 is the flow chart of the active learning method combined with model reuse;
Fig. 2 is the flow chart of filtering the text samples to be queried;
Fig. 3 is the flow chart of updating the weights of the existing models;
Fig. 4 is the block diagram of the machine learning device for rapidly improving text classification performance;
Fig. 5 is the flow chart of the method of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to specific embodiments. These embodiments are intended only to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
The embodiment of the invention provides a machine learning method for rapidly improving text classification performance. The specific steps of the method are shown in Fig. 5 and mainly include:
101. Obtain a target text classification data set.
Part of the sample data in the target text data set carries labels. In the embodiment of the invention, the target text data set can be any of various text data sets such as sentiment analysis or spam detection. Among the multiple sample data contained in the data set there are both labeled data and unlabeled data; in the embodiment of the invention, labeled data are data whose classification results are known, and unlabeled data are data whose classification results are unknown.
In the embodiment of the invention, the target text data set can be obtained by existing acquisition means, for example by setting up an interface dedicated to the target text data and acquiring the target data set through it.
102. Obtain a group of models relevant to the target text classification task.
Here, the group of models relevant to the target text classification task is also referred to as the existing models; these existing models are trained on a large number of labeled samples of related tasks. For example, a text classification task on soccer topics can usually draw on existing topic models for texts on basketball and sports entertainment, and the classification of bank transaction data in less developed regions can draw on existing models from developed regions. Because the data distributions differ, these models often perform poorly when applied directly to the target text task.
Fig. 1 is the flow chart of the active learning method combined with model reuse. The method of the present invention picks out suitable unlabeled text samples and hands them to a domain expert to provide labels, i.e., it queries. The initial selection policy can be any known active learning strategy (a hedged sketch of one such strategy follows). On this basis, model reuse can be combined with the results of active learning to jointly pick out suitable unlabeled text samples. Details are given in the description of step 103 below and the description of Fig. 2.
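Since the description leaves the initial policy open to any known active learning strategy, the following is a hedged sketch of one common choice, uncertainty sampling for a binary scikit-learn-style classifier; the function name, the binary setting, and the predict_proba interface are assumptions, not the patent's prescription.

```python
import numpy as np

def select_most_uncertain(active_model, X_unlabeled, k: int = 1):
    """Pick the k unlabeled text samples whose predicted positive-class
    probability is closest to 0.5, i.e. where the model is least certain."""
    proba = active_model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = -np.abs(proba - 0.5)       # larger value = more uncertain
    return np.argsort(uncertainty)[-k:]      # indices of the k least certain
```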
103. Reuse the existing models to filter out less necessary queries, saving a large amount of query cost.
Fig. 2 shows the process of filtering unnecessary query text samples. The present invention first selects candidate unlabeled text samples based on an existing active learning strategy. On this basis, the prediction confidence of each candidate unlabeled text sample is computed from the existing models, and the probability of querying the expert is given by

θ(x^(t)) = (1 + α(x^(t)))^(-1)    (4)

where x^(t) denotes the t-th candidate unlabeled text sample, ŷ^(t) denotes the label predicted for the sample by the existing models, η_j denotes the weight of the j-th existing model, n^(t) denotes the number of labeled samples acquired so far, f(x^(t)) denotes the prediction of the active learning model on the sample, p_j(ŷ^(t) | x^(t)) denotes the posterior probability of the j-th existing model, i.e., the probability that the existing model classifies the sample correctly, and [[z]] = 1 when the proposition z holds. It can be derived that 0 ≤ α(x^(t)) ≤ 1; α(x^(t)) expresses the prediction confidence that the existing models can correctly classify the text sample x^(t). When the active learning model and the existing models give inconsistent predictions on x^(t), α(x^(t)) = 0. As the number of labeled text samples increases, the confidence of the models increases accordingly. The present invention finally uses θ(x^(t)) as the probability of querying the expert; it decreases as the prediction confidence α(x^(t)) increases. Given θ(x^(t)), a threshold 0 < R < 1 is drawn uniformly at random from the interval (0, 1), and the query decision for the sample is made by comparing the two: the domain expert is queried only when θ(x^(t)) > R; otherwise the label predicted by the existing models is used.
In this way, the algorithm uses the prediction confidence of the existing models to decide whether the label of a sample must be obtained by querying the domain expert.
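The closed-form expression for α(x^(t)) in the original document did not survive page extraction, so the Python sketch below merely combines the ingredients named above (the model weights, the posteriors, the agreement indicator, and growth with the number of labeled samples). The specific combination, the names, and the scale parameter are assumptions for illustration, not the patent's formula.

```python
import numpy as np

def prediction_confidence(posteriors, weights, al_pred, reuse_pred,
                          n_labeled, scale=10.0):
    """Hypothetical stand-in for alpha(x): 0 on disagreement between the
    active-learning model and the existing models, otherwise a weighted
    posterior that grows toward its cap as labeled samples accumulate."""
    if al_pred != reuse_pred:            # inconsistent predictions => alpha = 0
        return 0.0
    weighted = float(np.dot(weights, posteriors))  # sum_j eta_j * p_j(yhat | x)
    growth = 1.0 - np.exp(-n_labeled / scale)      # rises with n^(t), stays < 1
    return weighted * growth                       # stays within [0, 1]
```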
104. Update the importance weights of the existing models based on the criterion of classification error minimization, so as to better filter unnecessary queries.
Fig. 3 shows the process of updating the weights of the existing models. First, the source of the label must be determined. If the label comes from the prediction of the existing models, the existing models are working well and the weights need no update. If the label comes from the domain expert, the weights of the existing models need to be updated to improve their accuracy. Based on the update criterion of classification error minimization, the present invention considers the following optimization problem:

min_η  Σ_j η_j ℓ_j(x^(t)) + λ D_KL(η ‖ η^(t))    (6)

In the above formula, ℓ_j(x^(t)) denotes the loss of the j-th model on the sample x^(t), and η_j denotes the weight of the j-th model. D_KL(η ‖ η^(t)) denotes the Kullback-Leibler (KL) divergence between the current and the previous round's weight distributions; it keeps the weights changing smoothly from round to round and prevents abrupt jumps. λ is the trade-off parameter between the two objectives and is set according to the specific task. The optimal solution of the optimization problem (6) has a direct closed form, written as the expression

η_j^(t+1) ∝ η_j^(t) · exp(−ℓ_j(x^(t)) / λ)    (7)

According to formula (7), the updated weight of the j-th model is obtained. Through this weight update, the present invention increases the weights of the existing models that make a larger contribution to the performance improvement and, at the same time, decreases the weights of the existing models that contribute little to it.
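The following is a minimal sketch of this weight update, assuming the closed form of (7) is the standard exponentiated update implied by the KL-regularized problem (6); the array names are illustrative.

```python
import numpy as np

def update_weights(eta: np.ndarray, losses: np.ndarray, lam: float) -> np.ndarray:
    """Multiplicatively down-weight existing models that incur a large loss
    on the newly labeled sample, then renormalize onto the simplex."""
    new_eta = eta * np.exp(-losses / lam)   # exponentiated update from (6)/(7)
    return new_eta / new_eta.sum()          # weights again sum to one
```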
105. Obtain a more accurate active learning model with the help of reusing the existing models.
As shown in Fig. 3, the present invention selects linear logistic regression as the realization of the model reuse method and learns the model f+ by structural risk minimization, in the form

f+ = argmin_f  Σ_i ℓ(y_i, f(x_i)) + λ Ω(f)    (8)

where y_i denotes the label of a text sample, x_i denotes its features, ℓ is the logistic loss, Ω is the regularization term, and λ is the trade-off parameter. The model f+ obtained by optimizing (8) is the updated active learning model and is used to select the unlabeled text samples of the next round.
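The following is a minimal sketch of refitting the active learning model f+ by regularized logistic regression, assuming scikit-learn is available; the variable names and the mapping C = 1/λ to the trade-off in (8) are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression

def refit_active_model(X_labeled, y_labeled, lam: float = 1.0):
    """Refit f+ on all labeled text samples; the inverse regularization
    strength C = 1/lam plays the role of the trade-off parameter in (8)."""
    model = LogisticRegression(C=1.0 / lam, max_iter=1000)
    model.fit(X_labeled, y_labeled)
    return model
```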
106. Use the final model as the machine learning model on the target text data set.
After the learning process ends, the final model is the machine learning model obtained by the present invention.
As shown in Fig. 4, the machine learning device of the present invention for rapidly improving text classification performance specifically includes:
an acquiring unit 41 for obtaining a target text data set, in which part of the sample data carries labels;
a first selecting unit 42 for selecting a group of existing models relevant to the target text data set;
a first determination unit 43 for determining that the performance of the existing models meets the requirements;
a second selecting unit 44 for selecting the initial set of text samples to be queried;
an assignment unit 45 that selects the text samples that need to be queried by reusing the existing models, filtering out unnecessary queries and saving query cost;
a second determination unit 46 that updates the importance weights of the existing models to better filter unnecessary queries; and
a third determination unit 47 that, according to the machine learning algorithm, trains the final machine learning model on the target text data set.
Further, as shown in Fig. 4, the first determination unit 43 includes:
an obtaining module 431 for determining a group of existing models whose performance meets the requirements; and
a construction module 432 for constructing the performance-improving active learning model on the current text data set.
The present invention solves the problem that the performance of model training improves too slowly in conventional machine learning; it effectively reduces the resource overhead of the training process and improves the utilization efficiency of existing models and labeled samples. The present invention is mainly based on the following observation: existing research shows that when labeled samples are few, model reuse can deliver a performance improvement, but this improvement is limited, because existing model reuse methods acquire labeled samples passively; meanwhile, existing active learning methods do not make use of existing models, so the performance of the active learning model improves slowly or demands a large query cost. The present invention combines model reuse with active learning and proposes an active model reuse method, enabling the target task to obtain a performance improvement quickly.
It is apparent to those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, device and units described above can refer to the corresponding processes in the foregoing method embodiment and are not repeated here.

Claims (5)

1. A machine learning method for rapidly improving text classification performance, characterized by specifically including:
1) obtaining a target text classification data set, in which part of the text samples carry labels;
2) obtaining a group of models relevant to the target text classification task, the performance of these models being limited;
3) selecting the text samples that need to be queried by reusing the existing models, which helps to obtain a more accurate active learning model and saves a large amount of query cost;
4) updating the importance weights of the existing models based on the principle of classification error minimization, so as to better filter unnecessary queries;
5) using the final model as the machine learning model on the target text data set.
2. The machine learning method for rapidly improving text classification performance according to claim 1, characterized in that the obtaining of the target data set includes preprocessing of the target text data set.
3. The machine learning method for rapidly improving text classification performance according to claim 2, characterized in that the query text samples are constructed by reusing the existing models, the existing model reuse methods including: using Adaptive SVM to reuse an SVM model, i.e., using the weights of the existing model as a regularization term to guide the modeling on the training text samples of the target task; using STRUT and SER to reuse a random forest model, i.e., using the structural information of the decision trees and the distribution information of the text data to guide the modeling on the training text samples of the target task; and using deep learning fine-tuning to reuse a deep learning model, i.e., freezing part of the convolutional layers of the existing model and training the remaining convolutional layers and fully connected layers with linear logistic regression to complete the model reuse.
4. The machine learning method for rapidly improving text classification performance according to claim 3, characterized in that the steps of filtering unnecessary queries using the existing models are:
1) selecting the text sample to be queried through active learning, where a query means obtaining the label of the text sample from a domain expert;
2) computing the prediction confidence of the text sample using the existing models;
3) judging whether a query is needed according to the prediction confidence; specifically, if the prediction confidence is above a specified threshold, the label is provided by the existing models; otherwise, the sample is labeled by the domain expert.
5. The machine learning method for rapidly improving text classification performance according to claim 4, characterized in that the updating of the importance weights of the existing models based on labeled text samples means increasing the weights of the existing models that make a larger contribution to the performance improvement while, at the same time, decreasing the weights of the existing models that contribute little to it.
CN201910565455.0A 2019-06-27 2019-06-27 Machine learning method and device for rapidly improving text classification performance Pending CN110263173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565455.0A CN110263173A (en) 2019-06-27 2019-06-27 Machine learning method and device for rapidly improving text classification performance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910565455.0A CN110263173A (en) 2019-06-27 2019-06-27 Machine learning method and device for rapidly improving text classification performance

Publications (1)

Publication Number Publication Date
CN110263173A true CN110263173A (en) 2019-09-20

Family

ID=67922131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565455.0A Pending CN110263173A (en) 2019-06-27 2019-06-27 Machine learning method and device for rapidly improving text classification performance

Country Status (1)

Country Link
CN (1) CN110263173A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541083A (en) * 2020-12-23 2021-03-23 西安交通大学 Text classification method based on active learning hybrid neural network
CN115274375A (en) * 2022-07-25 2022-11-01 东莞市博钺电子有限公司 Fuse encapsulating material and preparation method and application thereof
US12093645B2 (en) 2021-09-14 2024-09-17 International Business Machines Corporation Inter-training of pre-trained transformer-based language models using partitioning and classification



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190920