CN110263173A - Machine learning method and device for rapidly improving text classification performance - Google Patents
- Publication number: CN110263173A
- Application number: CN201910565455.0A
- Authority: CN (China)
- Prior art keywords: model, text, existing, existing model, machine learning
- Prior art date: 2019-06-27
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
Abstract
The invention discloses a machine learning method and device for rapidly improving text classification performance. A group of existing models relevant to the target task is obtained; by reusing these existing models, the method selects only the text samples that genuinely need to be queried and filters out unnecessary queries, which helps obtain a more accurate active learning model and saves a large amount of query cost; the importance weights of the existing models are updated based on labeled text samples so as to filter unnecessary queries even better. The invention is easy to implement and efficient, and can rapidly improve model performance at a small query cost.
Description
Technical field
The present invention relates to a machine learning method and device for rapidly improving text classification performance, and belongs to the technical field of machine learning for text classification.
Background art
With the development of information technology, internet data and resources have grown massive in scale. To effectively manage and use this distributed mass of information, content-based information retrieval and data mining have become fields of growing interest. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text from its content, given a predefined set of category labels. Text classification is widely applied in fields such as natural language processing and understanding, information organization and management, and content-based information filtering.
Text classification methods based on machine learning, which have gradually matured, focus on the automatic pattern-mining and dynamic optimization abilities of the classifier. In both classification quality and flexibility they surpass the earlier text classification approaches based on knowledge engineering and expert systems, and they have become a classic example of research and application in related fields. However, many shortcomings remain. First, training a powerful machine learning model requires a large number of training samples, and collecting large amounts of labeled data is very difficult in many practical tasks. Second, once a model has been trained, it is difficult for it to perform well if the environment of the actual task changes, and simply discarding it wastes resources.
Model reuse aims to reduce the training resources required for the target task and has attracted wide attention in recent years. When labeled samples for the target task are limited, existing model-reuse methods can obtain a significant performance boost. However, previous model-reuse methods acquire labeled samples passively, which limits how quickly the performance of the machine learning model can improve. This does not meet the demand of many practical text tasks, which generally expect model performance to improve quickly.
Summary of the invention
Objective of the invention: aiming at the problems and deficiencies in the prior art, the invention proposes a machine learning method and device for rapidly improving text classification performance, which alleviates the problem of slow performance improvement during the training of machine learning models, effectively reduces the resource overhead of the training process, and improves the utilization of existing models and labeled samples.
Technical solution: a machine learning method for rapidly improving text classification performance, specifically including:
1) obtaining a target text classification dataset, in which part of the text samples are labeled;
2) obtaining a group of models relevant to the target text classification task, the capability of these models being limited;
3) selecting the text samples that need to be queried by reusing the existing models, which helps obtain a more accurate active learning model and saves a large amount of query cost;
4) updating the importance weights of the existing models based on the principle of classification-error minimization, so as to filter unnecessary queries even better;
5) taking the final model as the machine learning model on the target text dataset.
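The five steps above can be read as a single query loop. The following Python sketch is illustrative only: the patent publishes no code, and every name here (`active_learning_with_model_reuse`, the weighted-vote confidence, the multiplicative weight update) is an assumption layered on the description rather than the patented implementation.

```python
import math
import random

def active_learning_with_model_reuse(pool, labeled, existing_models, weights,
                                     oracle, rounds=10, lam=1.0):
    """Illustrative query loop for steps 1)-5).

    pool            -- unlabeled text samples (feature vectors), taken in order
    labeled         -- list of (x, y) pairs, the initially labeled part (step 1)
    existing_models -- callables x -> label, the limited-capability models (step 2)
    weights         -- one importance weight per existing model, summing to 1
    oracle          -- callable x -> true label (the domain expert)
    """
    for _ in range(min(rounds, len(pool))):
        x = pool.pop(0)                       # candidate picked by the base active-learning strategy
        votes = [m(x) for m in existing_models]
        # weighted majority vote of the reused models and its total weight (confidence)
        y_hat = max(set(votes),
                    key=lambda v: sum(w for w, p in zip(weights, votes) if p == v))
        alpha = sum(w for w, p in zip(weights, votes) if p == y_hat)
        theta = 1.0 / (1.0 + alpha)           # query probability, cf. formula (4)
        if random.random() < theta:           # step 3: query only when confidence is low
            y = oracle(x)                     # costly expert label
            losses = [0.0 if p == y else 1.0 for p in votes]
            # step 4: multiplicative update -- models that erred lose weight
            weights = [w * math.exp(-l / lam) for w, l in zip(weights, losses)]
            total = sum(weights)
            weights = [w / total for w in weights]
        else:
            y = y_hat                         # free label from the reused models
        labeled.append((x, y))
    return labeled, weights                   # step 5: train the final model on `labeled`
```

Here the candidate order stands in for a real active-learning selection strategy, and the 0/1 loss stands in for the patent's precision loss; both are simplifications.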
Optionally, the obtained group of existing models relevant to the target text task are models trained on large numbers of labeled samples from existing related datasets. Since the data distributions differ, the performance of these models on the target task is often limited.
Optionally, the steps of filtering unnecessary queries using the existing models are:
1) selecting a text sample to be queried by active learning, where a query here means obtaining the label of the text sample from a domain expert;
2) calculating the prediction confidence of the text sample using the existing models;
3) judging whether a query is needed according to the prediction confidence: specifically, if the prediction confidence is higher than a specified threshold, the label is provided by the existing models; otherwise, the sample is labeled by the domain expert.
Optionally, updating the importance weights of the existing models based on the principle of classification-error minimization means increasing the weights of the existing models that make a larger contribution to the performance improvement while, at the same time, decreasing the weights of the existing models that contribute little.
Optionally, the existing model-reuse methods include, but are not limited to: using Adaptive SVM to reuse an SVM model, i.e., using the weights of the existing model as a regularization term to guide modeling on the training text samples of the target task; using STRUT and SER to reuse a random forest model, i.e., using the structural information of the decision trees and the distribution information of the text data to guide modeling on the training text samples of the target task; and using deep learning fine-tuning to reuse a deep learning model, i.e., freezing part of the convolutional layers of the existing model and training the remaining convolutional layers and fully connected layers with techniques such as linear logistic regression to achieve model reuse.
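The Adaptive-SVM idea in the list above — using the existing model's weights as a regularization term — can be sketched in a few lines. This is a hedged illustration of the general "biased regularization" trick, not the actual Adaptive SVM algorithm; the function name and hyperparameters are invented for illustration.

```python
def fit_with_reuse(samples, w_src, lam=0.05, lr=0.3, epochs=300):
    """Hinge-loss linear model whose regularizer ||w - w_src||^2 pulls the target
    model toward the reused source model instead of toward zero (labels in {-1,+1})."""
    w = list(w_src)                                   # warm-start at the source model
    n = len(samples)
    for _ in range(epochs):
        # gradient of the biased regularizer lam * ||w - w_src||^2
        grad = [2 * lam * (wi - si) for wi, si in zip(w, w_src)]
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:   # hinge subgradient
                for j in range(len(w)):
                    grad[j] -= y * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

With few target samples, the pull toward `w_src` preserves source knowledge on features the target data never exercises, which is exactly what makes the reused model useful when labels are scarce.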
A machine learning device for rapidly improving text classification performance, the device including:
1) an acquiring unit for obtaining a target text dataset, part of whose sample data is labeled;
2) a first selecting unit for selecting a group of existing models relevant to the target text dataset;
3) a first determination unit for determining that the performance of the existing models meets the requirements;
4) a second selecting unit for selecting an initial set of text samples to be queried;
5) an assignment unit that selects the text samples that need to be queried by reusing the existing models, filtering out unnecessary queries and saving query cost;
6) a second determination unit for updating the importance weights of the existing models so as to filter unnecessary queries even better;
7) a third determination unit that trains the final machine learning model on the target text dataset according to a machine learning algorithm.
Through the above technical solution, the present invention provides a machine learning method and device for rapidly improving text classification performance. By reusing existing models on the target text dataset, it selects the text samples that urgently need to be queried and filters out a large number of unnecessary queries, which effectively improves the efficiency of active learning and achieves a fast performance improvement on the target task at a small query cost.
Brief description of the drawings
Fig. 1 is the flow chart of the active learning method combined with model reuse;
Fig. 2 is the flow chart of filtering the text samples to be queried;
Fig. 3 is the flow chart of updating the weights of the existing models;
Fig. 4 is the block diagram of the machine learning device for rapidly improving text classification performance;
Fig. 5 is the flow chart of the method of the present invention.
Specific embodiments
The present invention is further elucidated below with specific embodiments. These embodiments are provided merely to illustrate the present invention and are not intended to limit its scope; after reading the present invention, modifications by those skilled in the art to various equivalent forms of the invention all fall within the scope defined by the claims appended to this application.
An embodiment of the invention provides a machine learning method for rapidly improving text classification performance. The specific steps of the method are shown in Fig. 5 and mainly include:
101. Obtaining a target text classification dataset.
Part of the sample data in the target text dataset is labeled. In embodiments of the invention, the target text dataset can be any of various text datasets, such as sentiment analysis or spam detection. Among the multiple samples contained in the dataset there are both labeled and unlabeled data; in embodiments of the invention, labeled data are data whose classification results are known, and unlabeled data are data whose classification results are unknown.
In embodiments of the invention, the target text dataset can be acquired through existing acquisition means, for example by setting up an interface dedicated to the target text data.
102. Obtaining a group of models relevant to the target text classification task.
Here, the group of models relevant to the target text classification task is also referred to as the existing models; these existing models are trained on large numbers of labeled samples from related tasks. For example, a football-themed text classification task can usually draw on topic models for basketball, sports, or entertainment texts to provide existing models; classification of bank transaction data from less developed regions can draw on existing models from developed regions for help. Since the data distributions differ, directly applying these models to the target text task often yields poor performance.
Fig. 1 is the flow chart of the active learning method combined with model reuse. The method of the present invention picks out suitable unlabeled text samples and hands them to a domain expert to label, i.e., to query. The initial selection policy can be any known active learning strategy. On this basis, model reuse is combined with the result of active learning to jointly pick out suitable unlabeled text samples. For details, see the description of step 103 below and the description of Fig. 2.
103. Reusing the existing models to filter less necessary queries, saving a large amount of query cost.
Fig. 2 shows the process of filtering unnecessary query text samples. The invention first selects a candidate unlabeled text sample based on an existing active learning strategy. On this basis, the prediction confidence of the candidate unlabeled text sample is calculated from the existing models, and the querying probability is

θ(x^(t)) = (1 + α(x^(t)))^(-1)    (4)

where x^(t) denotes the t-th candidate unlabeled text sample, ŷ_j^(t) denotes the predicted label that the text sample obtains from the j-th existing model, η_j denotes the weight of the j-th existing model, n_t denotes the number of labeled samples acquired so far, f(x^(t)) denotes the prediction of the active learning model on the sample x^(t), and p_j denotes the posterior probability of the j-th existing model. The indicator [[z]] = 1 when the proposition z is true, and is used to express the probability that the existing models classify correctly. It can be derived from the above that 0 ≤ α(x^(t)) ≤ 1; α(x^(t)) expresses the prediction confidence that the existing models can correctly classify the text sample x^(t). When the active learning model and the existing models predict inconsistently on x^(t), α(x^(t)) = 0. As the number of labeled text samples increases, the confidence of the models increases accordingly. The invention finally uses θ(x^(t)) as the probability of querying the expert; it decreases as the prediction confidence α(x^(t)) increases, i.e., θ(x^(t)) ∝ (α(x^(t)))^(-1). Given θ(x^(t)), the invention draws a threshold 0 < R < 1 uniformly at random from the interval [0, 1], and the query decision (5) is made by comparing R with θ(x^(t)): the domain expert is queried when R < θ(x^(t)); otherwise the label is provided by the prediction of the existing models. In this way, the algorithm decides, according to the prediction confidence of the existing models, whether the label of a sample needs to be obtained by querying the domain expert.
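Under the assumption that α(x) is already available as a number in [0, 1], the randomized querying rule built from formula (4) can be sketched as follows (the function names are illustrative, not from the patent):

```python
import random

def query_probability(alpha):
    """theta(x) = (1 + alpha(x))**-1 from formula (4): the more confident the
    existing models are (alpha -> 1), the less likely the expert is queried."""
    assert 0.0 <= alpha <= 1.0
    return 1.0 / (1.0 + alpha)

def should_query_expert(alpha, rng=random):
    """Draw the random threshold R in [0, 1] and query the domain expert iff R < theta(x)."""
    return rng.random() < query_probability(alpha)
```

Note that with formula (4) as printed, θ never drops below 1/2, so even at full confidence the expert is still queried about half the time; the patent's α itself is tied to the agreement between the reused models and the active learner, with α = 0 whenever they disagree.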
104. Updating the importance weights of the existing models based on the criterion of classification-error minimization, so as to filter unnecessary queries even better.
Fig. 3 shows the process of updating the weights of the existing models. First, the source of the label must be judged. If the label came from the prediction of the existing models, the existing models are working well and their weights need not be updated. If the label came from the domain expert, the weights of the existing models need to be updated to improve their accuracy. Based on the update criterion of classification-error minimization, the invention considers the following optimization problem:

min_η Σ_j η_j ℓ_j(x^(t)) + λ D_KL(η ‖ η^(t))    (6)

In the above formula, ℓ_j(x^(t)) denotes the precision loss of the j-th model on the sample x^(t), and η_j denotes the weight of the j-th model. D_KL(η ‖ η^(t)) denotes the Kullback-Leibler (KL) distance between the current and the previous round's weight distributions; it is used to make the weights change smoothly in each round and to prevent abrupt jumps. λ is a trade-off parameter between the two objectives and is set according to the specific task. The optimal solution of the optimization problem (6) has a direct form, written as the expression

η_j^(t+1) = η_j^(t) exp(−ℓ_j(x^(t))/λ) / Σ_k η_k^(t) exp(−ℓ_k(x^(t))/λ)    (7)

According to formula (7), the updated weight of the j-th model is obtained. Through this weight update, the invention increases the weights of the existing models that make a larger contribution to the performance improvement while, at the same time, decreasing the weights of the existing models that contribute little.
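A hedged rendering of the closed-form update (7): this is the standard exponentiated-weights solution of a linear-loss-plus-KL objective like (6), written under the assumption that the per-model losses are already computed; it is not lifted from the patent's own code.

```python
import math

def update_model_weights(weights, losses, lam=1.0):
    """eta_j^(t+1) proportional to eta_j^(t) * exp(-loss_j / lam),
    renormalized over the simplex; larger lam means smoother, smaller steps."""
    unnorm = [w * math.exp(-l / lam) for w, l in zip(weights, losses)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Because the update is multiplicative and then renormalized, equal losses leave the weights unchanged, which matches the role of the KL term: only relative differences in loss move the distribution.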
105. Obtaining a more accurate active learning model with the help of reusing the existing models.
As shown in Fig. 3, the invention selects linear logistic regression as the realization of the model-reuse method, and learns the model f+ by structural risk minimization:

f+ = argmin_f Σ_i ℓ(f(x̂_i), ŷ_i) + λ Ω(f)    (8)

where ŷ denotes the label of a text sample, x̂ denotes the features of a text sample, ℓ is the logistic loss, Ω is the regularization term, and λ is the trade-off parameter. The model f+ obtained by optimizing formula (8) serves as the updated active learning model, which is used for selecting the unlabeled text samples of the next round.
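Formula (8) with linear logistic regression and a squared-norm Ω can be sketched as plain gradient descent. The specific choice of Ω, the step size, and the stopping rule below are illustrative assumptions, since the patent fixes none of them:

```python
import math

def fit_logistic(samples, lam=0.01, lr=0.5, epochs=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * <w, x_i>)) + lam * ||w||^2
    for labels y in {-1, +1} -- a linear instance of structural risk minimization."""
    dim = len(samples[0][0])
    n = len(samples)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2.0 * lam * wi for wi in w]           # gradient of the Omega term
        for x, y in samples:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))      # d/dw of log(1 + exp(-margin))
            for j in range(dim):
                grad[j] += coef * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

The λ Ω(f) term trades training fit against model complexity, which is what keeps the active learner stable while the labeled pool is still small.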
106. Taking the final model as the machine learning model on the target text dataset.
After the learning process ends, the final model is the machine learning model obtained by the present invention.
As shown in Fig. 4, the machine learning device of the present invention for rapidly improving text classification performance specifically includes:
an acquiring unit 41 for obtaining a target text dataset, part of whose sample data is labeled;
a first selecting unit 42 for selecting a group of existing models relevant to the target text dataset;
a first determination unit 43 for determining that the performance of the existing models meets the requirements;
a second selecting unit 44 for selecting an initial set of text samples to be queried;
an assignment unit 45 that selects the text samples that need to be queried by reusing the existing models, filtering out unnecessary queries and saving query cost;
a second determination unit 46 for updating the importance weights of the existing models so as to filter unnecessary queries even better;
a third determination unit 47 that trains the final machine learning model on the target text dataset according to a machine learning algorithm.
Further, as shown in Fig. 4, the first determination unit 43 includes:
an obtaining module 431 for determining a group of existing models whose performance meets the requirements;
a construction module 432 for constructing the performance-improving active learning model on the current text dataset.
The present invention solves the problem that performance improves too slowly during model training in conventional machine learning; it effectively reduces the resource overhead of the training process and improves the utilization of existing models and labeled samples. The invention is mainly based on the following observation: existing research shows that when labeled samples are few, model reuse can obtain a performance improvement, but this improvement is limited, because existing model-reuse methods acquire labeled samples passively; meanwhile, existing active learning methods do not make use of existing models, which results in slow performance improvement of the active learning model, or requires a large query cost. The present invention combines model reuse with active learning and proposes an active model-reuse method, enabling the target task to obtain performance improvements quickly.
It is apparent to those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, device, and units described above can refer to the corresponding processes in the foregoing method embodiments, and details are not repeated herein.
Claims (5)
1. A machine learning method for rapidly improving text classification performance, characterized by specifically including:
1) obtaining a target text classification dataset, in which part of the text samples are labeled;
2) obtaining a group of models relevant to the target text classification task, the performance of these models being limited;
3) selecting the text samples that need to be queried by reusing the existing models, which helps obtain a more accurate active learning model and saves a large amount of query cost;
4) updating the importance weights of the existing models based on the principle of classification-error minimization, so as to filter unnecessary queries even better;
5) taking the final model as the machine learning model on the target text dataset.
2. The machine learning method for rapidly improving text classification performance according to claim 1, characterized in that the obtaining of the target dataset includes preprocessing the target text dataset.
3. The machine learning method for rapidly improving text classification performance according to claim 2, characterized in that query text samples are constructed by reusing existing models, and the existing model-reuse methods include: using Adaptive SVM to reuse an SVM model, i.e., using the weights of the existing model as a regularization term to guide modeling on the training text samples of the target task; using STRUT and SER to reuse a random forest model, i.e., using the structural information of the decision trees and the distribution information of the text data to guide modeling on the training text samples of the target task; and using deep learning fine-tuning to reuse a deep learning model, i.e., freezing part of the convolutional layers of the existing model and training the remaining convolutional layers and fully connected layers with linear logistic regression to achieve model reuse.
4. The machine learning method for rapidly improving text classification performance according to claim 3, characterized in that the steps of filtering unnecessary queries using the existing models are:
1) selecting a text sample to be queried by active learning, where a query here means obtaining the label of the text sample from a domain expert;
2) calculating the prediction confidence of the text sample using the existing models;
3) judging whether a query is needed according to the prediction confidence: if the prediction confidence is higher than a specified threshold, the label is provided by the existing models; otherwise, the sample is labeled by the domain expert.
5. The machine learning method for rapidly improving text classification performance according to claim 4, characterized in that updating the importance weights of the existing models based on labeled text samples means increasing the weights of the existing models that make a larger contribution to the performance improvement while, at the same time, decreasing the weights of the existing models that contribute little.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910565455.0A | 2019-06-27 | 2019-06-27 | Machine learning method and device for rapidly improving text classification performance |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN110263173A | 2019-09-20 |
Family
ID=67922131

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date | Status |
|---|---|---|---|---|
| CN201910565455.0A | Machine learning method and device for rapidly improving text classification performance | 2019-06-27 | 2019-06-27 | Pending |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN110263173A (en) |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112541083A | 2020-12-23 | 2021-03-23 | 西安交通大学 | Text classification method based on active learning hybrid neural network |
| CN115274375A | 2022-07-25 | 2022-11-01 | 东莞市博钺电子有限公司 | Fuse encapsulating material and preparation method and application thereof |
| US12093645B2 | 2021-09-14 | 2024-09-17 | International Business Machines Corporation | Inter-training of pre-trained transformer-based language models using partitioning and classification |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-09-20 |