CN106126751A - Classification method and apparatus with temporal adaptability - Google Patents
Classification method and apparatus with temporal adaptability
- Publication number
- CN106126751A CN106126751A CN201610685180.0A CN201610685180A CN106126751A CN 106126751 A CN106126751 A CN 106126751A CN 201610685180 A CN201610685180 A CN 201610685180A CN 106126751 A CN106126751 A CN 106126751A
- Authority
- CN
- China
- Prior art keywords
- sample
- training
- property set
- classification
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
Disclosed are a classification method and apparatus with temporal adaptability. The method trains a base classifier using a labeled historical sample set as training samples, and iteratively labels a predetermined current-stage unlabeled sample set with that base classifier. On that basis, the historical sample set is combined with the newly labeled current-stage samples to train a target classifier, which can subsequently classify samples to be tested. Because current-stage samples are added to the historical sample set when training the target classifier, the features of current-stage samples are considered alongside the labeled historical samples, so the finally trained classifier adapts to current-stage classification tasks and has better temporal adaptability. Moreover, because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.
Description
Technical field
The invention belongs to the technical fields of natural language processing and pattern recognition, and in particular relates to a classification method and apparatus with temporal adaptability.
Background art
With the rapid development of the Internet, online transactions are increasingly common, and the number of product reviews on the network keeps growing, forming a massive body of review text. This text typically carries a clear emotional color and is of great value: performing sentiment analysis and research on it can effectively support decision-making by enterprises, governments, and individuals.
Sentiment classification is an important task in sentiment analysis; it classifies text mainly according to the opinions and attitudes expressed by the author or reviewer. However, because language evolves continuously, the way sentiment is expressed often differs across time periods. Taking product reviews as an example, some old words are used less and less in the latest review texts and may even fade away, while new sentiment-bearing words may appear. The review texts of different time periods therefore differ greatly in word distribution, which harms the temporal adaptability of sentiment classification: when a classifier trained on previously labeled text is used to classify text produced at the current stage, its classification accuracy drops.
For this reason, most current sentiment-classification research assumes that the training set and the test set come from the same time period. But this approach requires labeling tasks such as expert annotation on current-stage samples, which greatly increases the workload of current-stage sample labeling. How to make full use of previously labeled samples to perform sentiment classification on current-stage text to be tested while guaranteeing relatively high accuracy — that is, how to give sentiment classification good temporal adaptability — has therefore become a research hotspot in this field.
Summary of the invention
In view of this, it is an object of the present invention to provide a classification method and apparatus with temporal adaptability, aiming to solve the above problem of existing sentiment-classification approaches so that sentiment classification achieves better temporal adaptability.
To this end, the present invention discloses the following technical solutions:
A classification method with temporal adaptability, comprising:
training a base classifier using a labeled historical sample set as training samples;
classifying part of the samples in a predetermined current-stage unlabeled sample set with the base classifier, obtaining partial samples with class labels;
taking, as new training samples, the labeled historical sample set together with those labeled partial samples whose confidence exceeds a predetermined threshold, and iteratively performing the training, the classification, and the updating of the training samples, until every sample in the unlabeled sample set has a corresponding class label;
training a target classifier based on the labeled historical sample set and all the samples with class labels obtained by labeling the unlabeled sample set, so that samples to be tested can be classified by the target classifier.
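The four claimed steps can be sketched as a minimal, runnable loop. This is a hedged illustration, not the claimed implementation: the majority-vote stand-in classifier, the fixed confidence value, and all names are assumptions introduced for the sketch.

```python
# Illustrative sketch of the claimed steps (train -> classify -> update -> retrain).
# A trivial majority-vote "classifier" stands in for a real one so the loop runs
# without external libraries; all names here are hypothetical.

def train(samples):
    """Train a stand-in classifier: always predict the majority label."""
    labels = [lab for _, lab in samples]
    majority = max(set(labels), key=labels.count)
    return lambda text: (majority, 0.9)  # (predicted label, confidence)

def classify_with_temporal_adaptation(historical, unlabeled, threshold=0.8):
    labeled_now = []                      # current-stage samples labeled so far
    pool = list(unlabeled)
    clf = train(historical)               # step 1: base classifier
    while pool:
        batch, pool = pool[:2], pool[2:]  # step 2: label part of the pool
        predictions = [(text, *clf(text)) for text in batch]
        # step 3: keep only high-confidence predictions as new training data
        confident = [(t, lab) for t, lab, conf in predictions if conf >= threshold]
        labeled_now.extend(confident)
        clf = train(historical + labeled_now)   # retrain on the enlarged set
    return train(historical + labeled_now)      # step 4: target classifier

target = classify_with_temporal_adaptation(
    [("old positive review", "pos"), ("old negative review", "neg"),
     ("another old positive", "pos")],
    ["new review one", "new review two"])
print(target("a test review")[0])  # pos
```

In the patent's scheme the retraining is done per property set by co-training (see Embodiment 1); this sketch collapses that into a single training set purely for brevity.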
In the above method, preferably, training a base classifier using the labeled historical sample set as training samples comprises:
dividing the historical sample set into two property sets, a first property set and a second property set, wherein the intersection of the first property set and the second property set is empty, and their union is the historical sample set;
training a first base classifier on the first property set;
training a second base classifier on the second property set.
In the above method, preferably, classifying part of the samples in the predetermined current-stage unlabeled sample set with the base classifier to obtain partial samples with class labels comprises:
classifying a first partial sample among the partial samples with the first base classifier, obtaining a first partial sample with class labels;
classifying a second partial sample among the partial samples with the second base classifier, obtaining a second partial sample with class labels.
In the above method, preferably, taking as new training samples the labeled historical sample set together with those labeled partial samples whose confidence exceeds the threshold, and iteratively performing the training, the classification, and the updating of the training samples, comprises:
adding those samples in the first partial sample with class labels whose confidence exceeds the predetermined threshold to the first property set, obtaining a new first property set;
adding those samples in the second partial sample with class labels whose confidence exceeds the predetermined threshold to the second property set, obtaining a new second property set;
taking the new first property set and the new second property set as new training samples, and iteratively performing the training, the classification, and the updating of the training samples.
In the above method, preferably, the method further comprises:
verifying the classification accuracy of the target classifier based on the predicted categories and the actual categories of the samples to be tested.
A classification apparatus with temporal adaptability, comprising:
a base-classifier training module, configured to train a base classifier using a labeled historical sample set as training samples;
a label-annotation module, configured to classify part of the samples in a predetermined current-stage unlabeled sample set with the base classifier, obtaining partial samples with class labels;
an iteration module, configured to take as new training samples the labeled historical sample set together with those labeled partial samples whose confidence exceeds a predetermined threshold, and to iteratively perform the training, the classification, and the updating of the training samples, until every sample in the unlabeled sample set has a corresponding class label;
a target-classifier training module, configured to train a target classifier based on the labeled historical sample set and all the samples with class labels obtained by labeling the unlabeled sample set, so that samples to be tested can be classified by the target classifier.
In the above apparatus, preferably, the base-classifier training module comprises:
a division unit, configured to divide the historical sample set into two property sets, a first property set and a second property set, wherein the intersection of the two property sets is empty and their union is the historical sample set;
a first training unit, configured to train a first base classifier on the first property set;
a second training unit, configured to train a second base classifier on the second property set.
In the above apparatus, preferably, the label-annotation module comprises:
a first labeling unit, configured to classify a first partial sample among the partial samples with the first base classifier, obtaining a first partial sample with class labels;
a second labeling unit, configured to classify a second partial sample among the partial samples with the second base classifier, obtaining a second partial sample with class labels.
In the above apparatus, preferably, the iteration module comprises:
a first adding unit, configured to add those samples in the first partial sample with class labels whose confidence exceeds the predetermined threshold to the first property set, obtaining a new first property set;
a second adding unit, configured to add those samples in the second partial sample with class labels whose confidence exceeds the predetermined threshold to the second property set, obtaining a new second property set;
an iteration unit, configured to take the new first property set and the new second property set as new training samples, and to iteratively perform the training, the classification, and the updating of the training samples.
In the above apparatus, preferably, the apparatus further comprises:
an accuracy-verification module, configured to verify the classification accuracy of the target classifier based on the predicted categories and the actual categories of the samples to be tested.
As can be seen from the above solutions, the disclosed classification method with temporal adaptability trains a base classifier using a labeled historical sample set as training samples, iteratively labels a predetermined current-stage unlabeled sample set with that base classifier, and, on that basis, trains a target classifier by combining the historical sample set with the newly labeled current-stage samples; the target classifier can subsequently classify samples to be tested. Because current-stage samples are added to the historical sample set when training the target classifier, the features of current-stage samples are considered alongside the labeled historical samples, so the finally trained classifier adapts to current-stage classification tasks and has better temporal adaptability; and because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.
Brief description of the drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flow chart of the classification method with temporal adaptability provided by Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the classification method with temporal adaptability provided by Embodiment 2 of the present invention;
Figs. 3-4 are structural diagrams of the classification apparatus with temporal adaptability provided by Embodiment 3 of the present invention.
Detailed description of the invention
For ease of reference and understanding, the technical terms, abbreviations, and acronyms used below are explained as follows:
Semi-supervised learning (Semi-Supervised Learning, SSL): an important problem in pattern recognition and machine learning, and a learning approach that combines supervised learning with unsupervised learning. It mainly concerns how to train and classify using a small amount of labeled samples and a large amount of unlabeled samples, and is broadly divided into semi-supervised classification, semi-supervised regression, semi-supervised clustering, and semi-supervised dimensionality reduction.
Unigram model: unary word features; for example, the sentence "How to tell whether Qin goat milk powder is genuine" is segmented into the words 'Qin', 'goat', 'milk powder', 'how', 'tell', 'genuine or fake'.
Machine-learning classification methods (Classification Methods Based on Machine Learning): statistical learning methods for constructing classifiers, whose input is a vector representing a sample and whose output is the sample's class label. Depending on the learning algorithm, common classification methods include naive Bayes, maximum-entropy classification, and support vector machines; the present invention uses maximum-entropy classification.
Temporal adaptability (time adaptation): when examining the sentiment polarity of text produced at the current stage, using previously labeled text of the same domain — rather than labeled current-stage text — as training samples to predict the sentiment of current text.
Sentiment classification (sentiment classification): classifying given text into the correct sentiment category according to its sentiment polarity; in general, the categories include positive evaluation and negative evaluation.
Data extraction (Data Extraction): obtaining, from a large amount of cluttered data, the data of each category and each time period. The data required by the present invention are separated by a long time interval, so data from before 2002 and data from after 2012 may be selected as experimental data. This requires filtering out unwanted data by program and storing the selected useful data on a local computer.
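The data-extraction step just described can be sketched as a simple year filter. This is a hedged illustration: the field names and the inclusive/exclusive semantics of the cutoff years are assumptions, since the patent does not specify them.

```python
# Illustrative sketch of data extraction: split raw reviews into a pre-2002
# (historical) set and a post-2012 (current-stage) set, filtering out
# everything in between. The dict field names are hypothetical.

def split_by_period(reviews, early_cutoff=2002, late_cutoff=2012):
    historical, current = [], []
    for review in reviews:
        if review["year"] <= early_cutoff:
            historical.append(review)
        elif review["year"] >= late_cutoff:
            current.append(review)
        # reviews between the two cutoffs are discarded
    return historical, current

reviews = [{"year": 2001, "text": "great phone"},
           {"year": 2007, "text": "average"},
           {"year": 2014, "text": "works well"}]
hist, curr = split_by_period(reviews)
print(len(hist), len(curr))  # 1 1
```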
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment one
Embodiment 1 of the present application discloses a classification method with temporal adaptability. The method is applicable to, but not limited to, sentiment-polarity classification of text data. Referring to the flow chart shown in Fig. 1, the method may comprise the following steps:
S101: training a base classifier using the labeled historical sample set as training samples.
This embodiment illustrates the method with sentiment classification as the example; specifically, it proposes a co-training-based temporally adaptive sentiment-classification scheme.
Co-training (Co-training) is a commonly used semi-supervised learning method, first proposed by Blum and Mitchell in the 1990s. It assumes that a given labeled sample set has two sufficiently redundant views, i.e., two property sets satisfying the following conditions: each property set can describe the problem well — that is, if the training set is sufficient, a strong classifier can be trained on each property set — and the two property sets are conditionally independent of each other. The basic idea is: construct two classifiers on the two views from the labeled samples; use the two classifiers to classify, and thereby label, the unlabeled samples; select some high-confidence samples from each classifier's newly labeled samples and add them to the corresponding labeled sample set; retrain the two classifiers using the two updated labeled sample sets as training sets; and iterate this process until a termination condition is met.
The basic concept of the co-training-based temporally adaptive sentiment-classification method of the present invention is: extract review texts of several topics from different time periods, including labeled text from the historical period and unlabeled text from the current stage; then train a sentiment classifier with temporal adaptability by co-training on the labeled historical-period text and the unlabeled current-stage text; this sentiment classifier is subsequently used to classify the current-stage samples to be tested.
The sample data required by the present invention are separated by a long time interval; accordingly, the present invention selects text data from before 2002 and text data from after 2012 as the sample data of this embodiment. Specifically, a data-extraction program extracts positive- and negative-class reviews of four topics from Amazon product reviews — electronic, kitchen, movies, and video — selecting for each topic the reviews produced before 2002 (the labeled historical samples) and after 2012 (the current-stage unlabeled samples), with 2,000 positive and 2,000 negative reviews extracted per topic per period, i.e., 8,000 reviews per topic and 32,000 reviews in total across the four topics. Examples of the extracted review texts of the four topics are given in Table 1 below.
Table 1
For each topic, the data are partitioned as follows: the labeled pre-2002 texts of the topic serve as the initial training samples; 3,200 of the post-2012 review texts of the topic serve as the unlabeled samples used for co-training; and the remaining 800 post-2012 samples of the topic serve as the current-stage samples to be tested, used to verify the accuracy of the finally trained target classifier.
On the basis of the above, step S101 specifically follows the co-training idea: the extracted pre-2002 labeled text, i.e., the historical sample set, is used as the initial training samples, and these training samples are divided into two property sets, a first property set and a second property set. In general, the division may give the two property sets comparable numbers of samples; the intersection of the first property set and the second property set is empty, and their union is the historical sample set.
Afterwards, a classifier is trained on each property set, thereby obtaining two base classifiers from the initial training samples: a first base classifier and a second base classifier.
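In outline, step S101 might look as follows. This is a sketch under stated assumptions: the even random split and the majority-vote stand-in classifier are illustrative choices, not the patent's actual partitioning or learning algorithm.

```python
import random

# Sketch of S101: partition the labeled historical set into two disjoint
# property sets whose union is the full set, then train one (stand-in)
# classifier per property set.

def split_property_sets(samples, seed=0):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic for the example
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # disjoint; union == samples

def train_majority(samples):
    """Stand-in classifier: always predict the majority label of its view."""
    labels = [lab for _, lab in samples]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority

historical = [("good", "pos"), ("bad", "neg"), ("fine", "pos"), ("poor", "neg")]
view1, view2 = split_property_sets(historical)
clf1, clf2 = train_majority(view1), train_majority(view2)
print(sorted(view1 + view2) == sorted(historical))  # True: the views partition the set
```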
S102: classifying part of the samples in the predetermined current-stage unlabeled sample set with the base classifiers, obtaining partial samples with class labels.
Next, the base classifiers are used to classify, and thereby label, part of the current-stage unlabeled samples that will participate in co-training. Specifically, the first base classifier performs sentiment classification on part of the 3,200 post-2012 samples of each topic, yielding sentiment-polarity confidence values for the classified samples (the probability of positive polarity and the probability of negative polarity); the polarity with the higher confidence is taken as the sample's class label, thereby labeling this part of the samples.
Meanwhile, the second base classifier performs sentiment classification on another part of the remaining unlabeled samples among the 3,200 post-2012 samples of each topic, yielding sentiment-polarity confidences for the classified samples, thereby classifying this part of the samples and labeling them with class labels.
The present invention represents text by TF (Term Frequency) vectors: each component of a text's vector is the frequency with which the corresponding word occurs in the text, and the text vector serves as the input of the classifier implemented by the machine-learning classification method. In this embodiment, the TF vectors are built from the unary word features (unigrams) of the text; unigram examples for the texts of the four topics above are given in Table 2 below.
Table 2
Topic | Unigram feature example
electronic | 'This', 'case', 'is', 'junk'
kitchen | 'Well', 'made', 'tough', 'and', 'strong'
movie | 'Would', 'have', 'been', '5', 'star', 'but', 'read', 'book'
video | 'If', 'you're', 'into', 'retro', 'gaming'
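A minimal term-frequency vectorizer over unigram features like those in Table 2 could be sketched as follows; whitespace tokenization and the vocabulary construction are simplifying assumptions.

```python
# Minimal TF vectorizer: each component of a text's vector is the count of
# the corresponding vocabulary word in that text, matching the TF
# representation described above.

def build_vocab(texts):
    vocab = sorted({word for text in texts for word in text.split()})
    return {word: i for i, word in enumerate(vocab)}

def tf_vector(text, vocab):
    vec = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

texts = ["This case is junk", "Well made tough and strong"]
vocab = build_vocab(texts)
print(tf_vector("This case is junk junk", vocab))  # [1, 0, 0, 1, 1, 2, 0, 0, 0]
```

Here the component for 'junk' is 2 because the word occurs twice; all other counts are 0 or 1.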
On this basis, the classifier of this embodiment uses maximum-entropy classification to classify the sample texts. Maximum-entropy classification is based on maximum-entropy information theory; its basic idea is to build a model for all known factors and exclude all unknown factors — that is, to find a probability distribution that satisfies all known facts while leaving unknown factors randomized. Compared with naive Bayes, its greatest advantage is exactly that it does not require conditional independence between features; it is therefore suitable for fusing various different features without considering their mutual influence.
Under the maximum-entropy model, the conditional probability P(c|D) is predicted by:

P(c|D) = exp( Σ_k λ_{k,c} · F_{k,c}(D, c) ) / Z(D)

where Z(D) is the normalization factor, and F_{k,c} is a feature function, defined as:

F_{k,c'}(D, c) = n_k(D) if c = c', and 0 otherwise.

λ_{k,c} is the parameter of each feature function F_{k,c} in the model and controls the feature's weight in the overall formula. Z(D) is the normalization factor conditioned on the observation sequence D (the set of all distinct words in the data, i.e., the dictionary); its purpose is to decompose a complicated joint distribution into a product of factors — in essence, Z(D) balances the values of the conditional probability distribution P(c_i|D) of each class given D, and learning a maximum-entropy model amounts to estimating these parameters relating c and D. n_k(d) denotes the number of times the word d in the feature dictionary D occurs in a review. c' denotes the context words of the currently predicted word c; for example, for the review "I like this product", when predicting the probability of the word "like", the words "I", "this", and "product" are the context of "like".
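The prediction formula above can be checked numerically. The following is a minimal sketch with made-up weights and features — an illustration of the exponential-then-normalize computation only, not the patent's trained model.

```python
import math

# Numeric sketch of the maximum-entropy prediction:
#   P(c|D) = exp( sum_k lambda_{k,c} * F_{k,c}(D, c) ) / Z(D)
# where Z(D) normalizes the scores over all classes.

def maxent_predict(features, weights, classes):
    """features: dict word -> count; weights: dict (word, class) -> lambda."""
    scores = {c: math.exp(sum(weights.get((w, c), 0.0) * n
                              for w, n in features.items()))
              for c in classes}
    z = sum(scores.values())              # normalization factor Z(D)
    return {c: s / z for c, s in scores.items()}

# Made-up weights: the word "great" votes for positive polarity.
weights = {("great", "pos"): 1.2, ("great", "neg"): -0.8}
probs = maxent_predict({"great": 1}, weights, ["pos", "neg"])
print(max(probs, key=probs.get))  # pos
```

The probabilities sum to 1 by construction of Z(D), and the polarity with the higher probability would be taken as the class label, as described in S102.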
S103: taking, as new training samples, the labeled historical sample set together with those labeled partial samples whose confidence exceeds the predetermined value, and iteratively performing the training, the classification, and the updating of the training samples, until every sample in the unlabeled sample set has a corresponding class label.
From the current-stage samples classified, and thereby label-annotated, by the first base classifier, a portion of higher-confidence samples is selected: specifically, P positive-polarity samples and N negative-polarity samples with higher confidence are screened out (the screening may be realized by setting a minimum confidence threshold), and the screened samples are added to the second property set to constitute new training samples, P and N being natural numbers.
Similarly, from the current-stage samples classified and label-annotated by the second base classifier, a portion of higher-confidence samples is selected — again P positive-polarity samples and N negative-polarity samples with higher confidence — and the screened samples are added to the first property set to constitute new training samples.
Thereafter, classifiers are retrained on the new training samples corresponding to the two mutually independent property sets, and the two newly trained classifiers continue, by the above process, to classify and label the remaining unlabeled current-stage samples. As long as the current-stage samples participating in co-training have not all been labeled, the above process — training the classifiers and using the current classifiers to label unlabeled samples — is iterated, until all the current-stage unlabeled samples that need to participate in co-training have been labeled.
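The selection rule in S103 — keep the P positive and N negative auto-labeled samples with the highest confidence above a threshold — can be sketched as follows; the triple layout and names are illustrative.

```python
# Sketch of the S103 screening rule: from newly auto-labeled samples, keep
# the top-P positive and top-N negative samples by confidence, subject to a
# minimum confidence threshold.

def select_confident(labeled, p, n, threshold=0.7):
    """labeled: list of (text, label, confidence) triples."""
    pos = sorted((s for s in labeled if s[1] == "pos" and s[2] >= threshold),
                 key=lambda s: s[2], reverse=True)[:p]
    neg = sorted((s for s in labeled if s[1] == "neg" and s[2] >= threshold),
                 key=lambda s: s[2], reverse=True)[:n]
    return pos + neg

labeled = [("a", "pos", 0.95), ("b", "pos", 0.60), ("c", "neg", 0.90),
           ("d", "neg", 0.75), ("e", "pos", 0.80)]
picked = select_confident(labeled, p=1, n=2)
print([s[0] for s in picked])  # ['a', 'c', 'd']
```

Sample "b" is dropped by the 0.7 threshold, and "e" by the top-P cut: only the single most confident positive survives with p=1.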
S104: training a target classifier based on the labeled historical sample set and all the samples with class labels obtained by labeling the unlabeled sample set, so that samples to be tested can be classified by the target classifier.
Once all the current-stage samples that need to participate in co-training have been labeled — i.e., once all 3,200 post-2012 samples of each topic above have been labeled — this step uses the labeled historical sample set together with the labeled current-stage sample set as training samples to co-train the target classifier. For the four-topic example of this embodiment, the extracted pre-2002 labeled samples and the 3,200 now-labeled post-2012 samples of each topic jointly serve as training samples to co-train a target classifier with temporal adaptability. On this basis, the target classifier can subsequently classify the current-stage samples to be tested — specifically, the 800 samples reserved per topic from after 2012. Because the target classifier fully considers the features of current-stage samples during construction, it achieves higher classification accuracy when classifying current-stage samples.
As can be seen from the above solution, the classification method with temporal adaptability disclosed in the present application trains a base classifier using the labeled historical sample set as training samples, iteratively labels the predetermined current-stage unlabeled sample set with that base classifier, and, on that basis, trains a target classifier by combining the historical sample set with the newly labeled current-stage samples; the target classifier can subsequently classify samples to be tested. Because current-stage samples are added to the historical sample set when training the target classifier, the features of current-stage samples are considered alongside the labeled historical samples, so the finally trained classifier adapts to current-stage classification tasks and has better temporal adaptability; and because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.
Embodiment two
In this Embodiment 2, referring to the flow chart of the classification method with temporal adaptability shown in Fig. 2, the method may further comprise the following step:
S105: verifying the classification accuracy of the target classifier based on the predicted categories and the actual categories of the samples to be tested.
This embodiment verifies the accuracy of the target classifier obtained by co-training in embodiment one. In the four-topic example provided in this application, the 800 review texts reserved in each topic from after 2012 serve as test samples, and the target classifier obtained by co-training is used to classify them. The labels assigned in this classification are then compared with the actual classes of the 800 reserved review texts of each topic (identical means a correct classification, different means a classification error), which yields the accuracy rate of the target classifier and thereby verifies its accuracy.
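This verification step amounts to counting, over the reserved test samples, how often the assigned label equals the actual class. A minimal sketch (the function name and example labels are illustrative, not data from this application):

```python
def classification_accuracy(predicted_labels, actual_labels):
    """Fraction of test samples whose predicted label equals the actual
    class (identical -> correct classification, different -> error)."""
    if len(predicted_labels) != len(actual_labels):
        raise ValueError("label sequences must have equal length")
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return correct / len(actual_labels)

# Hypothetical example: 8 reserved test samples, 6 predicted correctly.
predicted = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg"]
actual    = ["pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg"]
print(classification_accuracy(predicted, actual))  # -> 0.75
```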
For comparison, this embodiment also trains a classifier using only the labeled samples of the four topics from before 2002, classifies the same test samples (the 800 texts reserved in each topic from after 2012) with it, and obtains that classifier's accuracy by comparing its predicted categories against the actual classes. Table 3 below contrasts the accuracy of the classifier trained only on labeled historical samples (without co-training on unlabeled samples) with the accuracy of the classifier of this application, which is co-trained with unlabeled samples.
Table 3
Category | Without unlabeled samples (ME) | With unlabeled samples (co-training)
---|---|---
electronic | 0.791 | 0.866
kitchen | 0.815 | 0.861
movie | 0.802 | 0.898
video | 0.780 | 0.835
As Table 3 shows, when only the labeled historical samples are used, without co-training on unlabeled samples, the classification accuracy is comparatively low for all four topics. The classifier of this application, obtained by co-training the labeled historical samples with the unlabeled current-stage samples, improves the accuracy substantially in each of the four groups of tests, which shows that the classification performance of the present solution is markedly better than that of the prior art.
Embodiment three
Embodiment three discloses a classification apparatus with temporal adaptability, corresponding to the temporally adaptive classification method disclosed in the embodiments above.
Corresponding to the method of embodiment one, and with reference to the structural diagram of the classification apparatus shown in Fig. 3, the apparatus may comprise a base classifier training module 100, a label annotation module 200, an iteration module 300 and a target classifier training module 400.
The base classifier training module 100 is configured to train a base classifier using the labeled historical sample set as training samples.
The base classifier training module 100 comprises a division unit, a first training unit and a second training unit.
The division unit is configured to divide the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the two attribute sets is empty and their union is the historical sample set.
The first training unit is configured to train a first base classifier on the first attribute set.
The second training unit is configured to train a second base classifier on the second attribute set.
The label annotation module 200 is configured to classify, using the base classifiers, part of the samples in the predetermined unlabeled sample set of the current stage, obtaining part samples carrying class labels.
The label annotation module 200 comprises a first annotation unit and a second annotation unit.
The first annotation unit is configured to classify a first part of the samples using the first base classifier, obtaining first-part samples carrying class labels.
The second annotation unit is configured to classify a second part of the samples using the second base classifier, obtaining second-part samples carrying class labels.
The iteration module 300 is configured to take, as new training samples, the labeled historical sample set together with those of the class-labeled part samples whose confidence exceeds a predetermined threshold, and to iterate the training, classification and training-sample update process until every sample in the unlabeled sample set carries a corresponding class label.
The iteration module 300 comprises a first adding unit, a second adding unit and an iteration unit.
The first adding unit is configured to add the first-part samples whose confidence exceeds the predetermined threshold to the first attribute set, obtaining a new first attribute set.
The second adding unit is configured to add the second-part samples whose confidence exceeds the predetermined threshold to the second attribute set, obtaining a new second attribute set.
The iteration unit is configured to take the new first attribute set and the new second attribute set as new training samples, and to iterate the training, classification and training-sample update process.
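For illustration only (not part of the claimed subject matter), the iteration just described — two disjoint attribute sets (views), one classifier per view, with above-threshold pseudo-labeled samples added back to the training pool each round — can be sketched as follows. The per-view mean "classifier", the confidence measure, the two-feature representation and all data are toy assumptions:

```python
# Sketch of the co-training iteration over two disjoint attribute sets.
# Each sample is a pair of features (one per view); each view trains a
# toy per-class-mean classifier, and each round the samples labeled
# with above-threshold confidence join the training pool.

def train_view(samples, view):
    """Per-class mean of one view (feature index) of each sample."""
    sums, counts = {}, {}
    for features, label in samples:
        sums[label] = sums.get(label, 0.0) + features[view]
        counts[label] = counts.get(label, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def label_view(model, features, view):
    """Nearest-centroid label in one view, with a distance-based confidence."""
    label = min(model, key=lambda y: abs(features[view] - model[y]))
    conf = 1.0 / (1.0 + abs(features[view] - model[label]))
    return label, conf

def co_train(labeled, unlabeled, threshold=0.6, rounds=10):
    pool = list(labeled)                    # labeled historical samples
    remaining = list(unlabeled)             # current-stage unlabeled samples
    for _ in range(rounds):
        if not remaining:
            break
        model0 = train_view(pool, 0)        # classifier on first attribute set
        model1 = train_view(pool, 1)        # classifier on second attribute set
        still = []
        for f in remaining:
            l0, c0 = label_view(model0, f, 0)
            l1, c1 = label_view(model1, f, 1)
            # keep the more confident view's label if it clears the threshold
            label, conf = (l0, c0) if c0 >= c1 else (l1, c1)
            if conf > threshold:
                pool.append((f, label))
            else:
                still.append(f)
        remaining = still
    # target classifiers retrained on all labeled samples, one per view
    return train_view(pool, 0), train_view(pool, 1)

labeled = [((0.0, 0.1), "neg"), ((1.0, 0.9), "pos")]
unlabeled = [(0.05, 0.12), (0.95, 0.88)]
m0, m1 = co_train(labeled, unlabeled)
```

In this toy run both unlabeled samples are confidently labeled in the first round, after which the per-view centroids are recomputed over the enlarged pool.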
The target classifier training module 400 is configured to train a target classifier on the labeled historical sample set together with all the class-labeled samples obtained by annotating the unlabeled sample set, so that test samples can be classified with the target classifier.
Corresponding to embodiment two, and with reference to the structural diagram of the classification apparatus shown in Fig. 4, the apparatus may further comprise an accuracy verification module 500, configured to verify the classification accuracy of the target classifier based on the predicted categories and the actual classes of the test samples.
Because the classification apparatus with temporal adaptability disclosed in embodiment three of the present invention corresponds to the classification method with temporal adaptability disclosed in embodiments one and two, its description here is comparatively brief; for the relevant details, refer to the description of the method in embodiments one and two, which is not repeated.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments can be understood by reference to one another.
For convenience of description, the system or apparatus above is described as being divided into various modules or units by function. When the application is implemented, the functions of the units may of course be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the application, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments, or in parts of the embodiments, of the application.
Finally, it should also be noted that, herein, relational terms such as first, second, third and fourth are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Absent further limitation, an element qualified by "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes it.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make further improvements and refinements without departing from the principles of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. A classification method with temporal adaptability, characterized by comprising:
training a base classifier using a labeled historical sample set as training samples;
classifying, with the base classifier, part of the samples in a predetermined unlabeled sample set of the current stage, to obtain part samples carrying class labels;
taking, as new training samples, the labeled historical sample set together with those of the class-labeled part samples whose confidence exceeds a predetermined threshold, and iterating the training, the classification and the updating of the training samples until every sample in the unlabeled sample set carries a corresponding class label;
training a target classifier on the labeled historical sample set together with all the class-labeled samples obtained by annotating the unlabeled sample set, so as to classify test samples with the target classifier.
2. The method according to claim 1, characterized in that training the base classifier using the labeled historical sample set as training samples comprises:
dividing the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the first attribute set and the second attribute set is empty and their union is the historical sample set;
training a first base classifier on the first attribute set;
training a second base classifier on the second attribute set.
3. The method according to claim 2, characterized in that classifying, with the base classifiers, part of the samples in the predetermined unlabeled sample set of the current stage to obtain part samples carrying class labels comprises:
classifying a first part of the samples with the first base classifier, to obtain first-part samples carrying class labels;
classifying a second part of the samples with the second base classifier, to obtain second-part samples carrying class labels.
4. The method according to claim 3, characterized in that taking, as new training samples, the labeled historical sample set together with those of the class-labeled part samples whose confidence exceeds the threshold, and iterating the training, the classification and the updating of the training samples, comprises:
adding the first-part samples whose confidence exceeds the predetermined threshold to the first attribute set, to obtain a new first attribute set;
adding the second-part samples whose confidence exceeds the predetermined threshold to the second attribute set, to obtain a new second attribute set;
taking the new first attribute set and the new second attribute set as new training samples, and iterating the training, the classification and the updating of the training samples.
5. The method according to any one of claims 1-4, characterized by further comprising:
verifying the classification accuracy of the target classifier based on the predicted categories and the actual classes of the test samples.
6. A classification apparatus with temporal adaptability, characterized by comprising:
a base classifier training module, configured to train a base classifier using a labeled historical sample set as training samples;
a label annotation module, configured to classify, with the base classifier, part of the samples in a predetermined unlabeled sample set of the current stage, obtaining part samples carrying class labels;
an iteration module, configured to take, as new training samples, the labeled historical sample set together with those of the class-labeled part samples whose confidence exceeds a predetermined threshold, and to iterate the training, classification and training-sample update process until every sample in the unlabeled sample set carries a corresponding class label;
a target classifier training module, configured to train a target classifier on the labeled historical sample set together with all the class-labeled samples obtained by annotating the unlabeled sample set, so as to classify test samples with the target classifier.
7. The apparatus according to claim 6, characterized in that the base classifier training module comprises:
a division unit, configured to divide the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the two attribute sets is empty and their union is the historical sample set;
a first training unit, configured to train a first base classifier on the first attribute set;
a second training unit, configured to train a second base classifier on the second attribute set.
8. The apparatus according to claim 7, characterized in that the label annotation module comprises:
a first annotation unit, configured to classify a first part of the samples with the first base classifier, obtaining first-part samples carrying class labels;
a second annotation unit, configured to classify a second part of the samples with the second base classifier, obtaining second-part samples carrying class labels.
9. The apparatus according to claim 8, characterized in that the iteration module comprises:
a first adding unit, configured to add the first-part samples whose confidence exceeds the predetermined threshold to the first attribute set, obtaining a new first attribute set;
a second adding unit, configured to add the second-part samples whose confidence exceeds the predetermined threshold to the second attribute set, obtaining a new second attribute set;
an iteration unit, configured to take the new first attribute set and the new second attribute set as new training samples, and to iterate the training, classification and training-sample update process.
10. The apparatus according to any one of claims 6-9, characterized by further comprising:
an accuracy verification module, configured to verify the classification accuracy of the target classifier based on the predicted categories and the actual classes of the test samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610685180.0A CN106126751A (en) | 2016-08-18 | 2016-08-18 | A kind of sorting technique with time availability and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106126751A true CN106126751A (en) | 2016-11-16 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication |
| PB01 | Publication | Application publication date: 20161116
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication |