CN106126751A - Classification method and device with temporal adaptability - Google Patents

Classification method and device with temporal adaptability

Info

Publication number
CN106126751A
CN106126751A (application CN201610685180.0A)
Authority
CN
China
Prior art keywords
sample
training
property set
classification
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610685180.0A
Other languages
Chinese (zh)
Inventor
Li Shoushan (李寿山)
Zhang Dong (张栋)
Zhou Guodong (周国栋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201610685180.0A priority Critical patent/CN106126751A/en
Publication of CN106126751A publication Critical patent/CN106126751A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a classification method and device with temporal adaptability. The method trains a base classifier using a set of already-labeled historical samples as training samples, then iteratively uses the base classifier to label a predetermined set of unlabeled current-stage samples. On this basis, the historical sample set is combined with the newly labeled current-stage samples to train a target classifier, which can subsequently be used to classify test samples. Because current-stage samples are added to the historical sample set when the target classifier is trained, the features of current-stage samples are taken into account alongside the labeled historical samples, so the finally trained classifier adapts to the classification task of current-stage samples and has good temporal adaptability. Moreover, because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.

Description

Classification method and device with temporal adaptability
Technical field
The invention belongs to the fields of natural language processing and pattern recognition, and in particular relates to a classification method and device with temporal adaptability.
Background technology
With the rapid development of the Internet, online transactions have become increasingly common, and the number of product reviews on the web has grown accordingly, forming a massive body of review text. These texts typically carry clear emotional coloring and are highly valuable: performing sentiment analysis and research on them can provide effective decision-making support for enterprises, governments, individuals, and others.
Sentiment classification is an important task in sentiment analysis; it classifies text mainly according to the viewpoint and attitude expressed by the author or reviewer. However, language evolves continuously, and the ways emotion is expressed often differ across time periods. Taking product reviews as an example, in the most recent review texts the use of some old words becomes rarer and may even disappear, while new sentiment-bearing words may appear. The word distributions of review texts from different time periods therefore differ considerably, which degrades the temporal adaptability of sentiment classification: when a classifier trained on previously labeled text is used to classify text produced in the current stage, its classification accuracy drops.
For this reason, most current sentiment-classification research simply assumes that the training set and the test set come from the same time period. However, this approach requires labeling current-stage samples (for example by expert annotation), which greatly increases the labeling workload. How to make full use of previously labeled samples to perform sentiment classification of current-stage test text while maintaining relatively high accuracy, so that sentiment classification has good temporal adaptability, has therefore become a research focus in this field.
Summary of the invention
In view of this, an object of the present invention is to provide a classification method and device with temporal adaptability, aiming to solve the above-described problem of existing sentiment-classification approaches, so that sentiment classification achieves better temporal adaptability.
To this end, the present invention discloses the following technical solution:
A classification method with temporal adaptability, comprising:
training a base classifier using a labeled historical sample set as training samples;
classifying, with the base classifier, part of the samples in a predetermined unlabeled current-stage sample set, to obtain a part of the samples with class labels;
taking, as new training samples, the labeled historical sample set together with those samples among the labeled part whose confidence exceeds a predetermined threshold, and iterating the training, the classification, and the training-sample update process until every sample in the unlabeled sample set has a corresponding class label;
training a target classifier based on the labeled historical sample set and all the labeled samples corresponding to the unlabeled sample set after labeling, so that test samples can be classified based on the target classifier.
In the above method, preferably, training the base classifier using the labeled historical sample set as training samples comprises:
dividing the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the first attribute set and the second attribute set is empty and their union is the historical sample set;
training a first base classifier based on the first attribute set;
training a second base classifier based on the second attribute set.
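The disjoint split just described can be sketched as a simple random partition; the random strategy is an assumption, since the patent only requires that the two attribute sets have an empty intersection and that their union be the whole labeled historical set:

```python
import random

def split_two(samples, seed=0):
    """Partition a labeled sample list into two disjoint subsets of similar size."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

samples = list(range(10))          # stand-ins for labeled historical samples
a, b = split_two(samples)
assert not set(a) & set(b)         # empty intersection
assert sorted(a + b) == samples    # union is the full historical set
```

A base classifier would then be trained on each half, as the next two steps state.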
In the above method, preferably, classifying, with the base classifier, part of the samples in the predetermined unlabeled current-stage sample set to obtain a part of the samples with class labels comprises:
classifying a first part of the samples with the first base classifier, obtaining a first part of the samples with class labels;
classifying a second part of the samples with the second base classifier, obtaining a second part of the samples with class labels.
In the above method, preferably, taking the labeled historical sample set and the high-confidence labeled samples as new training samples, and iterating the training, the classification, and the training-sample update process, comprises:
adding those samples in the first labeled part whose confidence exceeds the predetermined threshold to the first attribute set, obtaining a new first attribute set;
adding those samples in the second labeled part whose confidence exceeds the predetermined threshold to the second attribute set, obtaining a new second attribute set;
taking the first attribute set and the second attribute set as new training samples, and iterating the training, the classification, and the training-sample update process.
In the above method, preferably, the method further comprises:
verifying the classification accuracy of the target classifier based on the predicted categories and the actual categories of the test samples.
A classification device with temporal adaptability, comprising:
a base-classifier training module, configured to train a base classifier using the labeled historical sample set as training samples;
a label-annotation module, configured to classify, with the base classifier, part of the samples in the predetermined unlabeled current-stage sample set, obtaining a part of the samples with class labels;
an iteration module, configured to take, as new training samples, the labeled historical sample set together with those labeled samples whose confidence exceeds the predetermined threshold, and to iterate the training, classification, and training-sample update process until every sample in the unlabeled sample set has a corresponding class label;
a target-classifier training module, configured to train a target classifier based on the labeled historical sample set and all the labeled samples corresponding to the unlabeled sample set after labeling, so that test samples can be classified based on the target classifier.
In the above device, preferably, the base-classifier training module comprises:
a division unit, configured to divide the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the two attribute sets is empty and their union is the historical sample set;
a first training unit, configured to train the first base classifier based on the first attribute set;
a second training unit, configured to train the second base classifier based on the second attribute set.
In the above device, preferably, the label-annotation module comprises:
a first annotation unit, configured to classify the first part of the samples with the first base classifier, obtaining a first part of the samples with class labels;
a second annotation unit, configured to classify the second part of the samples with the second base classifier, obtaining a second part of the samples with class labels.
In the above device, preferably, the iteration module comprises:
a first adding unit, configured to add those samples in the first labeled part whose confidence exceeds the predetermined threshold to the first attribute set, obtaining a new first attribute set;
a second adding unit, configured to add those samples in the second labeled part whose confidence exceeds the predetermined threshold to the second attribute set, obtaining a new second attribute set;
an iteration unit, configured to take the first attribute set and the second attribute set as new training samples and to iterate the training, the classification, and the training-sample update process.
In the above device, preferably, the device further comprises:
an accuracy-verification module, configured to verify the classification accuracy of the target classifier based on the predicted categories and the actual categories of the test samples.
It can be seen from the above solution that the disclosed classification method with temporal adaptability trains a base classifier on the labeled historical sample set, iteratively labels the predetermined unlabeled current-stage samples with it, and then, on this basis, trains a target classifier on the historical sample set combined with the newly labeled current-stage samples; this target classifier can subsequently classify test samples. Because current-stage samples are added to the historical sample set when the target classifier is trained, the features of current-stage samples are considered alongside the labeled historical samples, so the finally trained classifier adapts to the classification task of current-stage samples and has good temporal adaptability; and because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.
Brief description of the drawings
In order to explain the embodiments of the present invention or prior-art technical solutions more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the classification method with temporal adaptability provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the classification method with temporal adaptability provided by Embodiment 2 of the present invention;
Figs. 3-4 are structural diagrams of the classification device with temporal adaptability provided by Embodiment 3 of the present invention.
Detailed description of the invention
For ease of reference and understanding, the technical terms, abbreviations, and shorthand used below are explained as follows:
Semi-supervised learning (SSL): an important research topic in pattern recognition and machine learning, and a learning paradigm that combines supervised and unsupervised learning. It mainly addresses how to train and classify using a small number of labeled samples together with a large number of unlabeled samples, and divides broadly into semi-supervised classification, semi-supervised regression, semi-supervised clustering, and semi-supervised dimensionality-reduction algorithms.
Unigram features: single-word features; for example, a Chinese review sentence about Qin goat milk powder segments into the tokens 'Qin', 'goat', 'milk powder', 'how', 'recognize', 'real-or-fake'.
Machine-learning classification methods: statistical learning methods for constructing classifiers, whose input is a vector representing a sample and whose output is the sample's class label. Depending on the learning algorithm, common classification methods include naive Bayes, maximum-entropy classification, and support vector machines; the present invention uses maximum-entropy classification.
Temporal adaptability (time adaptation): when determining the sentiment polarity of text produced in the current stage, using previously labeled text of the same domain, rather than labeled current-stage text, as training samples to predict the sentiment of current text.
Sentiment classification: classifying a given text into the correct sentiment category according to its sentiment polarity; in general, the categories are positive evaluation and negative evaluation.
Data extraction: obtaining, from a large amount of unordered data, the data of each category and each time period that is required. The data required by the present invention must be separated by a long time interval, so data from before 2002 and data from after 2012 may be selected as experimental data. A program filters out the unneeded data and stores the useful data on a local machine.
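A minimal sketch of this year-based filtering, assuming a simple record format (the field names and sample reviews are illustrative, not the patent's actual data):

```python
reviews = [
    {"text": "great toaster", "year": 1999},
    {"text": "works fine", "year": 2008},
    {"text": "retro gaming fun", "year": 2014},
]

# Keep only the two widely separated periods the experiments use.
historical = [r for r in reviews if r["year"] < 2002]   # labeled historical data
current = [r for r in reviews if r["year"] > 2012]      # unlabeled current-stage data

assert [r["year"] for r in historical] == [1999]
assert [r["year"] for r in current] == [2014]
```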
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment one
Embodiment 1 of the present application discloses a classification method with temporal adaptability. The method is applicable to, but not limited to, sentiment-polarity classification of text data. Referring to the flowchart shown in Fig. 1, the method may comprise the following steps:
S101: train a base classifier using the labeled historical sample set as training samples.
This embodiment illustrates the method with sentiment classification as the example; specifically, it proposes a co-training-based sentiment-classification scheme with temporal adaptability.
Co-training is a commonly used semi-supervised learning method, first proposed by Blum and Mitchell in the 1990s. For a given labeled sample set, they assume there exist two sufficiently redundant views, i.e. two attribute sets satisfying the following conditions: each attribute set can describe the problem well, meaning that if the training set is sufficient, a strong classifier can be trained on each attribute set; and the two attribute sets are conditionally independent of each other. The basic idea is: build one classifier on each of the two views from the labeled samples; use the two classifiers to classify, and thereby label, the unlabeled samples; select some high-confidence samples from each classifier's labeled output and add them to the corresponding labeled sample set; retrain the two classifiers with the two updated labeled sets as training sets; and iterate this process until a stopping condition is met.
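The co-training loop described above can be sketched as follows. This is a minimal pure-Python illustration: a toy word-overlap scorer stands in for real classifiers, the data and the view split are made up, and each classifier's most confident pseudo-labeled sample is added to the other view's training set (the usual co-training convention; the patent's own wording about which set receives the samples is not fully explicit in translation).

```python
def train(view):
    """'Train' a toy classifier: collect the word set of each class (0/1)."""
    words = {0: set(), 1: set()}
    for text, label in view:
        words[label].update(text.split())
    return words

def predict(model, text):
    """Return (label, confidence) by word overlap with each class's word set."""
    toks = set(text.split())
    s1, s0 = len(toks & model[1]), len(toks & model[0])
    return (1, s1 - s0) if s1 >= s0 else (0, s0 - s1)

labeled = [("love this case", 1), ("well made and strong", 1),
           ("total junk", 0), ("broke after a week", 0)]
unlabeled = ["really strong build", "junk avoid", "love it", "broke fast"]

# Disjoint split of the labeled set into two views, each keeping both classes.
view1 = [labeled[0], labeled[2]]
view2 = [labeled[1], labeled[3]]

pool, pseudo = list(unlabeled), {}
while pool:
    m1, m2 = train(view1), train(view2)
    for model, other_view in ((m1, view2), (m2, view1)):
        if not pool:
            break
        scored = [(predict(model, t), t) for t in pool]
        (label, _), text = max(scored, key=lambda s: s[0][1])  # most confident
        pool.remove(text)
        other_view.append((text, label))   # cross-add to the other view
        pseudo[text] = label

assert set(pseudo) == set(unlabeled)       # every pool sample got a label
```

The loop terminates exactly when the unlabeled pool is exhausted, matching the patent's stopping condition that every unlabeled sample receives a class label.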
The basic concept of the co-training-based temporally adaptive sentiment-classification method of the present invention is: extract review texts of certain topics from different time periods, including labeled text from the historical period and unlabeled text from the current stage; then, based on the labeled historical text and the unlabeled current-stage text, use co-training to train a sentiment classifier with temporal adaptability; subsequently use this sentiment classifier to classify the current-stage test samples.
The sample data required by the present invention must be separated by a long time interval; on this basis, the invention selects text data from before 2002 and text data from after 2012 as the sample data of this embodiment. Specifically, a data-extraction program extracts positive- and negative-class reviews for four topics of Amazon product reviews (electronic, kitchen, movies, and video), selecting, for each topic, reviews from before 2002 (the labeled historical samples) and reviews produced after 2012 (the unlabeled current-stage samples), and extracting 2,000 positive and 2,000 negative reviews per topic and period, i.e. 8,000 reviews per topic and 32,000 reviews across the four topics. Examples of the extracted review texts of the four topics are given in Table 1 below.
Table 1
The data of each topic is then partitioned: the labeled texts of each topic from before 2002 serve as the initial training samples; 3,200 review texts of each topic from after 2012 serve as the unlabeled samples used for co-training; and the remaining 800 samples of each topic from after 2012 serve as the current-stage test samples, used to verify the accuracy of the finally trained target classifier.
On the basis of the above, step S101 applies the idea of co-training: the extracted labeled texts from before 2002, i.e. the historical sample set, serve as the initial training samples, and this training set is divided into two attribute sets, a first attribute set and a second attribute set. In general, the division should give the two attribute sets comparable numbers of samples; the intersection of the first and second attribute sets is empty, and their union is the historical sample set.
A classifier is then trained on each attribute set, yielding two base classifiers from the initial training samples: a first base classifier and a second base classifier.
S102: classify, with the base classifiers, part of the samples in the predetermined unlabeled current-stage sample set, obtaining a part of the samples with class labels.
Next, the base classifiers are used to classify, and thereby label, part of the unlabeled current-stage samples that are to participate in co-training. Specifically, the first base classifier performs sentiment classification on part of the 3,200 after-2012 samples of each topic, yielding, for each classified sample, confidence values for the sentiment polarities (the probability of positive polarity and the probability of negative polarity); the polarity with the higher confidence is taken as the sample's class label, thereby labeling this part of the samples.
Meanwhile, the second base classifier performs sentiment classification on another part of the remaining unlabeled samples among the 3,200 after-2012 samples of each topic, obtaining the sentiment-polarity confidences of the classified samples, thereby classifying that part of the samples and annotating them with class labels.
The present invention represents text with TF (term frequency) vectors: each component of a text's vector is the frequency with which the corresponding word occurs in the text, and the text's vector serves as the input of the classifier built by the machine-learning classification method. In this embodiment, the TF vector is obtained from the text's unigram features; unigram examples for the texts of the four topics above are given in Table 2 below.
Table 2

Theme | Example unigram features
electronic | 'This', 'case', 'is', 'junk'
kitchen | 'Well', 'made', 'tough', 'and', 'strong'
movie | 'Would', 'have', 'been', '5', 'star', 'but', 'read', 'book'
video | 'If', 'you're', 'into', 'retro', 'gaming'
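The TF representation described above can be sketched as follows (a minimal stand-in; the real feature space is the full unigram lexicon of the corpus):

```python
from collections import Counter

def tf_vector(text, vocab):
    """Each component is the count of the corresponding vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["this", "case", "is", "junk", "well", "made"]
vec = tf_vector("This case is junk junk", vocab)
assert vec == [1, 1, 1, 2, 0, 0]   # 'junk' occurs twice; 'well'/'made' are absent
```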
On this basis, the classifier of this embodiment uses maximum-entropy classification to classify the sample texts. Maximum-entropy classification is based on maximum-entropy information theory; its basic idea is to build a model for all known factors while excluding all unknown factors, i.e. to find a probability distribution that satisfies all known facts but leaves the unknown factors maximally random. Compared with naive Bayes, the greatest feature of the method is exactly that it does not require conditional independence between features; it is therefore suitable for combining various different features without having to consider the interactions between them.
Under the maximum-entropy model, the formula for the predicted conditional probability P(c|D) is as follows:

$$P(c_i \mid D) = \frac{1}{Z(D)} \exp\Big(\sum_k \lambda_{k,c} F_{k,c}(D, c_i)\Big)$$

where Z(D) is the normalization factor and F_{k,c} is the feature function, defined as:

$$F_{k,c}(D, c') = \begin{cases} 1, & n_k(D) > 0 \ \text{and} \ c' = c \\ 0, & \text{otherwise} \end{cases}$$

λ_{k,c} is the parameter for each feature function F_{k,c} in the model and controls the feature's weight in the overall formula. Z(D) is the normalization factor conditioned on the observation sequence D (the set of all distinct words in the data, i.e. the lexicon); it decomposes the complicated joint distribution into a product of factors, its essence being that Z(D) normalizes the values of the conditional probability distribution P(c_i|D) of class c_i given D. Learning a maximum-entropy model is exactly estimating these parameters relating c and D. n_k(D) denotes the number of times the k-th word of the feature lexicon D occurs in a review. c' denotes the context words of the currently predicted word c; for example, for the review 'I like this product', when predicting the probability of the word 'like', the words 'I', 'this', and 'product' are the context of 'like'.
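Numerically, the prediction formula above amounts to a softmax over per-class feature-weight sums. The sketch below uses made-up weights; the real λ_{k,c} are estimated during training:

```python
import math

def maxent_probs(active_features, weights, classes):
    """P(c|D) = exp(sum of lambda_{k,c} over active features) / Z(D)."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in active_features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())   # normalization factor Z(D)
    return {c: math.exp(scores[c]) / z for c in scores}

weights = {("love", "pos"): 2.0, ("junk", "neg"): 2.0}   # illustrative lambdas
p = maxent_probs({"love", "it"}, weights, ["pos", "neg"])
assert abs(sum(p.values()) - 1.0) < 1e-9   # probabilities normalize to 1
assert p["pos"] > p["neg"]                 # 'love' pushes toward positive
```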
S103: take, as new training samples, the labeled historical sample set together with those labeled samples whose confidence exceeds the predetermined value, and iterate the training, the classification, and the training-sample update process until every sample in the unlabeled sample set has a corresponding class label.
From the part of the current-stage samples that was classified and label-annotated by the first base classifier, a portion of higher-confidence samples is selected: specifically, the P positive-polarity samples and the N negative-polarity samples with the highest confidence are filtered out (the screening can be done by setting a minimum confidence threshold), and the filtered samples are added to the second attribute set to constitute a new training sample set, P and N being natural numbers.
Similarly, from the part of the current-stage samples that was classified and label-annotated by the second base classifier, a portion of higher-confidence samples is selected, namely the P positive-polarity samples and the N negative-polarity samples with the highest confidence, and the filtered samples are added to the first attribute set to constitute a new training sample set.
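The confidence-based screening just described can be sketched as follows; the threshold, P, N, and the sample tuples are illustrative assumptions:

```python
def select_confident(samples, p, n, threshold=0.6):
    """samples: (text, label, confidence) triples; keep the top-p 'pos' and
    top-n 'neg' samples among those whose confidence clears the threshold."""
    ok = [s for s in samples if s[2] >= threshold]
    pos = sorted((s for s in ok if s[1] == "pos"), key=lambda s: -s[2])[:p]
    neg = sorted((s for s in ok if s[1] == "neg"), key=lambda s: -s[2])[:n]
    return pos + neg

samples = [("a", "pos", 0.9), ("b", "pos", 0.55), ("c", "neg", 0.8),
           ("d", "neg", 0.7), ("e", "neg", 0.61)]
chosen = select_confident(samples, p=1, n=2)
assert [s[0] for s in chosen] == ["a", "c", "d"]   # 'b' falls below the threshold
```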
Then the classifiers are retrained on the new training samples corresponding to the two mutually independent attribute sets, and the two newly trained classifiers continue, by the above process, to classify and annotate the remaining unlabeled current-stage samples. Subsequently, as long as the current-stage samples participating in co-training are not yet fully labeled, the above process of training the classifiers and using the current classifiers to label unlabeled samples is iterated, until all the unlabeled current-stage samples that need to participate in co-training have been labeled.
S104: train a target classifier based on the labeled historical sample set and all the labeled samples corresponding to the unlabeled sample set after labeling, so that test samples can be classified based on the target classifier.
Once all the current-stage samples that need to participate in co-training have been labeled, i.e. once all 3,200 after-2012 samples of each of the above topics have been labeled, this step uses the labeled historical sample set together with the completed current-stage labeled sample set as training samples to co-train the target classifier. For the four-topic example of this embodiment, the extracted labeled samples from before 2002 and the 3,200 now fully labeled samples of each topic from after 2012 together serve as training samples to co-train a target classifier with temporal adaptability.
On this basis, the target classifier can subsequently be used to classify the current-stage test samples, for example the 800 samples of each topic reserved from after 2012. Because the target classifier fully took the features of current-stage samples into account when it was built, it achieves higher classification accuracy when classifying current-stage samples.
As can be seen from the above, because current-stage samples are added to the historical sample set when the target classifier is trained, the features of current-stage samples are considered alongside the labeled historical samples, so the finally trained classifier adapts to the classification task of current-stage samples and has good temporal adaptability; and because the labeled historical samples are fully exploited to predict the class labels of current-stage samples, the labeling workload for current-stage samples is greatly reduced.
Embodiment two
In this second embodiment, referring to the flowchart of the time-adaptive classification method shown in Fig. 2, the method may further comprise the following step:
S105: verifying the classification accuracy of the target classifier based on the predicted categories and the true categories of the test samples.
This embodiment verifies the accuracy of the target classifier obtained by co-training in Embodiment One. In the four-topic example provided in this application, the 800 post-2012 review texts held out for each topic serve as test samples, and the co-trained target classifier classifies them.
After classification, the predicted class labels are compared with the true categories of the 800 held-out review texts of each topic (a match counts as a correct classification, a mismatch as an error), which yields the accuracy of the target classifier and thereby verifies it.
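The comparison of predicted and true labels described here can be sketched in a few lines (the label strings are illustrative placeholders):

```python
# Minimal sketch of the accuracy check in S105: a prediction is correct
# when it matches the true category; accuracy = correct / total.
def accuracy(predicted, actual):
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"]))  # 0.75
```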
For comparison, this embodiment also trains a classifier using only the labeled pre-2002 samples of the four topics, uses it to classify the same test samples (the 800 post-2012 texts held out for each topic), and computes its classification accuracy from the predicted and true categories. Table 3 below compares the accuracy of this classifier trained only on labeled historical samples (without co-training on unlabeled samples) against that of the classifier co-trained with unlabeled samples as proposed in this application.
Table 3
Category      ME (without unlabeled samples)    Co-training (with unlabeled samples)
electronic 0.791 0.866
kitchen 0.815 0.861
movie 0.802 0.898
video 0.780 0.835
As Table 3 shows, when only the labeled historical samples are used, without co-training on unlabeled samples, the classification accuracy on all four topics is comparatively low, whereas the classifier of this application, co-trained on the labeled historical samples together with the unlabeled current-stage samples, achieves a substantial accuracy gain in each of the four experiments. This shows that the classification performance of the present scheme is markedly better than that of the prior art.
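For reference, the per-topic gains implied by Table 3 can be computed directly from the listed accuracies:

```python
# Per-topic accuracy gain of co-training over the ME baseline, from Table 3.
table3 = {"electronic": (0.791, 0.866), "kitchen": (0.815, 0.861),
          "movie": (0.802, 0.898), "video": (0.780, 0.835)}
gains = {topic: round(co - me, 3) for topic, (me, co) in table3.items()}
avg_gain = round(sum(co - me for me, co in table3.values()) / len(table3), 3)
print(gains)      # {'electronic': 0.075, 'kitchen': 0.046, 'movie': 0.096, 'video': 0.055}
print(avg_gain)   # 0.068
```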
Embodiment three
This third embodiment discloses a classification apparatus with time adaptability, which corresponds to the time-adaptive classification method disclosed in the embodiments above.
Corresponding to the method of Embodiment One, and referring to the structural diagram of the apparatus shown in Fig. 3, the apparatus may comprise a base classifier training module 100, a label annotation module 200, an iteration module 300 and a target classifier training module 400.
The base classifier training module 100 is configured to train a base classifier using the labeled historical sample set as training samples.
The base classifier training module 100 comprises a division unit, a first training unit and a second training unit.
The division unit is configured to divide the historical sample set into two attribute sets, a first attribute set and a second attribute set, whose intersection is empty and whose union is the historical sample set;
the first training unit is configured to train a first base classifier on the first attribute set;
the second training unit is configured to train a second base classifier on the second attribute set.
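The division unit and the two training units can be sketched as follows. The synthetic data, the even feature split and the naive-Bayes model are illustrative assumptions; the disclosure does not fix the classifier type or how the attributes are partitioned:

```python
# Sketch: split the labeled historical set into two attribute sets with empty
# intersection whose union covers all attributes, then train one base
# classifier per attribute set.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100)                    # class labels of historical samples
X = rng.normal(size=(100, 8)) + y[:, None]          # labeled historical set, 8 attributes

view1, view2 = np.arange(0, 4), np.arange(4, 8)     # division unit: two attribute sets
assert np.intersect1d(view1, view2).size == 0       # empty intersection
assert np.union1d(view1, view2).size == X.shape[1]  # union covers every attribute

clf1 = GaussianNB().fit(X[:, view1], y)             # first training unit -> first base classifier
clf2 = GaussianNB().fit(X[:, view2], y)             # second training unit -> second base classifier
```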
The label annotation module 200 is configured to classify, using the base classifiers, a predetermined portion of the samples in the unlabeled current-stage sample set, obtaining partial samples with class labels.
The label annotation module 200 comprises a first annotation unit and a second annotation unit.
The first annotation unit is configured to classify a first portion of the partial samples using the first base classifier, obtaining a first portion of samples with class labels;
the second annotation unit is configured to classify a second portion of the partial samples using the second base classifier, obtaining a second portion of samples with class labels.
The iteration module 300 is configured to take, as new training samples, the labeled historical sample set together with those labeled partial samples whose confidence exceeds a predetermined threshold, and to iterate the training, classification and training-sample update process until every sample in the unlabeled sample set has a corresponding class label.
The iteration module 300 comprises a first adding unit, a second adding unit and an iteration unit.
The first adding unit is configured to add, to the first attribute set, those samples of the labeled first portion whose confidence exceeds the predetermined threshold, obtaining a new first attribute set;
the second adding unit is configured to add, to the second attribute set, those samples of the labeled second portion whose confidence exceeds the predetermined threshold, obtaining a new second attribute set;
the iteration unit is configured to take the new first attribute set and the new second attribute set as new training samples and to iterate the training, classification and training-sample update process.
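The behaviour of the adding units can be sketched as a confidence filter. The threshold value, the probability matrix and the assumption that class labels equal the probability-column indices are illustrative, not taken from the disclosure:

```python
# Sketch of an adding unit: keep only candidate samples whose top class
# probability exceeds the predetermined threshold, append them (with their
# predicted labels) to the attribute set, and return the new attribute set.
import numpy as np

def add_high_confidence(attr_set, labels, candidates, proba, threshold=0.8):
    conf = proba.max(axis=1)                       # confidence = top class probability
    keep = conf > threshold
    new_X = np.vstack([attr_set, candidates[keep]])
    new_y = np.concatenate([labels, proba.argmax(axis=1)[keep]])
    return new_X, new_y

X1, y1 = np.zeros((2, 3)), np.array([0, 1])        # current first attribute set and labels
cand = np.ones((3, 3))                             # newly classified candidate samples
proba = np.array([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]])
X1_new, y1_new = add_high_confidence(X1, y1, cand, proba)
print(X1_new.shape, y1_new.tolist())  # (4, 3) [0, 1, 0, 1]
```

Here the middle candidate (confidence 0.6) is rejected, while the two confident ones are added with their predicted labels, yielding the new first attribute set.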
The target classifier training module 400 is configured to train a target classifier based on the labeled historical sample set and all the class-labeled samples obtained from the previously unlabeled sample set, so that test samples can be classified by the target classifier.
Corresponding to Embodiment Two, and referring to the structural diagram of the apparatus shown in Fig. 4, the apparatus may further comprise an accuracy verification module 500, configured to verify the classification accuracy of the target classifier based on the predicted categories and the true categories of the test samples.
Since the time-adaptive classification apparatus disclosed in this third embodiment of the present invention corresponds to the time-adaptive classification method disclosed in Embodiments One and Two, its description is comparatively brief; for the relevant similarities, refer to the description of the method in Embodiments One and Two, which is not repeated in detail here.
It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focusing on its differences from the others; for the identical or similar parts, the embodiments may be referred to one another.
For convenience of description, the system or apparatus above is described in terms of separate modules or units divided by function. Of course, when the present application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part of it contributing over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in parts of the embodiments, of the present application.
Finally, it should also be noted that, herein, relational terms such as first, second, third and fourth are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device including that element.
The above are merely preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A classification method with time adaptability, characterized by comprising:
training a base classifier using a labeled historical sample set as training samples;
classifying, using the base classifier, a predetermined portion of the samples in an unlabeled current-stage sample set, to obtain partial samples with class labels;
taking, as new training samples, the labeled historical sample set together with those samples among the labeled partial samples whose confidence exceeds a predetermined threshold, and iterating the training, the classification and the updating of the training samples until every sample in the unlabeled sample set has a corresponding class label; and
training a target classifier based on the labeled historical sample set and all the class-labeled samples obtained by labeling the unlabeled sample set, so that test samples are classified by the target classifier.
2. The method according to claim 1, characterized in that the training a base classifier using the labeled historical sample set as training samples comprises:
dividing the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the first attribute set and the second attribute set is empty and their union is the historical sample set;
training a first base classifier on the first attribute set; and
training a second base classifier on the second attribute set.
3. The method according to claim 2, characterized in that the classifying, using the base classifier, the predetermined portion of the samples in the unlabeled current-stage sample set to obtain partial samples with class labels comprises:
classifying a first portion of the partial samples using the first base classifier, to obtain a first portion of samples with class labels; and
classifying a second portion of the partial samples using the second base classifier, to obtain a second portion of samples with class labels.
4. The method according to claim 3, characterized in that the taking, as new training samples, the labeled historical sample set together with those labeled partial samples whose confidence exceeds the threshold, and iterating the training, the classification and the updating of the training samples comprises:
adding, to the first attribute set, those samples of the class-labeled first portion whose confidence exceeds the predetermined threshold, to obtain a new first attribute set;
adding, to the second attribute set, those samples of the class-labeled second portion whose confidence exceeds the predetermined threshold, to obtain a new second attribute set; and
taking the new first attribute set and the new second attribute set as new training samples, and iterating the training, the classification and the updating of the training samples.
5. The method according to any one of claims 1-4, characterized by further comprising:
verifying the classification accuracy of the target classifier based on the predicted categories and the true categories of the test samples.
6. A classification apparatus with time adaptability, characterized by comprising:
a base classifier training module, configured to train a base classifier using a labeled historical sample set as training samples;
a label annotation module, configured to classify, using the base classifier, a predetermined portion of the samples in an unlabeled current-stage sample set, to obtain partial samples with class labels;
an iteration module, configured to take, as new training samples, the labeled historical sample set together with those labeled partial samples whose confidence exceeds a predetermined threshold, and to iterate the training, classification and training-sample update process until every sample in the unlabeled sample set has a corresponding class label; and
a target classifier training module, configured to train a target classifier based on the labeled historical sample set and all the class-labeled samples obtained by labeling the unlabeled sample set, so that test samples are classified by the target classifier.
7. The apparatus according to claim 6, characterized in that the base classifier training module comprises:
a division unit, configured to divide the historical sample set into two attribute sets, a first attribute set and a second attribute set, wherein the intersection of the two attribute sets is empty and their union is the historical sample set;
a first training unit, configured to train a first base classifier on the first attribute set; and
a second training unit, configured to train a second base classifier on the second attribute set.
8. The apparatus according to claim 7, characterized in that the label annotation module comprises:
a first annotation unit, configured to classify a first portion of the partial samples using the first base classifier, to obtain a first portion of samples with class labels; and
a second annotation unit, configured to classify a second portion of the partial samples using the second base classifier, to obtain a second portion of samples with class labels.
9. The apparatus according to claim 8, characterized in that the iteration module comprises:
a first adding unit, configured to add, to the first attribute set, those samples of the class-labeled first portion whose confidence exceeds a predetermined threshold, to obtain a new first attribute set;
a second adding unit, configured to add, to the second attribute set, those samples of the class-labeled second portion whose confidence exceeds the predetermined threshold, to obtain a new second attribute set; and
an iteration unit, configured to take the new first attribute set and the new second attribute set as new training samples, and to iterate the training, classification and training-sample update process.
10. The apparatus according to any one of claims 6-9, characterized by further comprising:
an accuracy verification module, configured to verify the classification accuracy of the target classifier based on the predicted categories and the true categories of the test samples.
CN201610685180.0A 2016-08-18 2016-08-18 A kind of sorting technique with time availability and device Pending CN106126751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610685180.0A CN106126751A (en) 2016-08-18 2016-08-18 A kind of sorting technique with time availability and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610685180.0A CN106126751A (en) 2016-08-18 2016-08-18 A kind of sorting technique with time availability and device

Publications (1)

Publication Number Publication Date
CN106126751A true CN106126751A (en) 2016-11-16

Family

ID=57278915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610685180.0A Pending CN106126751A (en) 2016-08-18 2016-08-18 A kind of sorting technique with time availability and device

Country Status (1)

Country Link
CN (1) CN106126751A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617429A (en) * 2013-12-16 2014-03-05 苏州大学 Sorting method and system for active learning
CN103793510A (en) * 2014-01-29 2014-05-14 苏州融希信息科技有限公司 Classifier construction method based on active learning
CN105022845A (en) * 2015-08-26 2015-11-04 苏州大学张家港工业技术研究院 News classification method and system based on feature subspaces
CN105205044A (en) * 2015-08-26 2015-12-30 苏州大学张家港工业技术研究院 Emotion and non-emotion question classifying method and system


Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288161A (en) * 2017-01-10 2018-07-17 第四范式(北京)技术有限公司 The method and system of prediction result are provided based on machine learning
CN108628873A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of file classification method, device and equipment
CN108628873B (en) * 2017-03-17 2022-09-27 腾讯科技(北京)有限公司 Text classification method, device and equipment
CN108563786A (en) * 2018-04-26 2018-09-21 腾讯科技(深圳)有限公司 Text classification and methods of exhibiting, device, computer equipment and storage medium
CN110717880A (en) * 2018-07-11 2020-01-21 杭州海康威视数字技术股份有限公司 Defect detection method and device and electronic equipment
CN109191167A (en) * 2018-07-17 2019-01-11 阿里巴巴集团控股有限公司 A kind of method for digging and device of target user
CN110033486A (en) * 2019-04-19 2019-07-19 山东大学 Transparent crystal growth course edge and volume method of real-time and system
CN110335250A (en) * 2019-05-31 2019-10-15 上海联影智能医疗科技有限公司 Network training method, device, detection method, computer equipment and storage medium
CN110245235A (en) * 2019-06-24 2019-09-17 杭州微洱网络科技有限公司 A kind of text classification auxiliary mask method based on coorinated training
CN110310123A (en) * 2019-07-01 2019-10-08 阿里巴巴集团控股有限公司 Risk judgment method and apparatus
CN110310123B (en) * 2019-07-01 2023-09-26 创新先进技术有限公司 Risk judging method and device
CN112396084A (en) * 2019-08-19 2021-02-23 中国移动通信有限公司研究院 Data processing method, device, equipment and storage medium
CN110766652A (en) * 2019-09-06 2020-02-07 上海联影智能医疗科技有限公司 Network training method, device, segmentation method, computer equipment and storage medium
WO2021128704A1 (en) * 2019-12-25 2021-07-01 华南理工大学 Open set classification method based on classification utility
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN111897912B (en) * 2020-07-13 2021-04-06 上海乐言科技股份有限公司 Active learning short text classification method and system based on sampling frequency optimization
US20220067486A1 (en) * 2020-09-02 2022-03-03 Sap Se Collaborative learning of question generation and question answering
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium
CN114443849B (en) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Labeling sample selection method and device, electronic equipment and storage medium
US11907668B2 (en) 2022-02-09 2024-02-20 Beijing Baidu Netcom Science Technology Co., Ltd. Method for selecting annotated sample, apparatus, electronic device and storage medium
CN116824275A (en) * 2023-08-29 2023-09-29 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization
CN116824275B (en) * 2023-08-29 2023-11-17 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161116