CN102402713A - Machine learning method and device - Google Patents

Machine learning method and device Download PDF

Info

Publication number
CN102402713A
CN102402713A (application CN201010280239.0A)
Authority
CN
China
Prior art keywords
classifier
seed
instance set
label
utilize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102802390A
Other languages
Chinese (zh)
Other versions
CN102402713B (en)
Inventor
杨宇航
于浩
孟遥
陆应亮
夏迎炬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201010280239.0A priority Critical patent/CN102402713B/en
Publication of CN102402713A publication Critical patent/CN102402713A/en
Application granted granted Critical
Publication of CN102402713B publication Critical patent/CN102402713B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a machine learning method and a corresponding device. The machine learning method comprises the following steps: automatically labeling an unlabeled data set using different methods to obtain n different seed sets S1 to Sn, where n is a natural number and n ≥ 2; training n corresponding classifiers C1 to Cn with the n automatically labeled seed sets S1 to Sn, respectively; for each seed set Si among the n automatically labeled seed sets, i = 1 to n, verifying the seed set Si using some or all of the classifiers except the classifier Ci trained on the seed set Si; and retraining the n corresponding classifiers C1 to Cn with the n verified seed sets S1 to Sn, respectively.

Description

Machine learning method and device
Technical field
The present invention relates to the field of machine learning and, more specifically, to a fault-tolerant machine learning method and device.
Background art
Machine learning studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning methods and devices are widely used in tasks across many fields, for example computer vision, natural language processing, bioinformatics and so on.
Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. Generally speaking, unsupervised learning methods train a classifier with an unlabeled data set. Fig. 1 shows a schematic flowchart of an unsupervised machine learning method in the prior art. In step S110, the unlabeled data set is labeled randomly to obtain a training set. In step S120, a classifier is trained with the training set. In step S130, the trained classifier predicts the instance set to be processed. Unsupervised learning methods require no large manual labeling effort, but because the data set is unlabeled, the results may not be very satisfactory.
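Purely as an illustration of this prior-art flow (not the invention), a minimal sketch using scikit-learn; the random-labeling scheme and all names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unsupervised_baseline(unlabeled_X, pending_X, n_classes=2, seed=0):
    """Prior-art flow of Fig. 1: randomly label the data set (S110),
    train a classifier on it (S120), and predict the pending instance
    set (S130)."""
    rng = np.random.default_rng(seed)
    random_y = rng.integers(0, n_classes, size=len(unlabeled_X))  # S110
    clf = LogisticRegression().fit(unlabeled_X, random_y)         # S120
    return clf.predict(pending_X)                                 # S130
```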
Fig. 2 shows a schematic flowchart of a supervised machine learning method in the prior art. In step S210, a classifier is trained with a manually labeled training set. In step S220, the trained classifier predicts the instance set to be processed. Supervised learning methods use large amounts of manually proofread data and can therefore obtain better results. However, such methods are hard to port to resource-constrained fields or applications.
Machine learning methods therefore often face a dilemma: unsupervised methods may perform poorly, while supervised methods require large amounts of manpower and material resources to prepare a corpus.
To overcome this dilemma, semi-supervised learning methods have appeared. Fig. 3 shows a schematic flowchart of a semi-supervised machine learning method in the prior art. Compared with the unsupervised method of Fig. 1, the method of Fig. 3 trains the classifier not only with a training set obtained by randomly labeling the unlabeled data set, but also with a manually labeled training set. Fig. 4 shows a schematic flowchart of another semi-supervised machine learning method in the prior art. In the method of Fig. 4, a seed set is obtained by manual labeling in step S410, and a classifier is trained with this seed set in step S420. In addition, to improve the performance of the classifier, the classifier predicts the instance set to be processed in step S430; the instances with the highest confidence in the prediction results are added to the seed set in step S440; and the classifier is trained once more with the seed set augmented with those instances in step S450. Steps S430 to S450 are repeated until a specified termination condition is satisfied.
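A minimal sketch of this prior-art self-training loop, assuming a scikit-learn-style classifier and numeric feature arrays (all names and parameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(seed_X, seed_y, unlabeled_X, rounds=10, top_m=20):
    """Prior-art semi-supervised loop of Fig. 4: train on the seed set,
    predict the unlabeled pool, absorb the most confident predictions,
    and retrain until the loop ends."""
    clf = LogisticRegression()
    pool = list(range(len(unlabeled_X)))
    for _ in range(rounds):
        clf.fit(seed_X, seed_y)                       # S420 / S450
        if not pool:
            break
        probs = clf.predict_proba(unlabeled_X[pool])  # S430
        best = np.argsort(probs.max(axis=1))[-top_m:]
        picked = [pool[i] for i in best]              # S440: highest confidence
        seed_X = np.vstack([seed_X, unlabeled_X[picked]])
        seed_y = np.concatenate(
            [seed_y, clf.classes_[probs[best].argmax(axis=1)]])
        pool = [p for p in pool if p not in picked]
    return clf
```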
Semi-supervised methods can use labeled and unlabeled corpora simultaneously, but they still depend heavily on the scale and quality of the labeled corpus. Finding a balance between the degree of manual participation and performance remains a major challenge in the machine learning field.
Summary of the invention
The following presents a brief summary of the present invention in order to provide a basic understanding of some aspects of the invention. It should be appreciated that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
In view of the above situation of the prior art, the present invention aims to provide an efficient, fault-tolerant machine learning method and device.
According to an aspect of the present invention, a machine learning method comprises: automatically labeling an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2; training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verifying the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
According to a further aspect of the present invention, a machine learning device comprises: an initialization unit configured to: automatically label an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2; train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; and, for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verify the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and an optimization and processing unit configured to: retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
In the above method and device, the unlabeled data set is labeled automatically by different methods, so no manual participation is needed and learning efficiency is improved. In addition, cross-validating the seed sets with the classifiers and retraining the corresponding classifiers with the cross-validated seed sets effectively controls the noise introduced by automatic labeling and realizes fault-tolerant learning. A minimal sketch of this initialization phase is given below.
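The sketch assumes numeric feature arrays, scikit-learn-style classifiers, unanimous agreement as the verification rule, and that enough seeds of each class survive verification; all names are illustrative, not the patent's notation. This helper is reused in the sketches that follow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def initialize(seed_sets, make_classifier=LogisticRegression):
    """Initialization of the proposed scheme: train one classifier per
    automatically labeled seed set (X, y), cross-verify each seed set
    with the other classifiers, delete seeds whose automatic label
    disagrees, and retrain on the verified seed sets."""
    clfs = [make_classifier().fit(X, y) for X, y in seed_sets]
    verified = []
    for i, (X, y) in enumerate(seed_sets):
        others = [c for j, c in enumerate(clfs) if j != i]
        keep = np.all([c.predict(X) == y for c in others], axis=0)
        verified.append((X[keep], y[keep]))
    clfs = [make_classifier().fit(X, y) for X, y in verified]
    return clfs, verified
```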
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention in conjunction with the accompanying drawings.
Brief description of the drawings
With reference to the following description in conjunction with the accompanying drawings, the above and other objects, features and advantages of embodiments of the present invention can be understood more easily. The components in the drawings are merely intended to illustrate the principles of the present invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 shows a schematic flowchart of an unsupervised machine learning method in the prior art.
Fig. 2 shows a schematic flowchart of a supervised machine learning method in the prior art.
Fig. 3 shows a schematic flowchart of a semi-supervised machine learning method in the prior art.
Fig. 4 shows a schematic flowchart of another semi-supervised machine learning method in the prior art.
Fig. 5 shows a schematic flowchart of a machine learning method according to an embodiment of the present invention.
Fig. 6 shows a schematic flowchart of a machine learning method using two classifiers according to an embodiment of the present invention.
Fig. 7 shows a schematic flowchart of a machine learning method using three classifiers according to an embodiment of the present invention.
Fig. 8 shows a schematic block diagram of a machine learning device according to an embodiment of the present invention.
Fig. 9 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.
Embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or embodiment of the present invention can be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for the sake of clarity, the drawings and description omit representations and descriptions of components and processing that are irrelevant to the present invention or well known to those of ordinary skill in the art.
In view of the prior-art challenge of balancing the degree of manual participation against performance, the present inventors have proposed a fault-tolerant learning method to overcome this problem.
The concept of fault tolerance was first proposed in computer architecture. It refers to a technique whereby, when data or files in a system are corrupted or lost for various reasons, the system automatically restores the corrupted or lost files and data to their state before the accident, so that the system can continue to run normally.
The fault-tolerant learning method and device according to embodiments of the present invention learn from an automatically labeled corpus rather than from a manually labeled corpus or prior knowledge. They constitute a fully automatic machine learning method and are therefore easily applied to any specific field or task. In addition, the method and device achieve fault tolerance by training different classifiers used respectively for verification and for further prediction, thereby guaranteeing improved performance.
Machine learning methods and devices according to embodiments of the present invention are described below in conjunction with Figs. 5-8.
Fig. 5 shows a schematic flowchart of a machine learning method according to an embodiment of the present invention. As shown in the figure, in step S510, an unlabeled data set is labeled automatically using different methods to obtain a plurality of different seed sets. Various automatic methods can be used here to label the data set, and those skilled in the art can select suitable automatic methods based on the application scenario. For example, in a terminology extraction scenario, the data set can be labeled with the TF-IDF-based terminology extraction method proposed by G. Salton and M. J. McGill in Introduction to Modern Information Retrieval, McGraw-Hill, 1983, or with the indicator-word-based terminology extraction method proposed by Yuhang Yang, Qin Lu and Tiejun Zhao in "Chinese Term Extraction Using Minimal Resources", Proceedings of the 22nd International Conference on Computational Linguistics, pages 1033-1040, 2008. The resulting seed set contains the terms and non-terms judged by the automatic method.
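As an illustration only — not the exact procedure of either cited paper — a TF-IDF-style automatic seed labeling might score candidate words and take the top and bottom of the ranking as positive and negative seeds; the tokenization, cut-offs and function names below are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_seed_set(documents, top_k=50, bottom_k=50):
    """Illustrative automatic seed labeling: rank candidate words by their
    maximum TF-IDF weight; treat the highest-ranked as terms (label 1)
    and the lowest-ranked as non-terms (label 0)."""
    vec = TfidfVectorizer()
    weights = vec.fit_transform(documents)
    score = np.asarray(weights.max(axis=0).todense()).ravel()
    order = np.argsort(score)
    vocab = np.array(vec.get_feature_names_out())
    terms = vocab[order[-top_k:]]        # most term-like candidates
    non_terms = vocab[order[:bottom_k]]  # least term-like candidates
    return [(w, 1) for w in terms] + [(w, 0) for w in non_terms]
```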
Then, in step S520, a plurality of different classifiers are trained respectively with the automatically labeled seed sets, one classifier per seed set.
Then, in step S530, the different seed sets are cross-validated with the plurality of trained classifiers to obtain verified seed sets. That is, for a given seed set, some or all of the classifiers trained on the other seed sets are used to verify this seed set.
In step S540, the corresponding classifiers are retrained with the plurality of seed sets. That is, each classifier originally trained on a seed set is retrained with the updated version of that seed set.
Next, the instance set to be processed can be handled with the retrained classifiers. This can be done by reference to prior-art methods and is not shown here.
Preferably, in order to further improve performance, cross-validation can also be introduced into the processing of the instance set. Specifically, in step S550, the instance set to be processed is predicted with the retrained classifiers. In step S560, the predicted instance sets are cross-validated with the classifiers: similarly to step S530, for a predicted instance set, some or all of the classifiers other than the classifier used to predict this instance set can be used to verify it. Then, in step S570, the instances in the verified instance sets are added to the corresponding seed sets so that the corresponding classifiers can be retrained with the updated seed sets. That is, the instances in a verified instance set are added to the seed set used to train the classifier that predicted this instance set. Here, as an example, a number of the instances with the highest confidence in the verified instance set can be added to the seed set. Steps S540 to S570 are repeated until a repetition end condition (hereinafter also written as the iteration stop criterion) is satisfied. The end condition can be set as needed; as an example, the iteration can be terminated when the total number of seeds in all seed sets reaches the predetermined number of instances that need labeling.
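A minimal sketch of steps S540 to S570, assuming scikit-learn-style classifiers, unanimous agreement as the verification rule, and the seed-count stop criterion; m, N and max_rounds are illustrative parameters, and for brevity the sketch leaves absorbed instances in the unlabeled pool:

```python
import numpy as np

def iterate(clfs, seed_sets, unlabeled_X, make_classifier,
            m=20, N=1000, max_rounds=50):
    """Steps S540-S570: retrain each classifier on its seed set, predict
    the pending instances, cross-verify each predicted set with the other
    classifiers, absorb the verified instances into the corresponding
    seed set, and stop once the seeds total N instances."""
    for _ in range(max_rounds):
        if sum(len(y) for _, y in seed_sets) >= N:        # stop criterion
            break
        clfs = [make_classifier().fit(X, y) for X, y in seed_sets]  # S540
        for i, clf in enumerate(clfs):
            probs = clf.predict_proba(unlabeled_X)                  # S550
            best = np.argsort(probs.max(axis=1))[-m:]     # top-m confidence
            X_i = unlabeled_X[best]
            y_i = clf.classes_[probs[best].argmax(axis=1)]
            others = [c for j, c in enumerate(clfs) if j != i]      # S560
            keep = np.all([c.predict(X_i) == y_i for c in others], axis=0)
            X_s, y_s = seed_sets[i]                                 # S570
            seed_sets[i] = (np.vstack([X_s, X_i[keep]]),
                            np.concatenate([y_s, y_i[keep]]))
    return clfs, seed_sets
```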
In the above method, learning uses an automatically labeled corpus, not a manually labeled one. Automatically labeled seeds have a higher accuracy than randomly labeled ones, which makes obtaining seed sets with automatic methods more meaningful. In addition, training different classifiers from a plurality of relatively independent views (such as different seed sets, different feature sets, etc.) can make the verification process more effective.
In addition, because the above method uses automatically labeled corpora, noise may be present from the beginning and may increase after each iteration. In order to control the noise effectively and make the results more reliable, a plurality of classifiers are trained to verify the seed sets, and the classifiers are retrained with the verified seed sets, which alleviates noise and improves performance. Predicting and verifying the instance sets with the plurality of classifiers, and retraining the classifiers with the seed sets augmented with verified instances, further alleviates noise and further improves performance.
Fig. 6 shows a schematic flowchart of a machine learning method using two classifiers according to an embodiment of the present invention. In Fig. 6, the unlabeled data set D and the instance set U to be labeled are given, and the number of instances that need labeling is N.
First, one method is used to automatically generate a seed set S1, and another method is used to automatically generate a seed set S2.
Then, a first classifier C1 is trained with the seed set S1, and a second classifier C2 is trained with the seed set S2.
Then, the automatically labeled seed sets S1 and S2 are cross-validated with the classifiers C1 and C2. Specifically, the classifier C1 labels the seed set S2, and the classifier C2 labels the seed set S1. Seeds whose automatic labeling results are inconsistent with the classifiers' labeling results are deleted from the seed sets S1 and S2, respectively, yielding the verified seed sets S1 and S2.
As shown in block 610 of Fig. 6, the above steps can be collectively referred to as the initialization process.
In order to further improve performance, cross-validation can also be carried out in the processing of the instance set, as follows.
First, the classifier C1 is retrained with the seed set S1, and the classifier C2 is retrained with the seed set S2.
Then, the classifier C1 predicts the instances in the set U. Specifically, the classifier C1 labels the instances in the set U, and the m instances with the highest confidence in the labeling results are chosen to form the labeled instance set L1, i.e., the predicted instance set L1.
Likewise, the classifier C2 predicts the instances in the set U. Specifically, the classifier C2 labels the instances in the set U, and the m instances with the highest confidence in the labeling results are chosen to form the labeled instance set L2, i.e., the predicted instance set L2.
Then, the predicted instance sets L1 and L2 are cross-validated with the classifiers C1 and C2. Specifically, C2 relabels the instances in the instance set L1, and the instances in L1 whose C2 labeling results are inconsistent with the C1 prediction results are deleted, yielding the verified instance set L1. C1 relabels the instances in the instance set L2, and the instances in L2 whose C1 labeling results are inconsistent with the C2 prediction results are deleted, yielding the verified instance set L2.
Then, the instances in the set L1 are added to the seed set S1, and the instances in the set L2 are added to the seed set S2, completing one iteration.
Whether the iteration should stop can be judged at the beginning or end of an iteration. As the iteration stop criterion, for example, the iteration can be terminated when |S1 ∪ S2| ≥ N; otherwise the iteration continues.
As shown in block 620 of Fig. 6, the above steps can be collectively referred to as the iterative process. A sketch tying the two phases together is given below.
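The following hedged usage sketch for the two-classifier case of Fig. 6 reuses the initialize and iterate helpers sketched above; the synthetic data and the two projection-based labelers merely stand in for the data set D, the instance set U and the two automatic seed-generation methods:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 5))                        # stands in for data set D
U = rng.normal(size=(500, 5))                        # stands in for instance set U

def make_seed_set(X, w):                             # stand-in automatic labeler
    return X, (X @ w > 0).astype(int)

seed_sets = [make_seed_set(D, rng.normal(size=5)),   # block 610: S1 and S2
             make_seed_set(D, rng.normal(size=5))]
clfs, seed_sets = initialize(seed_sets, LogisticRegression)
clfs, seed_sets = iterate(clfs, seed_sets, U,        # block 620: iterate until
                          LogisticRegression,        # |S1 ∪ S2| >= N
                          m=20, N=700)
```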
Fig. 7 shows a schematic flowchart of a machine learning method using three classifiers according to an embodiment of the present invention. Compared with Fig. 6, three classifiers are used in the method of Fig. 7, but the individual steps are basically identical to those of Fig. 6 and are not repeated here.
It is worth noting that Fig. 7 shows the classifiers C2 and C3 verifying the automatically labeled seed set S1, the classifiers C1 and C3 verifying the automatically labeled seed set S2, and the classifiers C1 and C2 verifying the automatically labeled seed set S3. Seeds whose verification results are inconsistent with the automatic labeling results are deleted from the seed sets S1, S2 and S3, respectively, to obtain the verified seed sets S1, S2 and S3. However, a seed set can also be verified with only some of the other classifiers. For example, only the classifier C2 may be used to verify the seed set S1, only the classifier C3 may be used to verify the seed set S2, and so on. These options are not enumerated here.
Likewise, although Fig. 7 shows the classifiers C2 and C3 verifying the predicted instance set L1, the classifiers C1 and C3 verifying the predicted instance set L2, and the classifiers C1 and C2 verifying the predicted instance set L3, an instance set can also be verified with only some of the other classifiers. For example, only the classifier C2 may be used to verify the instance set L1, only the classifier C2 may be used to verify the instance set L3, and so on. These options are not enumerated here.
The above shows machine learning method examples using two classifiers and three classifiers, but this is for illustration purposes only and is not intended to limit the present invention thereto. Those skilled in the art will understand that the machine learning method according to embodiments of the present invention can be used with any other number of classifiers; this is not repeated here.
Fig. 8 shows a schematic block diagram of a machine learning device according to an embodiment of the present invention. As shown in the figure, the machine learning device 800 comprises an initialization unit 810 and an optimization and processing unit 820. According to one embodiment of the present invention, the initialization unit 810 is configured to: automatically label an unlabeled data set using different methods to obtain a plurality of different seed sets; train a corresponding plurality of classifiers with the plurality of automatically labeled seed sets, respectively; and, for each seed set among the plurality of automatically labeled seed sets, verify this seed set using some or all of the plurality of classifiers other than the classifier trained on this seed set. The optimization and processing unit 820 is configured to retrain the corresponding plurality of classifiers with the plurality of verified seed sets, respectively.
According to another embodiment of the present invention, the optimization and processing unit 820 is also configured to: predict an instance set with the plurality of retrained classifiers, respectively, to obtain a corresponding plurality of predicted instance sets; for each predicted instance set, verify this instance set using some or all of the plurality of classifiers other than the classifier used to predict this instance set; add the instances in each verified instance set to the corresponding seed set; and repeat the retraining, the predicting of the instance set, the verifying of each instance set and the adding of the instances in each verified instance set to the corresponding seed set, until a repetition end condition is met.
According to another embodiment of the present invention, the repetition end condition is that the total number of seeds in all the seed sets reaches the predetermined number of instances that need labeling.
According to another embodiment of the present invention, the optimization and processing unit 820 is further configured to: label the instance set with the plurality of classifiers, respectively; and choose the predetermined number of instances with the highest confidence in the labeling results of each of the plurality of classifiers, respectively, to form the corresponding plurality of predicted instance sets.
According to another embodiment of the present invention, the optimization and processing unit 820 is further configured to verify a predicted instance set as follows: label this instance set with some or all of the plurality of classifiers other than the classifier used to predict this instance set; and delete from this instance set the instances whose prediction results are inconsistent with the labeling results of the some or all classifiers.
According to another embodiment of the present invention, the initialization unit 810 is further configured to verify an automatically labeled seed set as follows: label this seed set with some or all of the plurality of classifiers other than the classifier trained on this seed set; and delete from this seed set the seeds whose automatic labeling results are inconsistent with the labeling results of the some or all classifiers.
Further details of the operation of the machine learning device according to embodiments of the present invention can be found with reference to the above-described embodiments of the method, and are not described in detail here.
In the above method and device, the unlabeled data set is labeled by automatic methods, so no manual participation is needed and learning efficiency is improved. In addition, cross-validating the seed sets with the classifiers and retraining the corresponding classifiers with the cross-validated seed sets effectively controls the noise introduced by automatic labeling and realizes fault-tolerant learning.
The method and device according to embodiments of the present invention place no restriction on the practical application scenario, nor any restriction on the type of classifier used, the classifier training method, and the like.
In addition, each component module or unit in the above device can be configured by means of software, firmware, hardware or a combination thereof. The specific means or manner of configuration is well known to those skilled in the art and is not repeated here. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network into a computer having a dedicated hardware structure, and the computer can perform various functions when the various programs are installed.
Fig. 9 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention. In Fig. 9, a central processing unit (CPU) 901 performs various processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random-access memory (RAM) 903. Data required when the CPU 901 performs various processing and the like are also stored in the RAM 903 as needed. The CPU 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.
The following components are connected to the input/output interface 905: an input section 906 (including a keyboard, a mouse and the like), an output section 907 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like), the storage section 908 (including a hard disk and the like), and a communication section 909 (including a network interface card such as a LAN card, a modem and the like). The communication section 909 performs communication processing via a network such as the Internet. A drive 910 can also be connected to the input/output interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory can be mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
In the case where the above series of processing is realized by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 911.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 911 shown in Fig. 9, in which the program is stored and which is distributed separately from the device to provide the program to the user. Examples of the removable medium 911 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium can be the ROM 902, a hard disk included in the storage section 908 or the like, in which the program is stored and which is distributed to the user together with the device containing it.
The present invention also proposes a program product storing machine-readable instruction codes. When the instruction codes are read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Correspondingly, a storage medium carrying the program product that stores the above machine-readable instruction codes is also included in the disclosure of the present invention. The storage medium includes but is not limited to a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment can be used in one or more other embodiments in the same or a similar way, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprise/include", when used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.
In addition, the methods of the present invention are not limited to being performed in the time order described in the specification; they can also be performed in another time order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of specific embodiments, it should be appreciated that all the above embodiments and examples are exemplary and not restrictive. Those skilled in the art can design various modifications, improvements or equivalents of the present invention within the spirit and scope of the appended claims. These modifications, improvements or equivalents should also be considered to be included in the protection scope of the present invention.
Supplementary notes
Note 1. A machine learning method, comprising:
automatically labeling an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2;
training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively;
for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verifying the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
Note 2. The method according to note 1, further comprising:
predicting an instance set with the n retrained classifiers, respectively, to obtain n corresponding predicted instance sets L1, L2, ..., Ln;
for each predicted instance set Li, i = 1, 2, ..., n, verifying the instance set Li using some or all of the n classifiers other than the classifier Ci used to predict the instance set Li;
adding the instances in each verified instance set Li to the corresponding seed set Si; and
repeating the retraining, the predicting of the instance set, the verifying of each instance set and the adding of the instances in each verified instance set to the corresponding seed set, until a repetition end condition is met.
Note 3. The method according to note 2, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of instances that need labeling.
Note 4. The method according to note 2, wherein predicting the instance set comprises:
labeling the instance set with the n classifiers, respectively; and
choosing the predetermined number of instances with the highest confidence in the labeling results of each of the n classifiers, respectively, to form the n corresponding predicted instance sets L1, L2, ..., Ln.
Note 5. The method according to note 2, wherein verifying the predicted instance set Li comprises:
labeling the instance set Li with some or all of the n classifiers other than the classifier Ci used to predict the instance set Li; and
deleting from the instance set Li the instances whose prediction results are inconsistent with the labeling results of the some or all classifiers.
Note 6. The method according to note 1, wherein verifying the automatically labeled seed set Si comprises:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds whose automatic labeling results are inconsistent with the labeling results of the some or all classifiers.
Note 7. A machine learning device, comprising:
an initialization unit configured to:
automatically label an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2;
train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; and
for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verify the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
an optimization and processing unit configured to:
retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
Note 8. The device according to note 7, wherein the optimization and processing unit is also configured to:
predict an instance set with the n retrained classifiers, respectively, to obtain n corresponding predicted instance sets L1, L2, ..., Ln;
for each predicted instance set Li, i = 1, 2, ..., n, verify the instance set Li using some or all of the n classifiers other than the classifier Ci used to predict the instance set Li;
add the instances in each verified instance set Li to the corresponding seed set Si; and
repeat the retraining, the predicting of the instance set, the verifying of each instance set and the adding of the instances in each verified instance set to the corresponding seed set, until a repetition end condition is met.
Note 9. The device according to note 8, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of instances that need labeling.
Note 10. The device according to note 8, wherein the optimization and processing unit is further configured to:
label the instance set with the n classifiers, respectively; and
choose the predetermined number of instances with the highest confidence in the labeling results of each of the n classifiers, respectively, to form the n corresponding predicted instance sets L1, L2, ..., Ln.
Note 11. The device according to note 8, wherein the optimization and processing unit is further configured to verify the predicted instance set Li by:
labeling the instance set Li with some or all of the n classifiers other than the classifier Ci used to predict the instance set Li; and
deleting from the instance set Li the instances whose prediction results are inconsistent with the labeling results of the some or all classifiers.
Note 12. The device according to note 7, wherein the initialization unit is further configured to verify the automatically labeled seed set Si by:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds whose automatic labeling results are inconsistent with the labeling results of the some or all classifiers.

Claims (10)

1. A machine learning method, comprising:
automatically labeling an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2;
training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively;
for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verifying the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
2. The method according to claim 1, further comprising:
predicting an instance set with the n retrained classifiers, respectively, to obtain n corresponding predicted instance sets L1, L2, ..., Ln;
for each predicted instance set Li, i = 1, 2, ..., n, verifying the instance set Li using some or all of the n classifiers other than the classifier Ci used to predict the instance set Li;
adding the instances in each verified instance set Li to the corresponding seed set Si; and
repeating the retraining, the predicting of the instance set, the verifying of each instance set and the adding of the instances in each verified instance set to the corresponding seed set, until a repetition end condition is met.
3. The method according to claim 2, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of instances that need labeling.
4. The method according to claim 2, wherein predicting the instance set comprises:
labeling the instance set with the n classifiers, respectively; and
choosing the predetermined number of instances with the highest confidence in the labeling results of each of the n classifiers, respectively, to form the n corresponding predicted instance sets L1, L2, ..., Ln.
5. The method according to claim 2, wherein verifying the predicted instance set Li comprises:
labeling the instance set Li with some or all of the n classifiers other than the classifier Ci used to predict the instance set Li; and
deleting from the instance set Li the instances whose prediction results are inconsistent with the labeling results of the some or all classifiers.
6. The method according to claim 1, wherein verifying the automatically labeled seed set Si comprises:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds whose automatic labeling results are inconsistent with the labeling results of the some or all classifiers.
7. A machine learning device, comprising:
an initialization unit configured to:
automatically label an unlabeled data set using different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n ≥ 2;
train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; and
for each seed set Si among the n automatically labeled seed sets, i = 1, 2, ..., n, verify the seed set Si using some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
an optimization and processing unit configured to:
retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
8. The device according to claim 7, wherein the optimization and processing unit is also configured to:
predict an instance set with the n retrained classifiers, respectively, to obtain n corresponding predicted instance sets L1, L2, ..., Ln;
for each predicted instance set Li, i = 1, 2, ..., n, verify the instance set Li using some or all of the n classifiers other than the classifier Ci used to predict the instance set Li;
add the instances in each verified instance set Li to the corresponding seed set Si; and
repeat the retraining, the predicting of the instance set, the verifying of each instance set and the adding of the instances in each verified instance set to the corresponding seed set, until a repetition end condition is met.
9. The device according to claim 8, wherein the optimization and processing unit is further configured to verify the predicted instance set Li by:
labeling the instance set Li with some or all of the n classifiers other than the classifier Ci used to predict the instance set Li; and
deleting from the instance set Li the instances whose prediction results are inconsistent with the labeling results of the some or all classifiers.
10. The device according to claim 7, wherein the initialization unit is further configured to verify the automatically labeled seed set Si by:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds whose automatic labeling results are inconsistent with the labeling results of the some or all classifiers.
CN201010280239.0A 2010-09-09 2010-09-09 machine learning method and device Expired - Fee Related CN102402713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010280239.0A CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010280239.0A CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Publications (2)

Publication Number Publication Date
CN102402713A true CN102402713A (en) 2012-04-04
CN102402713B CN102402713B (en) 2015-11-25

Family

ID=45884896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010280239.0A Expired - Fee Related CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Country Status (1)

Country Link
CN (1) CN102402713B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN108509969A (en) * 2017-09-06 2018-09-07 腾讯科技(深圳)有限公司 Data mask method and terminal
CN110147551A (en) * 2019-05-14 2019-08-20 腾讯科技(深圳)有限公司 Multi-class entity recognition model training, entity recognition method, server and terminal
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555345A (en) * 1991-03-25 1996-09-10 Atr Interpreting Telephony Research Laboratories Learning method of neural network
US20030233369A1 (en) * 2002-06-17 2003-12-18 Fujitsu Limited Data classifying device, and active learning method used by data classifying device and active learning program of data classifying device
CN1851703A (en) * 2006-05-10 2006-10-25 南京大学 Active semi-monitoring-related feedback method for digital image search
US20080281764A1 (en) * 2004-09-29 2008-11-13 Panscient Pty Ltd. Machine Learning System
CN101520847A (en) * 2008-02-29 2009-09-02 富士通株式会社 Pattern identification device and method
US20090228411A1 (en) * 2008-03-06 2009-09-10 Kddi Corporation Reducing method for support vector

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555345A (en) * 1991-03-25 1996-09-10 Atr Interpreting Telephony Research Laboratories Learning method of neural network
US20030233369A1 (en) * 2002-06-17 2003-12-18 Fujitsu Limited Data classifying device, and active learning method used by data classifying device and active learning program of data classifying device
US20080281764A1 (en) * 2004-09-29 2008-11-13 Panscient Pty Ltd. Machine Learning System
CN1851703A (en) * 2006-05-10 2006-10-25 南京大学 Active semi-monitoring-related feedback method for digital image search
CN101520847A (en) * 2008-02-29 2009-09-02 富士通株式会社 Pattern identification device and method
US20090228411A1 (en) * 2008-03-06 2009-09-10 Kddi Corporation Reducing method for support vector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LU ZHIMAO ET AL.: "Automatic semantic annotation method for full text based on unsupervised machine learning", ACTA AUTOMATICA SINICA *
LI QINGZHONG ET AL.: "Research on machine learning methods based on small-scale annotated corpora", JOURNAL OF COMPUTER APPLICATIONS *
WANG HAOCHANG ET AL.: "Classifier fusion method based on meta-learning strategy and its application", JOURNAL ON COMMUNICATIONS *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN108509969A (en) * 2017-09-06 2018-09-07 腾讯科技(深圳)有限公司 Data mask method and terminal
CN108509969B (en) * 2017-09-06 2021-11-09 腾讯科技(深圳)有限公司 Data labeling method and terminal
CN110147551A (en) * 2019-05-14 2019-08-20 腾讯科技(深圳)有限公司 Multi-class entity recognition model training, entity recognition method, server and terminal
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN112000808B (en) * 2020-09-29 2024-04-16 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium

Also Published As

Publication number Publication date
CN102402713B (en) 2015-11-25

Similar Documents

Publication Publication Date Title
US8280830B2 (en) Systems and methods for using multiple in-line heuristics to reduce false positives
CN110413786B (en) Data processing method based on webpage text classification, intelligent terminal and storage medium
CN103577989B (en) A kind of information classification approach and information classifying system based on product identification
CN107491432A (en) Low quality article recognition methods and device, equipment and medium based on artificial intelligence
CN107038157A (en) Identification error detection method, device and storage medium based on artificial intelligence
CN103299304A (en) Classification rule generation device, classification rule generation method, classification rule generation program and recording medium
CN102073704B (en) Text classification processing method, system and equipment
CN108536572B (en) Smart phone App use prediction method based on ApUage 2Vec model
CN110942763A (en) Voice recognition method and device
CN104778560A (en) Learning progress management and control method and device
CN111159414A (en) Text classification method and system, electronic equipment and computer readable storage medium
CN102402713A (en) Robot learning method and device
CN111062036A (en) Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN102291369A (en) Control method and corresponding control device for verifying junk information settings
CN103365849A (en) Keyword search method and equipment
CN111061837A (en) Topic identification method, device, equipment and medium
US8341538B1 (en) Systems and methods for reducing redundancies in quality-assurance reviews of graphical user interfaces
CN103309892A (en) Method and equipment for information processing and Web browsing history navigation and electronic device
US11151021B2 (en) Selecting test-templates using template-aware coverage data
CN112035345A (en) Mixed depth defect prediction method based on code segment analysis
CN107169011A (en) The original recognition methods of webpage based on artificial intelligence, device and storage medium
CN114253866A (en) Malicious code detection method and device, computer equipment and readable storage medium
CN106067889B (en) Electronic device and its method for uploading
CN106484913A (en) Method and server that a kind of Target Photo determines
CN104580109A (en) Method and device for generating click verification code

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20180909

CF01 Termination of patent right due to non-payment of annual fee