CN108897829A - Method, device and storage medium for correcting data labels
- Publication number: CN108897829A
- Application number: CN201810649534.5A
- Authority: CN (China)
- Prior art keywords: data, label, confidence level, modified, new
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for correcting data labels and relates to the field of machine learning. The method comprises the steps of: loading a data set to be modified; training a machine learning model to obtain a matching model; inputting the data of the test set into the matching model as input data and obtaining the matching result output by the matching model, so as to update the data label of each input data item; when the number of matching results obtained has not reached a preset value, constructing a new training set and a new test set based on the data of the data set to be modified; and when the number of matching results obtained reaches the preset value, calculating, for each data item in the data set to be modified, the amendment confidence level of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified. The invention also provides a device for correcting data labels and a storage medium. The invention can effectively improve the reliability of the data labels of the data, and thereby the accuracy of predictions made by the resulting machine learning model.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a method, device and storage medium for correcting data labels.
Background
In supervised learning tasks, training a machine learning or deep learning system requires a large amount of data annotated with corresponding data labels. In general, the more data is used for training and the higher its annotation quality, the better the trained model reflects the real situation and the more reliable its predictions on unknown data.
To improve the annotation quality of data, labels that match the data need to be found. In the prior art, the techniques commonly used to improve annotation quality include manual annotation, cross-entropy screening, information retrieval and data discarding. Manual annotation has people label the data. Cross-entropy screening divides the original corpus into multiple small sets, calculates the cross entropy of each small set, and treats the annotations of the set with the smallest entropy as reliable. Information retrieval relies on a fixed test set and retrieves relevant information to serve as the training set. Data discarding drops the data for which the similarity and the matching result disagree.
In the course of implementing the embodiments of the present invention, the inventors found that manual annotation depends on the knowledge and energy of the staff, cross-entropy screening likewise loses data information, information retrieval depends on a fixed test set and the quality of the retrieved data is difficult to guarantee, and data discarding risks losing important data. As a result, the data labels of existing data are of low reliability, and the resulting machine learning models predict with low accuracy.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method, device and storage medium for correcting data labels that can effectively improve the reliability of the data labels of the data, and thereby the accuracy of predictions made by the resulting machine learning model.
To achieve the above purpose, an embodiment of the present invention provides a method for correcting data labels, comprising the steps of:
loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and the test set are annotated with preset data labels;
training a machine learning model based on the current training set to obtain a matching model;
inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label;
when the number of matching results obtained has not reached a preset value, constructing a new training set and a new test set based on the data of the data set to be modified;
when the number of matching results obtained reaches the preset value, calculating, for each data item in the data set to be modified, the amendment confidence level of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified.
As an improvement of the above scheme, the matching result output by the current matching model also records, for each input data item, the confidence level ranking of each label.
As an improvement of the above scheme, calculating, for each data item in the data set to be modified, the amendment confidence level of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified, comprises:
for each data item of the data set to be modified, obtaining the confidence level and the confidence level ranking of each label from the matching results obtained, calculating the weighted value of the confidence level and the weighted value of the confidence level ranking of each label, and combining them with the preset data label to obtain the amendment confidence level of each label;
for each data item of the data set to be modified, taking the label with the highest amendment confidence level as the new data label of that data item.
As an improvement of the above scheme, for any data item of the data set to be modified, the amendment confidence level, the confidence level of the label, the confidence level ranking of the label and the preset data label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
where α and β are constants, k is any label, S(k) is the amendment confidence level of label k, f(k) is the weighted average of the confidence levels of label k, and g(k) is the weighted average of the reciprocals of the confidence level rankings of label k; h(k) = 1 if the preset data label of the data item is label k, and h(k) = 0 if it is not.
As an improvement of the above scheme, constructing a new training set and a new test set based on the data of the data set to be modified comprises the step of:
taking the data whose new data label is consistent with the preset data label as data in the new training set.
As an improvement of the above scheme, constructing a new training set and a new test set based on the data of the data set to be modified comprises the steps of:
taking the current test set as the former test set and the current training set as the former training set;
taking the data that belong to the former test set and whose new data label is consistent with the data label before the most recent update as training component one, and the other data of the former test set as test component one;
obtaining, from the data belonging to the former training set, data equal in number to training component one as test component two, and taking the other data of the former training set as training component two;
merging the data of training component one and training component two as the new training set;
merging the data of test component one and test component two as the new test set.
As an improvement of the above scheme, obtaining, from the data belonging to the former training set, data equal in number to training component one comprises the step of:
randomly obtaining, from the data belonging to the former training set, data equal in number to training component one.
An embodiment of the present invention also provides a device for correcting data labels, including a loading module, a training module, a matching module, an update module and a correction module.
The loading module is used for loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and the test set are annotated with preset data labels.
The training module is used for training a machine learning model based on the current training set to obtain a matching model.
The matching module is used for inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The update module is used for constructing a new training set and a new test set based on the data of the data set to be modified when the number of matching results obtained has not reached a preset value.
The correction module is used for calculating, when the number of matching results obtained reaches the preset value, the amendment confidence level of each label for each data item in the data set to be modified by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified.
An embodiment of the present invention also provides a device for correcting data labels, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the method for correcting data labels described in any one of the above when executing the computer program.
An embodiment of the present invention also provides a computer readable storage medium that includes a stored computer program, wherein, when the computer program runs, the equipment where the computer readable storage medium is located is controlled to execute the method for correcting data labels described in any one of the above.
Compared with the prior art, the method, device and storage medium for correcting data labels disclosed by the present invention train a matching model on the data in the training set of the data set to be modified and use the matching model to calculate the matching results between the data in the test set of the data set to be modified and the labels. While the number of matching results obtained has not reached the preset value, the training set and test set are regrouped into a new training set and a new test set, from which a new matching model and a new matching result are obtained. Once the number of matching results obtained reaches the preset value, the amendment confidence level is calculated and the new data label of each data item in the data set to be modified is updated, thereby correcting the data labels of the data of the data set to be modified. This solves the technical problem in the prior art that the quality of data annotation is difficult to guarantee, effectively improves the reliability of the data labels of the data, and improves the accuracy of predictions made by the resulting machine learning model.
Brief description of the drawings
Fig. 1 is a flow diagram of a method for correcting data labels in an embodiment of the present invention.
Fig. 2 is a flow diagram of step S140 of the correction method shown in Fig. 1.
Fig. 3 is a flow diagram of step S150 of the correction method shown in Fig. 1.
Fig. 4 is a structural schematic diagram of a device for correcting data labels in an embodiment of the present invention.
Fig. 5 is a structural schematic diagram of another device for correcting data labels in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flow diagram of a method for correcting data labels provided by an embodiment of the present invention. The correction method provided by Embodiment 1 of the present invention includes steps S110 to S150.
S110, loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and the test set are annotated with preset data labels.
Each data item in the data set to be modified belongs either to the training set or to the test set.
S120, training a machine learning model based on the current training set to obtain a matching model.
The current training set may be the preset training set in the data set to be modified, or it may be the new training set constructed in step S140; neither choice affects the beneficial effects obtainable by the present invention.
The machine learning model may be a preset learning model, or a learning model chosen according to the actual situation, such as a single machine learning model or an ensemble machine learning model; neither choice affects the beneficial effects obtainable by the present invention.
S130, inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The current test set may be the preset test set in the data set to be modified, or it may be the new test set constructed in step S140; neither choice affects the beneficial effects obtainable by the present invention.
For example, let the data set to be modified be A, the training set be Aa1 and the test set be Ab1, and assume A = {a, b, c, d, e}, Aa1 = {a, b, c} and Ab1 = {d, e}. The label types are label 1, label 2 and label 3: the preset data label of data a and data b is label 1, the preset data label of data c and data e is label 2, and the preset data label of data d is label 3. The machine learning model is trained on Aa1 to obtain matching model B1. With B1 as the current matching model and Ab1 as the current test set, the data of Ab1 are input into B1 as input data, and B1 outputs matching result C1. C1 records the confidence level of each label for each input data item, that is, the confidence levels with which data d matches label 1, label 2 and label 3. Assume the confidence levels obtained for data d are 0.6, 0.7 and 0.8 respectively. The confidence level 0.8 of label 3 is the highest, so label 3 becomes the new data label of data d, and the new data label of data d is consistent with its preset data label. C1 likewise records the confidence levels with which data e matches label 1, label 2 and label 3. Assume the confidence levels obtained for data e are 0.9, 0.5 and 0.4 respectively. The confidence level 0.9 of label 1 is the highest, so label 1 becomes the new data label of data e, and the new data label of data e is inconsistent with its preset data label. It should be understood that the above is only one example of implementing the present invention. In other cases, corresponding changes can be made according to the actual situation, for example including more data in the data set to be modified or changing the representation of the confidence level, without affecting the beneficial effects obtainable by the present invention.
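The label update in the example above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary layout and the helper name `pick_label` are assumptions made for the sketch, not part of the disclosed method, and the confidence values are the ones assumed in the example.

```python
def pick_label(confidences):
    """Return the label whose confidence level is highest."""
    return max(confidences, key=confidences.get)

# Preset data labels of the test-set data in the example.
preset = {"d": "label 3", "e": "label 2"}

# Matching result C1: for each input data item, the confidence level of each label.
c1 = {
    "d": {"label 1": 0.6, "label 2": 0.7, "label 3": 0.8},
    "e": {"label 1": 0.9, "label 2": 0.5, "label 3": 0.4},
}

# Update the data label of each input data item and compare it with the preset label.
new_labels = {item: pick_label(conf) for item, conf in c1.items()}
consistent = {item: new_labels[item] == preset[item] for item in c1}
```

Here data d keeps label 3 (consistent with its preset label), while data e receives label 1 (inconsistent with its preset label 2).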
As a preferred embodiment, the matching result may also record, for each input data item, the confidence level ranking of each label. Continuing the example above, C1 also records the confidence level ranking of each label for data d: label 1 has confidence level 0.6 and ranking 3, label 2 has confidence level 0.7 and ranking 2, and label 3 has confidence level 0.8 and ranking 1. C1 also records the confidence level ranking of each label for data e: the ranking of label 1 is 1, the ranking of label 2 is 2, and the ranking of label 3 is 3.
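As a rough sketch, the confidence level ranking can be derived from the confidence levels by sorting them in descending order. The function name `rank_labels` is hypothetical, and the handling of ties (stable sort order) is an assumption, since the text does not say how ties are ranked.

```python
def rank_labels(confidences):
    """Map each label to its ranking, where ranking 1 is the highest confidence level."""
    ordered = sorted(confidences, key=confidences.get, reverse=True)
    return {label: position for position, label in enumerate(ordered, start=1)}

# Confidence levels of data d and data e from the example.
ranks_d = rank_labels({"label 1": 0.6, "label 2": 0.7, "label 3": 0.8})
ranks_e = rank_labels({"label 1": 0.9, "label 2": 0.5, "label 3": 0.4})
```

For data d this gives ranking 3, 2 and 1 for labels 1, 2 and 3, and for data e ranking 1, 2 and 3, matching the values stated in the example.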
S140, when the number of matching results obtained has not reached a preset value, constructing a new training set and a new test set based on the data of the data set to be modified.
The preset value may be a specific number, such as 4, 5 or 6, or any larger or smaller number, without affecting the beneficial effects obtainable by the present invention.
As a preferred embodiment, the data whose new data label is consistent with the preset data label may be taken as data in the new training set. For example, in the example of step S130, the new data label of data d is consistent with its preset data label, so data d is taken as data in the new training set.
As a further preferred embodiment, referring to Fig. 2, step S140 may take the current test set as the former test set and the current training set as the former training set, and includes steps S141 to S144.
S141, taking the data that belong to the former test set and whose new data label is consistent with the data label before the most recent update as training component one, and the other data of the former test set as test component one.
For any data item in the former test set, if its data label has undergone the update only once, the data item is taken as part of training component one when its new data label is consistent with its preset data label. If its data label has undergone the update two or more times, for example three times, the data item is taken as part of training component one when its new data label is consistent with the data label it had before the third update.
S142, it is subordinated in the data of the former training set, obtains the quantity data equal with the trained component one, make
For fractions tested two, other data of the original training set are as training component two.
Preferably, it can be the data bulk according to the training component one obtained in step S141, from the former training set
In obtain the equal data of quantity at random, using as the fractions tested two, other data of the original training set are as training
Component two.
S143, merging the data of training component one and training component two as the new training set.
S144, merging the data of test component one and test component two as the new test set.
It should be understood that the execution order of step S143 and step S144 is interchangeable: step S144 may be executed before step S143, or step S143 and step S144 may be executed simultaneously, without affecting the beneficial effects obtainable by the present invention.
For example, following the example of step S130, before step S140 is executed, Aa1 = {a, b, c} and Ab1 = {d, e}. After step S130, the new data label of data d is consistent with its data label before the update, so after step S140 data d becomes data in the new training set Aa2. The new training set is Aa2 = {a, c, d} and the new test set is Ab2 = {b, e}; here data b is the one data item taken from the former training set as test component two.
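Steps S141 to S144 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: the function name `rebuild_sets`, the list and dictionary layout, and the cap on the sample size when the former training set is smaller than training component one are all assumptions; the random draw follows the preferred variant of step S142.

```python
import random

def rebuild_sets(train, test, new_labels, previous_labels, rng=None):
    """Build the new training set and new test set from the former ones."""
    if rng is None:
        rng = random.Random()
    # S141: test data whose new label agrees with the label before the latest update
    # become training component one; the rest of the test set is test component one.
    train_part_1 = [x for x in test if new_labels[x] == previous_labels[x]]
    test_part_1 = [x for x in test if new_labels[x] != previous_labels[x]]
    # S142: draw as many items from the former training set as training component one
    # holds (capped at the training-set size, an assumption not stated in the text).
    count = min(len(train_part_1), len(train))
    test_part_2 = rng.sample(train, count)
    moved = set(test_part_2)
    train_part_2 = [x for x in train if x not in moved]
    # S143 and S144: merge the components.
    return train_part_2 + train_part_1, test_part_1 + test_part_2

# Values from the example: data d kept its label, data e did not.
new_train, new_test = rebuild_sets(
    train=["a", "b", "c"],
    test=["d", "e"],
    new_labels={"d": "label 3", "e": "label 1"},
    previous_labels={"d": "label 3", "e": "label 2"},
    rng=random.Random(0),
)
```

Data d always ends up in the new training set and data e stays in the new test set; which of a, b or c moves to the test set depends on the random draw, so the result {a, c, d} and {b, e} from the example is one of three possible outcomes.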
After step S140 is executed, steps S120 and S130 can be executed again to obtain a new matching result. As long as the number of matching results obtained has not reached the preset value, steps S120, S130 and S140 are repeated until the number of matching results obtained reaches the preset value.
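The repetition of steps S120 to S140 can be sketched as a loop that stops once the preset number of matching results has been collected. Everything named here is hypothetical: `run_rounds`, the callback signatures, and the stand-in functions `toy_train` and `swap_sets` exist only to make the sketch runnable. Training on the most recently updated labels is also an assumption, since the text does not state which labels each retraining round uses.

```python
def run_rounds(train, test, labels, train_model, rebuild_sets, preset_count):
    """Repeat S120-S140 until `preset_count` matching results have been obtained."""
    results = []
    current = dict(labels)  # data labels as of the latest update
    while True:
        model = train_model(train, current)            # S120: obtain a matching model
        result = {item: model(item) for item in test}  # S130: confidence level per label
        results.append(result)
        updated = dict(current)
        for item, confidences in result.items():
            updated[item] = max(confidences, key=confidences.get)
        if len(results) >= preset_count:               # enough results: go on to S150
            return results
        train, test = rebuild_sets(train, test, updated, current)  # S140
        current = updated

# Stand-ins so the sketch runs; a real system would train an actual model here.
def toy_train(train, labels):
    def model(item):
        return {"label 1": 0.6, "label 2": 0.4}
    return model

def swap_sets(train, test, new_labels, old_labels):
    return list(test), list(train)

results = run_rounds(
    ["a", "b"], ["c"],
    {"a": "label 1", "b": "label 2", "c": "label 1"},
    toy_train, swap_sets, preset_count=3,
)
```

Each entry of `results` is one matching result covering the test set of that round, so different rounds generally cover different data items.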
S150, when the number of matching results obtained reaches the preset value, calculating, for each data item in the data set to be modified, the amendment confidence level of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified.
Preferably, referring to Fig. 3, step S150 may include steps S151 and S152.
S151, for each data item of the data set to be modified, obtaining the confidence level and the confidence level ranking of each label from the matching results obtained, calculating the weighted value of the confidence level and the weighted value of the confidence level ranking of each label, and combining them with the preset data label to obtain the amendment confidence level of each label.
More preferably, for any data item of the data set to be modified, the amendment confidence level, the confidence level of the label, the confidence level ranking of the label and the preset data label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
where α and β are constants, k is any label, S(k) is the amendment confidence level of label k, f(k) is the weighted average of the confidence levels of label k, and g(k) is the weighted average of the reciprocals of the confidence level rankings of label k; h(k) = 1 if the preset data label of the data item is label k, and h(k) = 0 if it is not.
Following the example of the preferred embodiment of step S130, the amendment confidence level of each label is calculated for each data item in A = {a, b, c, d, e}, that is, for data a, data b, data c, data d and data e in turn. Taking data a as an example, assume that over the matching results obtained for data a the weighted average of the confidence levels of label 1 is f(1) = 0.9 and the weighted average of the reciprocals of its confidence level rankings is g(1) = 1. The preset data label of data a is label 1, so h(1) = 1. With α = 0.4 and β = 0.3, S(1) = 0.4 * 0.9 + 0.6 * 1 + 0.3 * 1 = 1.26. S(2) for label 2 and S(3) for label 3 of data a, and the amendment confidence level of each label of the other data in A, can be obtained in the same way. It should be understood that in other cases corresponding changes can be made according to the actual situation, such as choosing different values of α and β, without affecting the beneficial effects obtainable by the present invention.
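The relationship S(k) = α * f(k) + (1 - α) * g(k) + β * h(k) translates directly into code. The function name `amendment_confidence` and its parameter names are chosen for this sketch; the numbers are the ones assumed for data a in the example.

```python
def amendment_confidence(f_k, g_k, is_preset, alpha, beta):
    """S(k) = alpha * f(k) + (1 - alpha) * g(k) + beta * h(k)."""
    h_k = 1 if is_preset else 0  # h(k) = 1 only when label k is the preset data label
    return alpha * f_k + (1 - alpha) * g_k + beta * h_k

# Data a, label 1: f(1) = 0.9, g(1) = 1, preset label is label 1, alpha = 0.4, beta = 0.3.
s_1 = amendment_confidence(0.9, 1.0, True, alpha=0.4, beta=0.3)
```

Up to floating-point rounding, `s_1` equals the value 1.26 given in the example. Because h(k) adds β only for the preset label, β controls how strongly the original annotation is favored over the model's output.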
More specifically, for each data item of the data set to be modified, the weighted value of the confidence level of any label satisfies the following relationship:
[formula not reproduced in this text]
where k is any label, f(k) is the weighted average of the confidence levels of label k, n is the number of matching results obtained, and the remaining term of the formula (its symbol is likewise not reproduced) is the confidence level of label k recorded in the i-th matching result.
More specifically, for each data item of the data set to be modified, the weighted value of the confidence level ranking of any label satisfies the following relationship:
[formula not reproduced in this text]
where k is any label, g(k) is the weighted average of the reciprocals of the confidence level rankings of label k, n is the number of matching results obtained, and the remaining term of the formula (its symbol is likewise not reproduced) is the confidence level ranking of label k recorded in the i-th matching result.
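The exact formulas for f(k) and g(k) are not reproduced in this text, so the sketch below is only one plausible reading: it assumes equal weights, that is, f(k) is the arithmetic mean of the n recorded confidence levels and g(k) is the arithmetic mean of the reciprocals of the n recorded rankings. The function names and the sample values are hypothetical.

```python
def mean_confidence(confidences):
    """Assumed form of f(k): plain mean of label k's confidence levels over n results."""
    return sum(confidences) / len(confidences)

def mean_reciprocal_rank(ranks):
    """Assumed form of g(k): plain mean of 1 / ranking of label k over n results."""
    return sum(1 / rank for rank in ranks) / len(ranks)

# Hypothetical values for one label of one data item over n = 3 matching results.
f_k = mean_confidence([0.6, 0.8, 1.0])
g_k = mean_reciprocal_rank([1, 2, 2])
```

Under this assumption, a label ranked first in every matching result has g(k) = 1, which is consistent with the value g(1) = 1 used for data a in the worked example.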
S152, for each data item of the data set to be modified, taking the label with the highest amendment confidence level as the new data label of that data item.
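Step S152 amounts to taking the arg-max of S(k) over all labels. The sketch below applies it to data e from the example of step S130, under the simplifying assumption that only one matching result is available, so that f(k) and g(k) equal the single recorded confidence level and the reciprocal of the single recorded ranking. The function name `corrected_label` is hypothetical, and α = 0.4 and β = 0.3 are the values used in the worked example for data a.

```python
def corrected_label(f, g, preset_label, alpha, beta):
    """Return the label with the highest amendment confidence level, plus all scores."""
    scores = {
        k: alpha * f[k] + (1 - alpha) * g[k] + beta * (1 if k == preset_label else 0)
        for k in f
    }
    return max(scores, key=scores.get), scores

# Data e: confidence levels 0.9, 0.5, 0.4; rankings 1, 2, 3; preset label is label 2.
f_e = {"label 1": 0.9, "label 2": 0.5, "label 3": 0.4}
g_e = {"label 1": 1 / 1, "label 2": 1 / 2, "label 3": 1 / 3}
label_e, scores_e = corrected_label(f_e, g_e, "label 2", alpha=0.4, beta=0.3)
```

With these numbers label 1 scores about 0.96 and the preset label 2 about 0.80, so the preset label of data e would be replaced by label 1; a larger β would make the method more reluctant to override the original annotation.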
The method for correcting data labels disclosed by the embodiments of the present invention trains a matching model on the data in the training set of the data set to be modified and uses the matching model to calculate the matching results between the data in the test set of the data set to be modified and the labels. While the number of matching results obtained has not reached the preset value, the training set and test set are regrouped into a new training set and a new test set, from which a new matching model and a new matching result are obtained. Once the number of matching results obtained reaches the preset value, the amendment confidence level is calculated and the new data label of each data item in the data set to be modified is updated, thereby correcting the data labels of the data of the data set to be modified. This solves the technical problem in the prior art that the quality of data annotation is difficult to guarantee, effectively improves the reliability of the data labels of the data, and improves the accuracy of predictions made by the resulting machine learning model.
An embodiment of the present invention also provides a device for correcting data labels. Referring to Fig. 4, the correction device 20 includes a loading module 21, a training module 22, a matching module 23, an update module 24 and a correction module 25.
The loading module 21 is used for loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and the test set are annotated with preset data labels.
The training module 22 is used for training a machine learning model based on the current training set to obtain a matching model.
The matching module 23 is used for inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The update module 24 is used for constructing a new training set and a new test set based on the data of the data set to be modified when the number of matching results obtained has not reached a preset value.
The correction module 25 is used for calculating, when the number of matching results obtained reaches the preset value, the amendment confidence level of each label for each data item in the data set to be modified by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified.
The working process of the correction device 20 is as described for the above method for correcting data labels and is not repeated here.
The device for correcting data labels disclosed by the embodiments of the present invention trains a matching model on the data in the training set of the data set to be modified and uses the matching model to calculate the matching results between the data in the test set of the data set to be modified and the labels. While the number of matching results obtained has not reached the preset value, the training set and test set are regrouped into a new training set and a new test set, from which a new matching model and a new matching result are obtained. Once the number of matching results obtained reaches the preset value, the amendment confidence level is calculated and the new data label of each data item in the data set to be modified is updated, thereby correcting the data labels of the data of the data set to be modified. This solves the technical problem in the prior art that the quality of data annotation is difficult to guarantee, effectively improves the reliability of the data labels of the data, and improves the accuracy of predictions made by the resulting machine learning model.
An embodiment of the present invention also provides another device for correcting data labels. As shown in Fig. 5, the correction device 30 includes a processor 31, a memory 32, and a computer program that is stored in the memory and can run on the processor, such as a data label correction program. When executing the computer program, the processor 31 implements the steps in each of the above correction method embodiments, such as step S120 shown in Fig. 1. Alternatively, when executing the computer program, the processor implements the functions of the modules in each of the above device embodiments, such as those of the device for correcting data labels described in the above embodiment.
Illustratively, the computer program may be divided into one or more modules, which are stored in the memory 32 and executed by the processor 31 to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program in the correction device 30. For example, the computer program may be divided into a loading module, a training module, a matching module, an update module and a correction module, whose specific functions are as follows. The loading module is used for loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and the test set are annotated with preset data labels. The training module is used for training a machine learning model based on the current training set to obtain a matching model. The matching module is used for inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label. The update module is used for constructing a new training set and a new test set based on the data of the data set to be modified when the number of matching results obtained has not reached a preset value. The correction module is used for calculating, when the number of matching results obtained reaches the preset value, the amendment confidence level of each label for each data item in the data set to be modified by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be modified.
The correction device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The correction device 30 may include, but is not limited to, the processor 31 and the memory 32. Those skilled in the art will understand that the schematic diagram is only an example of a device for correcting data labels and does not constitute a limitation on the correction device 30, which may include more or fewer components than illustrated, combine certain components, or use different components. For example, the correction device 30 may also include input-output devices, network access devices, buses and the like.
The processor 31 may be a central processing unit (CPU), or another general-purpose processor, a digital
signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array
(FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware
component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The processor 31 is the control center of the data label correction device 30 and connects the various parts
of the entire data label correction device 30 through various interfaces and lines.
The memory 32 may be used to store the computer program and/or modules. The processor 31 implements the
various functions of the data label correction device 30 by running or executing the computer program and/or
modules stored in the memory 32 and by calling data stored in the memory 32. The memory 32 may mainly include
a program storage area and a data storage area. The program storage area may store an operating system and the
application programs required by at least one function (such as a sound playback function or an image playback
function); the data storage area may store data created through use of a mobile phone (such as audio data or
a phone book). In addition, the memory 32 may include high-speed random access memory and may also include
non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a
Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or
another volatile solid-state storage device.
If the modules integrated in the data label correction device 30 are implemented as software functional units
and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on
this understanding, all or part of the processes of the method embodiments above may also be implemented by a
computer program instructing the relevant hardware. The computer program may be stored in a computer-readable
storage medium and, when executed by a processor, implements the steps of each of the method embodiments
above. The computer program includes computer program code, which may be in source code form, object code
form, an executable file, or some intermediate form. The computer-readable medium may include any entity or
device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard
disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM),
an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on.
In the data label correction device and storage medium disclosed in the embodiments of the present invention,
a matching model is trained on the data in the training set of the data set to be corrected, and the matching
model is used to compute matching results between the data in the test set of the data set to be corrected and
the labels. While the number of matching results obtained has not reached the preset value, the data are
recombined into a new training set and a new test set to obtain a new matching model and a new matching
result. Once the number of matching results obtained reaches the preset value, the correction confidence is
calculated and the new data label of each data item in the data set to be corrected is updated, thereby
correcting the data labels of the data in the data set to be corrected. This addresses the technical problem in
the prior art that the quality of data annotation is difficult to guarantee, improves the reliability of the
data labels, and improves the prediction accuracy of the resulting machine learning model.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill
in the art may make various improvements and refinements without departing from the principle of the present
invention, and these improvements and refinements are also regarded as falling within the protection scope of
the present invention.
Claims (10)
1. A data label correction method, characterized by comprising the steps of:
loading a data set to be corrected, wherein the data set to be corrected includes a training set and a test
set, and the data of the training set and the test set are annotated with preset data labels;
training a machine learning model on the current training set to obtain a matching model;
feeding the data of the current test set into the current matching model as input data and obtaining the
matching result output by the current matching model, so as to update the data label of each input data item,
wherein, for each input data item, the matching result records the confidence of each label;
when the number of matching results obtained has not reached a preset value, constructing a new training set
and a new test set based on the data set to be corrected;
when the number of matching results obtained reaches the preset value, calculating, by combining the matching
results obtained with the preset data labels, the correction confidence of each label for each data item in
the data set to be corrected, so as to correct the data label of each data item in the data set to be
corrected.
2. The correction method of claim 1, characterized in that the matching result output by the current matching
model also records, for each input data item, the confidence rank of each label.
3. The correction method of claim 2, characterized in that calculating, by combining the matching results
obtained with the preset data labels, the correction confidence of each label for each data item in the data
set to be corrected, so as to correct the data label of each data item in the data set to be corrected,
comprises:
for each data item of the data set to be corrected, obtaining the confidence and the confidence rank of each
label from the matching results obtained, calculating a weighted value of the confidence and the confidence
rank of each label, and combining it with the preset data label to obtain the correction confidence of each
label;
for each data item of the data set to be corrected, taking the label with the highest correction confidence as
the new data label of that data item.
4. The correction method of claim 3, characterized in that, for any data item of the data set to be corrected,
the correction confidence, the confidence of the label, the confidence rank of the label and the preset data
label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
where α and β are constants, k is any label, S(k) is the correction confidence of label k, f(k) is the
weighted average of the confidence of label k, and g(k) is the weighted average of the reciprocal of the
confidence rank of label k; if the preset data label of the data item is label k, then h(k) = 1; if the preset
data label of the data item is not label k, then h(k) = 0.
5. The correction method of claim 1, characterized in that constructing a new training set and a new test set
based on the data set to be corrected comprises the step of:
taking the data whose new data label is consistent with the preset data label as data in the new training
set.
6. The correction method of claim 1, characterized in that constructing a new training set and a new test set
based on the data of the data set to be corrected comprises the steps of:
taking the current test set as the original test set and the current training set as the original training
set;
taking the data that belong to the original test set and whose new data label is consistent with the data
label before the most recent update as training component one, and the other data of the original test set as
test component one;
obtaining, from the data belonging to the original training set, data equal in number to training component
one as test component two, and taking the other data of the original training set as training component two;
merging the data of training component one and training component two as the new training set;
merging the data of test component one and test component two as the new test set.
7. The correction method of claim 6, characterized in that obtaining, from the data belonging to the original
training set, data equal in number to training component one as test component two, and taking the other data
of the original training set as training component two, comprises the step of:
randomly obtaining, from the data belonging to the original training set, data equal in number to training
component one as test component two, and taking the other data of the original training set as training
component two.
8. A data label correction device, characterized by comprising a loading module, a training module, a matching
module, an update module and a correction module;
the loading module is used to load a data set to be corrected, wherein the data set to be corrected includes
a training set and a test set, and the data of the training set and the test set are annotated with preset
data labels;
the training module is used to train a machine learning model on the current training set to obtain a matching
model;
the matching module is used to feed the data of the current test set into the current matching model as input
data and obtain the matching result output by the current matching model, so as to update the data label of
each input data item, wherein, for each input data item, the matching result records the confidence of each
label;
the update module is used to construct a new training set and a new test set based on the data of the data set
to be corrected when the number of matching results obtained has not reached a preset value;
the correction module is used, when the number of matching results obtained reaches the preset value, to
calculate, by combining the matching results obtained with the preset data labels, the correction confidence
of each label for each data item in the data set to be corrected, so as to correct the data label of each data
item in the data set to be corrected.
9. A data label correction device, comprising a processor, a memory, and a computer program stored in the
memory and configured to be executed by the processor, wherein the processor, when executing the computer
program, implements the data label correction method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a
stored computer program, wherein, when the computer program runs, the device on which the computer-readable
storage medium resides is controlled to execute the data label correction method of any one of claims 1 to 7.
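The relationship recited in claim 4 can be illustrated with a short sketch. This is a minimal sketch, assuming uniform weights for the averages f(k) and g(k) (the claim does not fix the weights); the function name `correction_confidence` and the default values of `alpha` and `beta` are hypothetical and chosen only for illustration.

```python
def correction_confidence(results, preset_label, alpha=0.5, beta=0.1):
    """Compute S(k) = alpha*f(k) + (1 - alpha)*g(k) + beta*h(k) for one data item.

    results      : list of dicts label -> confidence, one per matching result
                   in which this data item appeared as input
    preset_label : the label originally annotated on the data item
    """
    labels = set().union(*results)
    scores = {}
    for k in labels:
        confs, inv_ranks = [], []
        for r in results:
            if k not in r:
                continue
            ranking = sorted(r, key=r.get, reverse=True)  # rank 1 = most confident
            confs.append(r[k])
            inv_ranks.append(1.0 / (ranking.index(k) + 1))
        f = sum(confs) / len(confs)          # average confidence (uniform weights)
        g = sum(inv_ranks) / len(inv_ranks)  # average reciprocal rank
        h = 1.0 if k == preset_label else 0.0
        scores[k] = alpha * f + (1 - alpha) * g + beta * h
    return scores


# Two matching results for one data item whose preset label is "dog".
scores = correction_confidence(
    [{"cat": 0.7, "dog": 0.3}, {"cat": 0.4, "dog": 0.6}], preset_label="dog"
)
new_label = max(scores, key=scores.get)
```

In this example the two labels tie on average reciprocal rank, "cat" has the higher average confidence, and the β term for the preset label tips the result to "dog"; with a smaller β the same inputs could favor "cat".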
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810649534.5A CN108897829B (en) | 2018-06-22 | 2018-06-22 | Data label correction method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810649534.5A CN108897829B (en) | 2018-06-22 | 2018-06-22 | Data label correction method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108897829A (en) | 2018-11-27 |
CN108897829B (en) | 2020-08-04 |
Family
ID=64345558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810649534.5A Active CN108897829B (en) | 2018-06-22 | 2018-06-22 | Data label correction method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108897829B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829490A (en) * | 2019-01-22 | 2019-05-31 | 上海鹰瞳医疗科技有限公司 | Modification vector searching method, objective classification method and equipment |
CN109977255A (en) * | 2019-02-22 | 2019-07-05 | 北京奇艺世纪科技有限公司 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
CN109982137A (en) * | 2019-02-22 | 2019-07-05 | 北京奇艺世纪科技有限公司 | Model generating method, video marker method, apparatus, terminal and storage medium |
CN110008372A (en) * | 2019-02-22 | 2019-07-12 | 北京奇艺世纪科技有限公司 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
CN110324726A (en) * | 2019-05-29 | 2019-10-11 | 北京奇艺世纪科技有限公司 | Model generation, method for processing video frequency, device, electronic equipment and storage medium |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN110990627A (en) * | 2019-12-05 | 2020-04-10 | 北京奇艺世纪科技有限公司 | Knowledge graph construction method and device, electronic equipment and medium |
CN111160484A (en) * | 2019-12-31 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer readable storage medium and electronic equipment |
CN111753174A (en) * | 2020-06-23 | 2020-10-09 | 北京字节跳动网络技术有限公司 | Data processing method and device and electronic equipment |
CN112016613A (en) * | 2020-08-26 | 2020-12-01 | 广州市百果园信息技术有限公司 | Training method and device for video content classification model, computer equipment and medium |
CN113342799A (en) * | 2021-08-09 | 2021-09-03 | 明品云(北京)数据科技有限公司 | Data correction method and system |
CN113496232A (en) * | 2020-03-18 | 2021-10-12 | 杭州海康威视数字技术股份有限公司 | Label checking method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100138712A1 (en) * | 2008-12-01 | 2010-06-03 | Changki Lee | Apparatus and method for verifying training data using machine learning |
CN106033425A (en) * | 2015-03-11 | 2016-10-19 | 富士通株式会社 | A data processing device and a data processing method |
CN106951925A (en) * | 2017-03-27 | 2017-07-14 | 成都小多科技有限公司 | Data processing method, device, server and system |
CN107368892A (en) * | 2017-06-07 | 2017-11-21 | 无锡小天鹅股份有限公司 | Model training method and device based on machine learning |
CN108021931A (en) * | 2017-11-20 | 2018-05-11 | 阿里巴巴集团控股有限公司 | A kind of data sample label processing method and device |
CN108062394A (en) * | 2017-12-18 | 2018-05-22 | 北京中关村科金技术有限公司 | The mask method and relevant apparatus of a kind of data set |
CN108171335A (en) * | 2017-12-06 | 2018-06-15 | 东软集团股份有限公司 | Choosing method, device, storage medium and the electronic equipment of modeling data |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100138712A1 (en) * | 2008-12-01 | 2010-06-03 | Changki Lee | Apparatus and method for verifying training data using machine learning |
CN106033425A (en) * | 2015-03-11 | 2016-10-19 | 富士通株式会社 | A data processing device and a data processing method |
CN106951925A (en) * | 2017-03-27 | 2017-07-14 | 成都小多科技有限公司 | Data processing method, device, server and system |
CN107368892A (en) * | 2017-06-07 | 2017-11-21 | 无锡小天鹅股份有限公司 | Model training method and device based on machine learning |
CN108021931A (en) * | 2017-11-20 | 2018-05-11 | 阿里巴巴集团控股有限公司 | A kind of data sample label processing method and device |
CN108171335A (en) * | 2017-12-06 | 2018-06-15 | 东软集团股份有限公司 | Choosing method, device, storage medium and the electronic equipment of modeling data |
CN108062394A (en) * | 2017-12-18 | 2018-05-22 | 北京中关村科金技术有限公司 | The mask method and relevant apparatus of a kind of data set |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829490B (en) * | 2019-01-22 | 2022-03-22 | 上海鹰瞳医疗科技有限公司 | Correction vector searching method, target classification method and device |
CN109829490A (en) * | 2019-01-22 | 2019-05-31 | 上海鹰瞳医疗科技有限公司 | Modification vector searching method, objective classification method and equipment |
CN109977255A (en) * | 2019-02-22 | 2019-07-05 | 北京奇艺世纪科技有限公司 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
CN109982137A (en) * | 2019-02-22 | 2019-07-05 | 北京奇艺世纪科技有限公司 | Model generating method, video marker method, apparatus, terminal and storage medium |
CN110008372A (en) * | 2019-02-22 | 2019-07-12 | 北京奇艺世纪科技有限公司 | Model generating method, audio-frequency processing method, device, terminal and storage medium |
CN110324726A (en) * | 2019-05-29 | 2019-10-11 | 北京奇艺世纪科技有限公司 | Model generation, method for processing video frequency, device, electronic equipment and storage medium |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN110717039B (en) * | 2019-09-17 | 2023-10-13 | 平安科技(深圳)有限公司 | Text classification method and apparatus, electronic device, and computer-readable storage medium |
CN110990627A (en) * | 2019-12-05 | 2020-04-10 | 北京奇艺世纪科技有限公司 | Knowledge graph construction method and device, electronic equipment and medium |
CN111160484A (en) * | 2019-12-31 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer readable storage medium and electronic equipment |
CN111160484B (en) * | 2019-12-31 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Data processing method, data processing device, computer readable storage medium and electronic equipment |
CN113496232A (en) * | 2020-03-18 | 2021-10-12 | 杭州海康威视数字技术股份有限公司 | Label checking method and device |
CN113496232B (en) * | 2020-03-18 | 2024-05-28 | 杭州海康威视数字技术股份有限公司 | Label verification method and device |
CN111753174A (en) * | 2020-06-23 | 2020-10-09 | 北京字节跳动网络技术有限公司 | Data processing method and device and electronic equipment |
CN112016613A (en) * | 2020-08-26 | 2020-12-01 | 广州市百果园信息技术有限公司 | Training method and device for video content classification model, computer equipment and medium |
CN112016613B (en) * | 2020-08-26 | 2024-08-13 | 广州市百果园信息技术有限公司 | Training method and device for video content classification model, computer equipment and medium |
CN113342799A (en) * | 2021-08-09 | 2021-09-03 | 明品云(北京)数据科技有限公司 | Data correction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN108897829B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108897829A (en) | Modification method, device and the storage medium of data label | |
CN109242013A (en) | A kind of data mask method, device, electronic equipment and storage medium | |
CN110969600A (en) | Product defect detection method and device, electronic equipment and storage medium | |
WO2016095068A1 (en) | Pedestrian detection apparatus and method | |
WO2020253038A1 (en) | Model construction method and apparatus | |
EP3649582A1 (en) | System and method for automatic building of learning machines using learning machines | |
US20200090076A1 (en) | Non-transitory computer-readable recording medium, prediction method, and learning device | |
CN112861934A (en) | Image classification method and device of embedded terminal and embedded terminal | |
CN108833592A (en) | Cloud host schedules device optimization method, device, equipment and storage medium | |
CN108681505A (en) | A kind of Test Case Prioritization method and apparatus based on decision tree | |
CN111489003B (en) | Life cycle prediction method and device | |
CN113761026A (en) | Feature selection method, device, equipment and storage medium based on conditional mutual information | |
CN110134598A (en) | A kind of batch processing method, apparatus and system | |
CN117592905A (en) | Data generation method, device, equipment and storage medium | |
CN103049629A (en) | Method and device for detecting noise data | |
CN116365511A (en) | Active power distribution network model construction method, device, terminal and storage medium | |
CN112434817B (en) | Method, apparatus and computer storage medium for constructing communication algorithm database | |
Mohammadi et al. | Machine learning assisted stochastic unit commitment: A feasibility study | |
CN111950753A (en) | Scenic spot passenger flow prediction method and device | |
CN112733453B (en) | Equipment predictive maintenance method and device based on joint learning | |
CN115423159A (en) | Photovoltaic power generation prediction method and device and terminal equipment | |
CN112733454B (en) | Equipment predictive maintenance method and device based on joint learning | |
CN107748711A (en) | Method, terminal device and the storage medium of Automatic Optimal Storm degree of parallelisms | |
CN103955449A (en) | Target sample positioning method and device | |
CN116069767A (en) | Equipment data cleaning method and device, computer equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||