CN108897829A - Modification method, device and the storage medium of data label - Google Patents

Modification method, device and the storage medium of data label

Info

Publication number
CN108897829A
Authority
CN
China
Prior art keywords
data
label
confidence level
modified
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810649534.5A
Other languages
Chinese (zh)
Other versions
CN108897829B (en)
Inventor
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Multi Benefit Network Co Ltd
Guangzhou Duoyi Network Co Ltd
Original Assignee
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Multi Benefit Network Co Ltd
Guangzhou Duoyi Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD, Multi Benefit Network Co Ltd, Guangzhou Duoyi Network Co Ltd
Priority to CN201810649534.5A
Publication of CN108897829A
Application granted
Publication of CN108897829B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a correction method for data labels and relates to the field of machine learning. The method comprises the steps of: loading a data set to be corrected; training a machine learning model to obtain a matching model; feeding the data of the test set into the matching model as input data and obtaining the matching result output by the matching model, so as to update the data label of each input data item; when the number of matching results has not reached a preset value, constructing a new training set and a new test set from the data of the data set to be corrected; and when the number of matching results reaches the preset value, calculating, for each data item in the data set to be corrected, a correction confidence for each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected. The invention also provides a correction device for data labels and a storage medium. The invention can effectively improve the reliability of the data labels and thereby the prediction accuracy of the resulting machine learning model.

Description

Correction method, device and storage medium for data labels
Technical field
The present invention relates to the field of machine learning, and in particular to a correction method, a device and a storage medium for data labels.
Background art
In supervised learning tasks, training a machine learning or deep learning system requires a large amount of data annotated with corresponding data labels. In general, the more data is used for training and the higher its annotation quality, the better the trained model reflects the real situation and the more reliable its predictions on unknown data.
To improve the annotation quality of data, labels that match the data must be found. In the prior art, common technical means for improving annotation quality include manual annotation, cross-entropy screening, information retrieval and data discarding. Manual annotation labels the data by hand. Cross-entropy screening divides the original corpus into multiple small sets, calculates the cross entropy of each small set, and treats the annotation of the set with the smallest entropy as reliable. Information retrieval relies on a fixed test set and retrieves relevant information to serve as the training set. Data discarding discards data whose similarity and matching result disagree.
In the course of implementing the embodiments of the present invention, the inventor found that manual annotation depends on the knowledge and energy of staff, cross-entropy screening loses data information, information retrieval depends on a fixed test set and the quality of the retrieved data is hard to guarantee, and data discarding carries the risk of losing important data. As a result, the data labels of existing data are of low reliability, and the prediction accuracy of the resulting machine learning model is not high.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a correction method, a device and a storage medium for data labels that can effectively improve the reliability of the data labels and thereby the prediction accuracy of the resulting machine learning model.
To achieve the above object, an embodiment of the present invention provides a correction method for data labels, comprising the steps of:
loading a data set to be corrected, wherein the data set to be corrected includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels;
training a machine learning model on the current training set to obtain a matching model;
inputting the data of the current test set into the current matching model as input data and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label;
when the number of matching results obtained has not reached a preset value, constructing a new training set and a new test set from the data of the data set to be corrected;
when the number of matching results obtained reaches the preset value, calculating, for each data item in the data set to be corrected, the correction confidence of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected.
As an improvement of the above scheme, the matching result output by the current matching model further records, for each input data item, the confidence ranking of each label.
As an improvement of the above scheme, calculating, for each data item in the data set to be corrected, the correction confidence of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected, comprises:
for each data item of the data set to be corrected, obtaining the confidence level and the confidence ranking of each label from the matching results obtained, calculating weighted values of the confidence level and of the confidence ranking of each label, and combining them with the preset data label to obtain the correction confidence of each label;
for each data item of the data set to be corrected, taking the label with the highest correction confidence as the new data label of that data item.
As an improvement of the above scheme, for any data item of the data set to be corrected, the correction confidence, the confidence level of the label, the confidence ranking of the label and the preset data label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
In the formula, α and β are constants, k is any label, S(k) is the correction confidence of label k, f(k) is the weighted average of the confidence level of label k, and g(k) is the weighted average of the reciprocal of the confidence ranking of label k. If the preset data label of the data item is label k, then h(k) = 1; if the preset data label of the data item is not label k, then h(k) = 0.
As an improvement of the above scheme, constructing a new training set and a new test set from the data of the data set to be corrected comprises the step of:
taking the data whose new data label is consistent with the preset data label as data in the new training set.
As an improvement of the above scheme, constructing a new training set and a new test set from the data of the data set to be corrected comprises the steps of:
taking the current test set as the former test set and the current training set as the former training set;
taking the data that belong to the former test set and whose new data label is consistent with the data label before the most recent update as training component one, and the other data of the former test set as test component one;
obtaining, from the data belonging to the former training set, as many data items as there are in training component one to serve as test component two, and taking the other data of the former training set as training component two;
merging the data of training component one and training component two into the new training set;
merging the data of test component one and test component two into the new test set.
As an improvement of the above scheme, obtaining, from the data belonging to the former training set, as many data items as there are in training component one comprises the step of:
randomly obtaining, from the data belonging to the former training set, as many data items as there are in training component one.
An embodiment of the present invention also provides a correction device for data labels, comprising a loading module, a training module, a matching module, an update module and a correction module.
The loading module is used to load a data set to be corrected, wherein the data set to be corrected includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels.
The training module is used to train a machine learning model on the current training set to obtain a matching model.
The matching module is used to input the data of the current test set into the current matching model as input data and to obtain the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The update module is used to construct a new training set and a new test set from the data of the data set to be corrected when the number of matching results obtained has not reached a preset value.
The correction module is used, when the number of matching results obtained reaches the preset value, to calculate, for each data item in the data set to be corrected, the correction confidence of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected.
An embodiment of the present invention also provides a correction device for data labels, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the correction method for data labels described in any of the above when executing the computer program.
An embodiment of the present invention also provides a computer-readable storage medium that includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the correction method for data labels described in any of the above.
Compared with the prior art, the correction method, device and storage medium for data labels disclosed by the present invention train a matching model on the data in the training set of the data set to be corrected and use the matching model to compute the matching result between the data in the test set and the labels. While the number of matching results obtained has not reached the preset value, the data are regrouped into a new training set and a new test set so as to obtain a new matching model and a new matching result. Once the number of matching results obtained reaches the preset value, the correction confidence is calculated and the new data label of each data item in the data set to be corrected is updated, thereby correcting the data labels of the data set to be corrected. This solves the technical problem in the prior art that the quality of data annotation is hard to guarantee, effectively improves the reliability of the data labels, and improves the prediction accuracy of the resulting machine learning model.
Brief description of the drawings
Fig. 1 is a flow diagram of a correction method for data labels in an embodiment of the present invention.
Fig. 2 is a flow diagram of step S140 of the correction method shown in Fig. 1.
Fig. 3 is a flow diagram of step S150 of the correction method shown in Fig. 1.
Fig. 4 is a structural diagram of a correction device for data labels in an embodiment of the present invention.
Fig. 5 is a structural diagram of another correction device for data labels in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flow diagram of a correction method for data labels provided by an embodiment of the present invention. The correction method provided by Embodiment 1 of the present invention includes steps S110 to S150.
S110: load a data set to be corrected, wherein the data set to be corrected includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels.
Each data item in the data set to be corrected belongs to either the training set or the test set.
S120: train a machine learning model on the current training set to obtain a matching model.
The current training set may be the preset training set in the data set to be corrected, or a new training set constructed in step S140; neither choice affects the beneficial effects obtainable by the present invention.
The machine learning model may be a preset learning model or a learning model chosen according to the actual situation, such as a single machine learning model or an ensemble machine learning model; neither choice affects the beneficial effects obtainable by the present invention.
S130: input the data of the current test set into the current matching model as input data and obtain the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The current test set may be the preset test set in the data set to be corrected, or a new test set constructed in step S140; neither choice affects the beneficial effects obtainable by the present invention.
For example, let the data set to be corrected be A, the training set be Aa1 and the test set be Ab1, and assume A = {a, b, c, d, e}, Aa1 = {a, b, c} and Ab1 = {d, e}. The label types are label 1, label 2 and label 3: the preset data label of data a and data b is label 1, that of data c and data e is label 2, and that of data d is label 3. The machine learning model is trained on Aa1 to obtain matching model B1. With B1 as the current matching model and Ab1 as the current test set, the data of Ab1 are input into B1 as input data, and the matching result C1 output by B1 is obtained. For each input data item, C1 records the confidence level of each label. That is, C1 records the confidence with which data d matches label 1, label 2 and label 3 respectively. Assume these confidence levels are 0.6, 0.7 and 0.8. The confidence 0.8 of label 3 is the highest, so label 3 becomes the new data label of data d, and the new data label of data d is consistent with its preset data label. C1 likewise records the confidence with which data e matches label 1, label 2 and label 3 respectively. Assume these confidence levels are 0.9, 0.5 and 0.4. The confidence 0.9 of label 1 is the highest, so label 1 becomes the new data label of data e, and the new data label of data e is inconsistent with its preset data label. It should be understood that this example is only one illustration of how the present invention may be implemented. In other cases, corresponding changes may be made according to the actual situation, for example including more data in the data set to be corrected or changing how confidence is represented, without affecting the beneficial effects obtainable by the present invention.
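The label update in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout of the matching result and the names `confidences`, `preset_labels` and `update_labels` are hypothetical and do not come from the patent; the numbers are those of the example for data d and data e.

```python
# Hypothetical layout of matching result C1: item -> {label: confidence}.
confidences = {
    "d": {"label 1": 0.6, "label 2": 0.7, "label 3": 0.8},
    "e": {"label 1": 0.9, "label 2": 0.5, "label 3": 0.4},
}
preset_labels = {"d": "label 3", "e": "label 2"}


def update_labels(confidences):
    """Return, for each input data item, the label with the highest confidence."""
    return {item: max(scores, key=scores.get) for item, scores in confidences.items()}


new_labels = update_labels(confidences)
# True where the new label agrees with the preset label.
agrees = {item: new_labels[item] == preset_labels[item] for item in new_labels}
```

Here data d keeps label 3, while data e moves from label 2 to label 1, which matches the outcome described in the example.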
As a preferred embodiment, the matching result may also record, for each input data item, the confidence ranking of each label. Continuing the example above, C1 also records the confidence ranking of each label for data d: the confidence of label 1 is 0.6 and its confidence ranking is 3, the confidence of label 2 is 0.7 and its confidence ranking is 2, and the confidence of label 3 is 0.8 and its confidence ranking is 1. C1 also records the confidence ranking of each label for data e: the confidence ranking of label 1 is 1, that of label 2 is 2, and that of label 3 is 3.
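As a minimal sketch of how such a ranking could be derived from the confidence levels, the following Python function sorts the labels by descending confidence. The function name `confidence_ranking` is hypothetical, and the handling of ties (labels with equal confidence keep their input order) is an assumption, since the text does not say how ties are ranked.

```python
def confidence_ranking(scores):
    """Rank labels by descending confidence; rank 1 is the most confident label.

    Ties keep their input order (an assumption; the source does not address ties).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {label: rank for rank, label in enumerate(ordered, start=1)}


# Confidence levels of data d and data e from the example.
ranks_d = confidence_ranking({"label 1": 0.6, "label 2": 0.7, "label 3": 0.8})
ranks_e = confidence_ranking({"label 1": 0.9, "label 2": 0.5, "label 3": 0.4})
```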
S140: when the number of matching results obtained has not reached the preset value, construct a new training set and a new test set from the data of the data set to be corrected.
The preset value may be a specific number such as 4, 5 or 6, or any larger or smaller number; this does not affect the beneficial effects obtainable by the present invention.
As a preferred embodiment, the data whose new data label is consistent with the preset data label may be used as data in the new training set. For example, in the example of step S130, the new data label of data d is consistent with its preset data label, so data d becomes data in the new training set.
As a further preferred embodiment, referring to Fig. 2, step S140 may take the current test set as the former test set and the current training set as the former training set, and includes steps S141 to S144.
S141: take the data that belong to the former test set and whose new data label is consistent with the data label before the most recent update as training component one, and the other data of the former test set as test component one.
For any data item in the former test set, if its data label has been updated only once, the data item is placed in training component one when its new data label is consistent with its preset data label. If its data label has been updated two or more times, for example three times, the data item is placed in training component one when its new data label is consistent with the data label it had before the third update.
S142: obtain, from the data belonging to the former training set, as many data items as there are in training component one to serve as test component two, and take the other data of the former training set as training component two.
Preferably, according to the number of data items in training component one obtained in step S141, the same number of data items is drawn at random from the former training set to serve as test component two, and the other data of the former training set serve as training component two.
S143: merge the data of training component one and training component two into the new training set.
S144: merge the data of test component one and test component two into the new test set.
It should be understood that steps S143 and S144 may be executed in either order, for example step S144 before step S143, or simultaneously, without affecting the beneficial effects obtainable by the present invention.
For example, continuing the example of step S130, before step S140 is executed, Aa1 = {a, b, c} and Ab1 = {d, e}. After step S130 is executed, the new data label of data d is consistent with its data label before the update, so after step S140 is executed data d becomes data in the new training set Aa2. One possible result is the new training set Aa2 = {a, c, d} and the new test set Ab2 = {b, e}.
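Steps S141 to S144 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the function `rebuild_split`, its argument names, and the use of item identifiers that are unique and hashable are all assumptions. The random draw in S142 follows the preferred embodiment.

```python
import random


def rebuild_split(train, test, new_labels, previous_labels, rng=None):
    """Build a new (train, test) pair from the former sets (sketch of S141-S144)."""
    rng = rng or random.Random()
    # S141: former test items whose label survived the latest update move to training.
    train_part_1 = [x for x in test if new_labels[x] == previous_labels[x]]
    test_part_1 = [x for x in test if new_labels[x] != previous_labels[x]]
    # S142: draw the same number of items at random from the former training set.
    count = min(len(train_part_1), len(train))
    test_part_2 = rng.sample(train, count)
    moved = set(test_part_2)
    train_part_2 = [x for x in train if x not in moved]
    # S143 and S144: merge the components.
    return train_part_2 + train_part_1, test_part_1 + test_part_2


# Data from the example: d keeps label 3, e changes from label 2 to label 1.
new_labels = {"d": "label 3", "e": "label 1"}
previous_labels = {"d": "label 3", "e": "label 2"}
new_train, new_test = rebuild_split(
    ["a", "b", "c"], ["d", "e"], new_labels, previous_labels, rng=random.Random(0)
)
```

Data d always ends up in the new training set and data e in the new test set; which of a, b or c moves to the test set depends on the random draw, so {a, c, d} and {b, e} is one of three possible outcomes. Both sets keep their original sizes.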
After step S140 is executed, steps S120 and S130 can be executed again to obtain a new matching result. While the number of matching results obtained has not reached the preset value, steps S120, S130 and S140 are repeated until the number of matching results obtained reaches the preset value.
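The repetition of steps S120, S130 and S140 can be sketched as a loop that stops once enough matching results have been collected. Everything named here is hypothetical: `collect_matching_results`, the callables `train_model`, `match` and `rebuild`, and the parameter `preset_count` are placeholders, and the toy stand-ins below exist only so the sketch runs. Training on the most recently updated labels is also an assumption, since the text does not state which labels are used for retraining.

```python
def collect_matching_results(train, test, labels, train_model, match, rebuild, preset_count):
    """Repeat S120-S140 until `preset_count` matching results have been collected."""
    results = []
    current = dict(labels)  # labels as most recently updated
    while True:
        model = train_model(train, current)   # S120
        result = match(model, test)            # S130: item -> {label: confidence}
        results.append(result)
        previous = dict(current)
        for item, scores in result.items():    # update the label of each input item
            current[item] = max(scores, key=scores.get)
        if len(results) >= preset_count:       # enough results: hand over to S150
            return results, current
        train, test = rebuild(train, test, current, previous)  # S140


# Toy stand-ins, only to make the sketch executable.
def train_model(train, labels):
    return {x: labels[x] for x in train}       # "model" = memorised training labels


def match(model, test):
    known = sorted(set(model.values()))
    return {x: {k: 1.0 / len(known) for k in known} for x in test}


def rebuild(train, test, current, previous):
    return test, train                         # naive swap, not steps S141-S144


results, labels = collect_matching_results(
    ["a", "b", "c"], ["d", "e"],
    {"a": 1, "b": 1, "c": 2, "d": 3, "e": 2},
    train_model, match, rebuild, preset_count=4,
)
```

Passing the three steps in as callables keeps the loop independent of the particular learning model and of the particular regrouping strategy, both of which the text leaves open.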
S150: when the number of matching results obtained reaches the preset value, calculate, for each data item in the data set to be corrected, the correction confidence of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected.
Preferably, referring to Fig. 3, step S150 may include steps S151 and S152.
S151: for each data item of the data set to be corrected, obtain the confidence level and the confidence ranking of each label from the matching results obtained, calculate weighted values of the confidence level and of the confidence ranking of each label, and combine them with the preset data label to obtain the correction confidence of each label.
More preferably, for any data item of the data set to be corrected, the correction confidence, the confidence level of the label, the confidence ranking of the label and the preset data label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
In the formula, α and β are constants, k is any label, S(k) is the correction confidence of label k, f(k) is the weighted average of the confidence level of label k, and g(k) is the weighted average of the reciprocal of the confidence ranking of label k. If the preset data label of the data item is label k, then h(k) = 1; if the preset data label of the data item is not label k, then h(k) = 0.
Continuing the example of the preferred embodiment of step S130, the correction confidence of each label is calculated for each data item in A = {a, b, c, d, e}, that is, for data a, b, c, d and e. Taking data a as an example, assume that, in the matching results obtained for data a, the weighted average of the confidence level of label 1 is f(1) = 0.9 and the weighted average of the reciprocal of its confidence ranking is g(1) = 1, and that the preset data label of data a is label 1, so that h(1) = 1. With α = 0.4 and β = 0.3, S(1) = 0.4 × 0.9 + 0.6 × 1 + 0.3 × 1 = 1.26. S(2) for label 2 and S(3) for label 3 of data a, as well as the correction confidence of each label for the other data in A, are obtained in the same way. It should be understood that, in other cases, corresponding changes may be made according to the actual situation, such as choosing different values of α and β, without affecting the beneficial effects obtainable by the present invention.
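The arithmetic for data a can be checked with a short Python sketch of the relationship S(k) = α * f(k) + (1 - α) * g(k) + β * h(k). The function and parameter names are hypothetical; only the formula and the numbers come from the text.

```python
def correction_confidence(f_k, g_k, is_preset_label, alpha, beta):
    """S(k) = alpha * f(k) + (1 - alpha) * g(k) + beta * h(k)."""
    h_k = 1.0 if is_preset_label else 0.0  # h(k) = 1 only for the preset label
    return alpha * f_k + (1 - alpha) * g_k + beta * h_k


# Data a, label 1: f(1) = 0.9, g(1) = 1, h(1) = 1, alpha = 0.4, beta = 0.3.
s1 = correction_confidence(f_k=0.9, g_k=1.0, is_preset_label=True, alpha=0.4, beta=0.3)
```

The three terms contribute 0.36, 0.60 and 0.30, giving S(1) = 1.26 up to floating-point rounding. Because h(k) is 1 only for the preset label, β controls how much the original annotation is favoured over the model output.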
More specifically, for each data item of the data set to be corrected, the weighted value of the confidence level of any label satisfies the following relationship:
[formula not reproduced in the source text]
In the formula, k is any label, f(k) is the weighted average of the confidence level of label k, n is the number of matching results obtained, and the term indexed by i is the confidence level of label k recorded in the i-th matching result.
More specifically, for each data item of the data set to be corrected, the weighted value of the confidence ranking of any label satisfies the following relationship:
[formula not reproduced in the source text]
In the formula, k is any label, g(k) is the weighted average of the reciprocal of the confidence ranking of label k, n is the number of matching results obtained, and the term indexed by i is the confidence ranking of label k recorded in the i-th matching result.
S152: for each data item of the data set to be corrected, take the label with the highest correction confidence as the new data label of that data item.
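Step S152 amounts to taking the arg-max of S(k) over the labels. The sketch below applies it to data e from the earlier example, under the simplifying assumption that C1 is the only matching result for data e, so that f(k) is the recorded confidence and g(k) is the reciprocal of the recorded ranking. The function name `correct_label` and the choice α = 0.4, β = 0.3 are illustrative.

```python
def correct_label(f, g, preset_label, alpha, beta):
    """Return the label with the highest S(k), together with all S(k) values."""
    scores = {
        k: alpha * f[k] + (1 - alpha) * g[k] + beta * (1.0 if k == preset_label else 0.0)
        for k in f
    }
    return max(scores, key=scores.get), scores


# Data e: confidences 0.9, 0.5, 0.4 and rankings 1, 2, 3; preset label is label 2.
f = {"label 1": 0.9, "label 2": 0.5, "label 3": 0.4}
g = {"label 1": 1.0, "label 2": 1 / 2, "label 3": 1 / 3}
best, scores = correct_label(f, g, preset_label="label 2", alpha=0.4, beta=0.3)
```

With these numbers S(1) = 0.96, S(2) = 0.80 and S(3) = 0.36, so the label of data e would be corrected from label 2 to label 1 even though the preset label receives the β bonus.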
The correction method for data labels disclosed by the embodiments of the present invention trains a matching model on the data in the training set of the data set to be corrected and uses the matching model to compute the matching result between the data in the test set and the labels. While the number of matching results obtained has not reached the preset value, the data are regrouped into a new training set and a new test set so as to obtain a new matching model and a new matching result. Once the number of matching results obtained reaches the preset value, the correction confidence is calculated and the new data label of each data item in the data set to be corrected is updated, thereby correcting the data labels of the data set to be corrected. This solves the technical problem in the prior art that the quality of data annotation is hard to guarantee, effectively improves the reliability of the data labels, and improves the prediction accuracy of the resulting machine learning model.
An embodiment of the present invention also provides a correction device for data labels. Referring to Fig. 4, the correction device 20 includes a loading module 21, a training module 22, a matching module 23, an update module 24 and a correction module 25.
The loading module 21 is used to load a data set to be corrected, wherein the data set to be corrected includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels.
The training module 22 is used to train a machine learning model on the current training set to obtain a matching model.
The matching module 23 is used to input the data of the current test set into the current matching model as input data and to obtain the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label.
The update module 24 is used to construct a new training set and a new test set from the data of the data set to be corrected when the number of matching results obtained has not reached the preset value.
The correction module 25 is used, when the number of matching results obtained reaches the preset value, to calculate, for each data item in the data set to be corrected, the correction confidence of each label by combining the matching results obtained with the preset data labels, so as to correct the data label of each data item in the data set to be corrected.
The working process of the correction device 20 is the same as the correction method for data labels described above and is not repeated here.
The correction device for data labels disclosed by the embodiments of the present invention trains a matching model on the data in the training set of the data set to be corrected and uses the matching model to compute the matching result between the data in the test set and the labels. While the number of matching results obtained has not reached the preset value, the data are regrouped into a new training set and a new test set so as to obtain a new matching model and a new matching result. Once the number of matching results obtained reaches the preset value, the correction confidence is calculated and the new data label of each data item in the data set to be corrected is updated, thereby correcting the data labels of the data set to be corrected. This solves the technical problem in the prior art that the quality of data annotation is hard to guarantee, effectively improves the reliability of the data labels, and improves the prediction accuracy of the resulting machine learning model.
The embodiment of the invention also provides the correcting devices of another data label, as shown in figure 5, the amendment of data label Device 30 includes:Processor 31, memory 32 and storage are in the memory and the meter that can run on the processor Calculation machine program, such as the revision program of data label.The processor 31 is realized above-mentioned each when executing the computer program Step in calculation method embodiment, such as step S120 shown in FIG. 1.Alternatively, the processor executes the computer journey The function of each module in above-mentioned each Installation practice, such as the amendment dress of data label described in above-described embodiment are realized when sequence It sets.
Illustratively, the computer program can be divided into one or more modules, one or more of moulds Block is stored in the memory 32, and is executed by the processor 31, to complete the present invention.One or more of modules It can be the series of computation machine program instruction section that can complete specific function, the instruction segment is for describing the computer program Implementation procedure in the correcting device 30 of the data label.For example, the computer program can be divided into loading mould Block, training module, matching module, update module and correction module, each module concrete function are as follows:It is described to insmod, it is used for It is loaded into data set to be modified;Wherein, the data set to be modified includes training set and test set, the training set and the test The data of collection are labeled with preset data label;The training module, for based on current training set to machine learning mould Type is trained, and obtains Matching Model;The matching module, for being inputted the data of current test set as input data Current Matching Model obtains the matching result of the current Matching Model output, to update each input data Data label;Wherein, each input data is directed in the matching result, record has the confidence level of each label;It is described Update module, for when the quantity of the matching result obtained is not up to preset value, the number based on the data set to be modified According to constructing new training set and new test set;The correction module, for being reached when the quantity of the matching result obtained When to the preset value, in conjunction with the matching result obtained and the preset data label, to the correction data to be repaired The each data concentrated calculate the amendment confidence level of each label, to correct each data in the data set to be modified Data label.
The data label correction device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The data label correction device 30 may include, but is not limited to, the processor 31 and the memory 32. Those skilled in the art will understand that the schematic diagram is merely an example of a data label correction device and does not limit the data label correction device 30, which may include more or fewer components than illustrated, combine certain components, or use different components. For example, the data label correction device 30 may also include input/output devices, network access devices, buses and the like.
The processor 31 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 31 is the control center of the data label correction device 30 and connects the various parts of the entire data label correction device 30 through various interfaces and lines.
The memory 32 may be used to store the computer program and/or modules. The processor 31 implements the various functions of the data label correction device 30 by running or executing the computer program and/or modules stored in the memory 32 and by calling the data stored in the memory 32. The memory 32 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created during use (such as audio data or a phone book). In addition, the memory 32 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules integrated in the data label correction device 30 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
The data label correction device and storage medium disclosed in the embodiments of the present invention train a matching model on the data in the training set of the data set to be modified, and use the matching model to compute the matching results between the data in the test set of the data set to be modified and the labels. While the number of matching results obtained has not reached the preset value, the data are recombined into a new training set and a new test set, so as to obtain a new matching model and a new matching result. Once the number of matching results obtained reaches the preset value, the amendment confidence is calculated and a new data label is assigned to each data item in the data set to be modified, thereby correcting the data labels of the data in the data set to be modified. This solves the technical problem in the prior art that the quality of data annotation is difficult to guarantee, effectively improves the reliability of the data labels, and improves the prediction accuracy of the resulting machine learning model.
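The final correction step summarized above can be illustrated with the relationship S(k) = α·f(k) + (1 − α)·g(k) + β·h(k) stated in claim 4. The sketch below rests on several assumptions that the text does not fix: the "weighted averages" f(k) and g(k) are taken with equal weights over all matching results, the values α = 0.7 and β = 0.1 are arbitrary, a label missing from a result contributes zero, and the function names are hypothetical.

```python
from typing import Dict, List


def amendment_confidences(rounds: List[Dict[str, float]], preset_label: str,
                          alpha: float = 0.7, beta: float = 0.1) -> Dict[str, float]:
    """Compute S(k) for one data item.

    rounds: one {label: confidence} mapping per matching result for this item.
    Equal weighting across results is an assumption of this sketch.
    """
    if not rounds:
        raise ValueError("at least one matching result is required")
    labels = {label for r in rounds for label in r} | {preset_label}
    scores: Dict[str, float] = {}
    for k in labels:
        conf_sum = 0.0
        inv_rank_sum = 0.0
        for r in rounds:
            conf_sum += r.get(k, 0.0)
            if k in r:
                # Rank 1 is the label with the highest confidence in this result.
                ranking = sorted(r, key=r.get, reverse=True)
                inv_rank_sum += 1.0 / (ranking.index(k) + 1)
        f_k = conf_sum / len(rounds)        # mean confidence of label k
        g_k = inv_rank_sum / len(rounds)    # mean reciprocal rank of label k
        h_k = 1.0 if k == preset_label else 0.0
        scores[k] = alpha * f_k + (1 - alpha) * g_k + beta * h_k
    return scores


def corrected_label(rounds: List[Dict[str, float]], preset_label: str,
                    alpha: float = 0.7, beta: float = 0.1) -> str:
    """The label with the highest amendment confidence becomes the new data label."""
    scores = amendment_confidences(rounds, preset_label, alpha, beta)
    return max(scores, key=scores.get)


rounds = [{"cat": 0.9, "dog": 0.1}, {"cat": 0.8, "dog": 0.2}]
scores = amendment_confidences(rounds, preset_label="dog")
new_label = corrected_label(rounds, preset_label="dog")
```

In this example the item was originally labeled "dog", but both matching results favor "cat" strongly enough that the β bonus for the preset label does not prevent the correction.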
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. A data label correction method, characterized by comprising the steps of:
loading a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels;
training a machine learning model on the current training set to obtain a matching model;
feeding the data of the current test set as input data into the current matching model and obtaining the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label;
when the number of matching results obtained has not reached a preset value, constructing a new training set and a new test set from the data set to be modified;
when the number of matching results obtained reaches the preset value, combining the matching results obtained with the preset data labels and calculating, for each data item in the data set to be modified, the amendment confidence of each label, so as to correct the data label of each data item in the data set to be modified.
2. The correction method according to claim 1, characterized in that the matching result output by the current matching model further records, for each input data item, the confidence ranking of each label.
3. The correction method according to claim 2, characterized in that combining the matching results obtained with the preset data labels and calculating, for each data item in the data set to be modified, the amendment confidence of each label, so as to correct the data label of each data item in the data set to be modified, comprises:
for each data item of the data set to be modified, obtaining the confidence level and confidence ranking of each label from the matching results obtained, calculating a weighted value of the confidence level and confidence ranking of each label, and combining it with the preset data label to obtain the amendment confidence of each label;
for each data item of the data set to be modified, taking the label with the highest amendment confidence as the new data label of that data item.
4. The correction method according to claim 3, characterized in that, for any data item of the data set to be modified, the amendment confidence, the confidence level of the label, the confidence ranking of the label and the preset data label satisfy the following relationship:
S(k) = α * f(k) + (1 - α) * g(k) + β * h(k)
where α and β are constants, k is any label, S(k) is the amendment confidence of label k, f(k) is the weighted average of the confidence levels of label k, and g(k) is the weighted average of the reciprocals of the confidence rankings of label k; h(k) = 1 if the preset data label of the data item is label k, and h(k) = 0 if the preset data label of the data item is not label k.
5. The correction method according to claim 1, characterized in that constructing a new training set and a new test set from the data set to be modified comprises the step of:
taking the data whose new data label is consistent with the preset data label as data in the new training set.
6. The correction method according to claim 1, characterized in that constructing a new training set and a new test set from the data of the data set to be modified comprises the steps of:
taking the current test set as the former test set and the current training set as the former training set;
taking the data that belong to the former test set and whose new data label is consistent with the data label before the most recent update as training component one, and the other data of the former test set as test component one;
obtaining, from the data belonging to the former training set, a number of data items equal to the number in training component one as test component two, and taking the other data of the former training set as training component two;
merging the data of training component one and training component two into the new training set;
merging the data of test component one and test component two into the new test set.
7. The correction method according to claim 6, characterized in that obtaining, from the data belonging to the former training set, a number of data items equal to the number in training component one as test component two, and taking the other data of the former training set as training component two, comprises the step of:
randomly obtaining, from the data belonging to the former training set, a number of data items equal to the number in training component one as test component two, and taking the other data of the former training set as training component two.
8. A data label correction device, characterized by comprising a loading module, a training module, a matching module, an update module and a correction module;
the loading module is configured to load a data set to be modified, wherein the data set to be modified includes a training set and a test set, and the data of the training set and of the test set are annotated with preset data labels;
the training module is configured to train a machine learning model on the current training set to obtain a matching model;
the matching module is configured to feed the data of the current test set as input data into the current matching model and obtain the matching result output by the current matching model, so as to update the data label of each input data item, wherein the matching result records, for each input data item, the confidence level of each label;
the update module is configured to construct a new training set and a new test set from the data of the data set to be modified when the number of matching results obtained has not reached a preset value;
the correction module is configured to, when the number of matching results obtained reaches the preset value, combine the matching results obtained with the preset data labels and calculate, for each data item in the data set to be modified, the amendment confidence of each label, so as to correct the data label of each data item in the data set to be modified.
9. A data label correction device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the data label correction method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the data label correction method according to any one of claims 1 to 7.
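For illustration only, the set reconstruction recited in claims 6 and 7 can be sketched as follows. The function name `rebuild_sets`, the `(item, label)` tuple layout and the `new_labels` mapping are hypothetical, and the sketch assumes that every item keeps the label it carried before the most recent update; the claims do not state which label travels with the item.

```python
import random
from typing import Dict, List, Tuple

Labeled = List[Tuple[str, str]]  # (item, label before the most recent update)


def rebuild_sets(old_train: Labeled, old_test: Labeled,
                 new_labels: Dict[str, str],
                 rng: random.Random) -> Tuple[Labeled, Labeled]:
    """Build a new training set and a new test set from the former ones.

    new_labels maps each former test item to the label assigned in the
    latest matching result.
    """
    # Former test data whose new label agrees with the previous label
    # form training component one; the rest form test component one.
    train_part_1 = [(x, y) for x, y in old_test if new_labels[x] == y]
    test_part_1 = [(x, y) for x, y in old_test if new_labels[x] != y]

    # Randomly move the same number of items out of the former training
    # set (claim 7); they form test component two.
    n_moved = min(len(train_part_1), len(old_train))
    moved = set(rng.sample(range(len(old_train)), n_moved))
    test_part_2 = [p for i, p in enumerate(old_train) if i in moved]
    train_part_2 = [p for i, p in enumerate(old_train) if i not in moved]

    return train_part_1 + train_part_2, test_part_1 + test_part_2


old_train = [("a", "cat"), ("b", "dog"), ("c", "cat")]
old_test = [("d", "dog"), ("e", "cat")]
new_labels = {"d": "dog", "e": "dog"}  # "e" disagrees with its previous label
new_train, new_test = rebuild_sets(old_train, old_test, new_labels,
                                   random.Random(0))
```

Because equal numbers of items cross in each direction, the sizes of the training set and the test set stay constant from round to round, provided the former training set holds at least as many items as training component one.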
CN201810649534.5A 2018-06-22 2018-06-22 Data label correction method, device and storage medium Active CN108897829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810649534.5A CN108897829B (en) 2018-06-22 2018-06-22 Data label correction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810649534.5A CN108897829B (en) 2018-06-22 2018-06-22 Data label correction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN108897829A true CN108897829A (en) 2018-11-27
CN108897829B CN108897829B (en) 2020-08-04

Family

ID=64345558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810649534.5A Active CN108897829B (en) 2018-06-22 2018-06-22 Data label correction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN108897829B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829490A (en) * 2019-01-22 2019-05-31 上海鹰瞳医疗科技有限公司 Modification vector searching method, objective classification method and equipment
CN109977255A (en) * 2019-02-22 2019-07-05 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium
CN109982137A (en) * 2019-02-22 2019-07-05 北京奇艺世纪科技有限公司 Model generating method, video marker method, apparatus, terminal and storage medium
CN110008372A (en) * 2019-02-22 2019-07-12 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111160484A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Data processing method and device, computer readable storage medium and electronic equipment
CN111753174A (en) * 2020-06-23 2020-10-09 北京字节跳动网络技术有限公司 Data processing method and device and electronic equipment
CN112016613A (en) * 2020-08-26 2020-12-01 广州市百果园信息技术有限公司 Training method and device for video content classification model, computer equipment and medium
CN113342799A (en) * 2021-08-09 2021-09-03 明品云(北京)数据科技有限公司 Data correction method and system
CN113496232A (en) * 2020-03-18 2021-10-12 杭州海康威视数字技术股份有限公司 Label checking method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138712A1 (en) * 2008-12-01 2010-06-03 Changki Lee Apparatus and method for verifying training data using machine learning
CN106033425A (en) * 2015-03-11 2016-10-19 富士通株式会社 A data processing device and a data processing method
CN106951925A (en) * 2017-03-27 2017-07-14 成都小多科技有限公司 Data processing method, device, server and system
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN108021931A (en) * 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 A kind of data sample label processing method and device
CN108062394A (en) * 2017-12-18 2018-05-22 北京中关村科金技术有限公司 The mask method and relevant apparatus of a kind of data set
CN108171335A (en) * 2017-12-06 2018-06-15 东软集团股份有限公司 Choosing method, device, storage medium and the electronic equipment of modeling data

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829490B (en) * 2019-01-22 2022-03-22 上海鹰瞳医疗科技有限公司 Correction vector searching method, target classification method and device
CN109829490A (en) * 2019-01-22 2019-05-31 上海鹰瞳医疗科技有限公司 Modification vector searching method, objective classification method and equipment
CN109977255A (en) * 2019-02-22 2019-07-05 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium
CN109982137A (en) * 2019-02-22 2019-07-05 北京奇艺世纪科技有限公司 Model generating method, video marker method, apparatus, terminal and storage medium
CN110008372A (en) * 2019-02-22 2019-07-12 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN110717039B (en) * 2019-09-17 2023-10-13 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer-readable storage medium
CN111160484A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Data processing method and device, computer readable storage medium and electronic equipment
CN111160484B (en) * 2019-12-31 2023-08-29 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer readable storage medium and electronic equipment
CN113496232A (en) * 2020-03-18 2021-10-12 杭州海康威视数字技术股份有限公司 Label checking method and device
CN113496232B (en) * 2020-03-18 2024-05-28 杭州海康威视数字技术股份有限公司 Label verification method and device
CN111753174A (en) * 2020-06-23 2020-10-09 北京字节跳动网络技术有限公司 Data processing method and device and electronic equipment
CN112016613A (en) * 2020-08-26 2020-12-01 广州市百果园信息技术有限公司 Training method and device for video content classification model, computer equipment and medium
CN113342799A (en) * 2021-08-09 2021-09-03 明品云(北京)数据科技有限公司 Data correction method and system

Also Published As

Publication number Publication date
CN108897829B (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant