CN108960409A - Labeled data generation method, equipment and computer readable storage medium - Google Patents
- Publication number: CN108960409A (application CN201810609646.8A)
- Authority: CN (China)
- Prior art keywords: data, labeled, collection, data set, labeled data
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The present invention provides a labeled data generation method, a device, and a computer readable storage medium. The labeled data generation method comprises the following steps. S100: obtain a full data set and a labeled data set of already-marked data contained in the full data set. S200: analyze the data features of the labeled data set, and manufacture a pseudo data set that conforms to those data features. S300: expand the pseudo data set with a GAN neural network to form an extended data set. S400: identify whether the data in the extended data set need to be labeled, and screen the labeled data to form a training data set. S500: perform neural network training on the training data set to form a training model. S600: use the training model to clean the data in the full data set outside the labeled data set, label the data that fit the training model, and merge them into the labeled data set. Starting from a small amount of data, the method can thus quickly and efficiently generate a training sample set that closely matches the sample data and has strong randomness, thereby expanding the amount of labeled data.
Description
Technical field
The present invention relates to the field of data models, and in particular to a labeled data generation method, a device, and a computer readable storage medium.
Background technique
With the rapid development of applications on intelligent terminals, artificial intelligence techniques built on those applications have entered people's lives ever more widely. Whether in daily use, gaming or work, such fields need to learn usage habits from original sample data in order to make intelligent judgments.
Deep neural network technology can be used to learn from raw sample data. It has developed rapidly in recent years and has achieved accuracy far beyond expectations in image recognition, with gratifying applications in many fields. In practical engineering, however, many specialized image recognition tasks lack data sets for training, and the accuracy of a deep neural network model depends heavily on the size and quality of its data set. To address the lack of training data, the prior art usually applies random cropping, rotation, stretching and flipping to the existing labeled data, but this has the following deficiencies:
1. For some models the raw images are small in height and width, so the amount of data that random cropping can add is limited.
2. When the raw sample data is scarce, data obtained by these methods is not dispersed enough in feature space and easily causes the model to overfit.
3. Some models are sensitive to stretched data, and the recognition rate drops noticeably after stretching.
4. Collecting and labeling data manually consumes a great deal of manpower and energy.
Therefore, a novel labeled data generation method is needed that can quickly generate a large training sample set with strong randomness even when the labeled sample data is scarce, simplifying the collection and labeling of training data.
Summary of the invention
To overcome the above technical defects, the object of the present invention is to provide a labeled data generation method, a device, and a computer readable storage medium that, starting from a small amount of data, quickly and efficiently generate a training sample set that closely matches the sample data and has strong randomness, thereby expanding the amount of labeled data.
The invention discloses a labeled data generation method comprising the following steps:
S100: obtain a full data set and a labeled data set of already-marked data contained in the full data set;
S200: analyze the data features of the labeled data set, and manufacture a pseudo data set that conforms to those data features;
S300: expand the pseudo data set with a GAN neural network to form an extended data set;
S400: identify whether the data in the extended data set need to be labeled, and screen the labeled data to form a training data set;
S500: perform neural network training on the training data set to form a training model;
S600: use the training model to clean the data in the full data set outside the labeled data set, label the data that fit the training model, and merge them into the labeled data set.
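The steps S100-S600 above can be sketched end to end. The following is a minimal, illustrative Python sketch over toy digit-string data; every function name, the stand-in "model" (just a digit vocabulary), and the stand-in expansion step are hypothetical simplifications, not the patent's actual GAN-based implementation:

```python
import itertools

def analyze_features(labeled):
    """S200 (analysis half): extract the digit units used by the labeled data."""
    return sorted({ch for datum in labeled for ch in datum})

def manufacture_pseudo(units, length=2, limit=20):
    """S200 (manufacture half): recombine units into pseudo data (toy rule)."""
    combos = ("".join(p) for p in itertools.product(units, repeat=length))
    return set(itertools.islice(combos, limit))

def expand(pseudo):
    """S300 stand-in: a real system would train a GAN here; this toy version
    just adds a reversed variant of each pseudo datum."""
    extended = set(pseudo)
    extended.update(d[::-1] for d in pseudo)
    return extended

def screen(extended, units):
    """S400: keep only extended data whose every unit matches the labeled features."""
    return {d for d in extended if all(ch in units for ch in d)}

def train(training_set):
    """S500 stand-in: the 'model' is just the unit vocabulary of the training set."""
    return {ch for d in training_set for ch in d}

def clean_and_label(universe, labeled, model):
    """S600: move data from U outside A into A when they fit the model."""
    newly = {d for d in universe - labeled if d and all(ch in model for ch in d)}
    return labeled | newly

U = {"25", "89", "2589", "17", "52"}   # full data set
A = {"2589"}                           # labeled data set
units = analyze_features(A)            # ['2', '5', '8', '9']
F = manufacture_pseudo(units)          # pseudo data set
G = expand(F)                          # extended data set
T = screen(G, units)                   # training data set
model = train(T)
A2 = clean_and_label(U, A, model)      # "25", "89", "52" join A; "17" does not
```

Steps S700-S800 below then iterate this pipeline until the labeled set is large enough.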
Preferably, the labeled data generation method further comprises the following steps:
S700: judge whether the amount of data in the labeled data set is greater than or equal to an expected data amount;
S800: when the amount of data in the labeled data set is less than the expected data amount, take the union of the training data set and the labeled data set, and execute steps S500-S600 again.
Preferably, step S800 is replaced by:
S800': when the amount of data in the labeled data set is less than the expected data amount, replace the data in the pseudo data set with the data in the labeled data set, and execute steps S300-S600 again.
Preferably, the labeled data generation method further comprises the following step:
S900: train other data sets outside the full data set based on the labeled data set formed in step S600 and/or the training data set.
Preferably, the step S300 of expanding the pseudo data set with a GAN neural network to form an extended data set comprises:
S310: build a generative model and a discriminative model;
S320: configure the discriminative model so that it outputs a discrimination probability greater than 0.5 for the data in the pseudo data set, and on that basis deep-learn the discrimination probabilities output for data outside the pseudo data set;
S330: have the generative model generate candidate extended data from the data in the pseudo data set;
S340: input the pseudo data set and the candidate extended data into the discriminative model;
S350: collect the data for which the discriminative model outputs a discrimination probability greater than 0.5 to form the extended data set.
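The thresholding of steps S340-S350 can be illustrated in isolation. A sketch, where `toy_discriminator` is a hypothetical stand-in for the trained discriminative model (a real one would be a neural network returning a probability):

```python
def form_extended_set(candidates, discriminate):
    """S340-S350: feed each candidate through the discriminative model and
    keep only those whose discrimination probability exceeds 0.5."""
    return [x for x in candidates if discriminate(x) > 0.5]

def toy_discriminator(s):
    # Hypothetical scorer: fraction of characters that are digits.
    return sum(c.isdigit() for c in s) / len(s)

G = form_extended_set(["2589", "25a9", "xyzw", "89"], toy_discriminator)
# "xyzw" scores 0.0 and is rejected; the rest exceed 0.5 and form G
```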
Preferably, the step S400 of identifying whether the data in the extended data set need to be labeled, and screening the labeled data to form a training data set, comprises:
S410: verify the data in the extended data set against the labeled data set and the data features;
S420: extract the data whose verification result is a recognized label, and delete from the extended data set the data whose verification result is not a recognized label.
Preferably, the step S410 of verifying the data in the extended data set against the labeled data set and the data features comprises:
S411: verify the data in the extended data set using the data in the labeled data set as templates;
S412: when the verification result for a datum agrees with more than half of the templates, or with all of them, determine that the verification result is a recognized label.
Preferably, the data features include one or more of: the background of the data, the number of units in the data, the gaps between digits in the data, the target of the data, and the noise of the data.
The invention also discloses a labeled data generating device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the labeled data generation method described above when executing the computer program.
The invention further discloses a computer readable storage medium on which a computer program is stored, the computer program implementing the labeled data generation method described above when executed by a processor.
Compared with the prior art, the above technical solution has the following beneficial effects:
1. Even when the amount of sample data is small, a data set containing a large amount of labeled data can be generated quickly;
2. The data has strong randomness and is not prone to overfitting, which improves the quality of the labeled data set;
3. The model generated from the pseudo data set identifies and labels other data, expanding the size and richness of the labeled data set; the process can be iterated forward in cycles, accelerating the training speed and accuracy of the deep neural network model.
Detailed description of the invention
Fig. 1 is a flow diagram of the labeled data generation method in an embodiment of the present invention;
Fig. 2 shows data of the pseudo data set in an embodiment of the present invention;
Fig. 3 is a flow diagram of the labeled data generation method in a further preferred embodiment of the present invention;
Fig. 4 is a flow diagram of the labeled data generation method in another further preferred embodiment of the present invention;
Fig. 5 is a flow diagram of step S300 of the labeled data generation method in an embodiment of the present invention;
Fig. 6 is a flow diagram of step S400 of the labeled data generation method in an embodiment of the present invention.
Specific embodiment
The advantages of the present invention are further explained below with specific embodiments in conjunction with the accompanying drawings.
Exemplary embodiments are described in detail here and illustrated in the accompanying drawings. In the following description, when drawings are referred to, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this disclosure; on the contrary, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "said" and "the" used in this disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the disclosure, the first information may also be called the second information, and similarly, the second information may also be called the first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In the description of the present invention, it should be understood that terms indicating orientation or positional relationships, such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, are merely for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the invention.
In the description of the present invention, unless otherwise specified and limited, it should be noted that the terms "installation", "connected" and "connection" are to be understood in a broad sense: the connection may, for example, be mechanical or electrical, may be internal to two elements, may be direct, or may be indirect through an intermediary. For those of ordinary skill in the art, the specific meaning of the above terms can be understood according to the concrete situation.
In the subsequent description, suffixes such as "module", "component" or "unit" used to denote elements are only intended to facilitate the explanation of the invention and have no specific meaning in themselves. Therefore, "module" and "component" may be used interchangeably.
Referring to Fig. 1, a flow diagram of the labeled data generation method in an embodiment of the present invention. The labeled data generation method collects and analyzes patterns of massive data found in an application program, in everyday life, in financial services or in social activity; by checking whether other data conform to the rules of the collected data model, the labeled data can be extended from discriminated data to generated data, achieving the purposes of look-ahead prediction and data processing in artificial intelligence. All data in a certain field, labeled or not, form a full data set U. The sample data, that is, the data that have already been labeled and whose labeling results have been confirmed by the user, form a labeled data set A within the full data set U. In this embodiment, a recognition model will be built from the full data set U and the labeled data set A, and the unlabeled data of the full data set U will be accurately labeled. Specifically, the method comprises the following steps:
S100: obtain a full data set U and a labeled data set A of already-marked data contained in U;
The full data set U, which contains all the data, may be the data set formed after collecting all the data included in a certain application program, such as digits, letters and characters, or the data set formed after collecting data such as a buyer's or user's activity level, active periods, online time and active duration. In some embodiments, because the data volume is extremely large, the data contained in the full data set U is essentially the set of all possible combinations of digits, letters, characters, pictures, etc. Regardless of the data volume, the full data set U includes a portion annotated by the user, that is, data whose authenticity has been confirmed; these labeled data serve as sample data and form the labeled data set A, which is a proper subset of U. Obtaining the full data set U and the labeled data set A is the basic operation for the subsequent steps.
S200: analyze the data features of the labeled data set A, and manufacture a pseudo data set F that conforms to those data features;
The labeled data set A is analyzed first. Since the only data the user currently possesses whose validity has been confirmed are those in A, A is the basis for expansion, and its data features must be analyzed first. These include, but are not limited to, one or more of: the background of the data in A, the number of units in the data, the gaps between the digits, the target of the data, and the noise of the data. As shown in Fig. 2, in one embodiment the data in the labeled data set A are collected from an application program. For the digit string "2589", the data features include: the gray scale and brightness of the background picture containing the digits 2, 5, 8 and 9, the fonts of these digits, and the gaps between them. After the data are analyzed one by one, every datum in A has its own corresponding data features.
Based on the data features obtained from the above analysis, other data are manufactured that conform to these features but differ from the data in the labeled data set A. When manufacturing, the digits used must be digits that appear in A (any one or more of 0-9, for example), the inter-digit gaps must match those of the data in A (preferably not exactly equal; an error of ±10%-15% may be allowed), and so on. As shown in Fig. 2, from the datum "2589" in the labeled data set, pseudo data such as "25", "52" and "89" can be produced. Because the pseudo data set is manufactured to match the data features of the labeled data, the units of the labeled data can also be recombined; when combining, the manufacturing conditions can be set arbitrarily according to the required amount of pseudo data. Understandably, if a large amount of pseudo data is needed, only the digit features of the labeled data need be matched, while the other data features can be chosen freely to form a large quantity of pseudo data, and vice versa.
Since the pseudo data in F are formed by recombining the units of the labeled data, F may contain data that equal no datum in the labeled data set. For example, the distinct digits extracted from A may include 1 and 6 without 1 and 6 ever being composed together into 16 or 61; in the pseudo data set F, however, the digits 1 and 6 can be combined freely into 16, 61, 116, 661, and so on. Therefore, as long as the labeled data contain the digit units 0-9, the pseudo data set can manufacture any real number from them according to the data features, which greatly expands the data volume of the pseudo data set F.
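The recombination described above — building pseudo data such as 16, 61, 116 and 661 from digit units 1 and 6 that never co-occur in the labeled data — can be sketched as follows; the length cap and the exhaustive enumeration are illustrative choices, not requirements of the method:

```python
from itertools import product

def recombine(units, max_len=3):
    """Recombine the digit units extracted from the labeled data set into
    pseudo data of every length up to max_len, including combinations that
    never appear together in the labeled data itself."""
    pseudo = set()
    for n in range(1, max_len + 1):
        pseudo.update("".join(p) for p in product(sorted(units), repeat=n))
    return pseudo

F = recombine({"1", "6"})
# F contains "16", "61", "116", "661", ...: 2 + 4 + 8 = 14 pseudo data in all
```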
S300: expand the pseudo data set F with a GAN neural network to form an extended data set G;
When it is formed, the pseudo data set F still conforms to the raw sample data, that is, to the data features of the labeled data, so it is a lateral extension of the labeled data. But when, within an application program, the background, inter-character gaps or fonts of data built from the same digits vary across devices, the data in F may not provide complete coverage. Therefore, on the basis of the pseudo data set F, a GAN neural network is used to expand it, guaranteeing the randomness and diversity of the data; the expanded collection forms an extended data set G. The expansion may add new data features and recombine the digits on the basis of the added features.
Using the principle of a GAN neural network, a fake sample set is formed by expanding the pseudo data set F, which is pitted against a module with a discrimination function on the basis of the genuine data set, the labeled data set A.
The module with the discrimination function solves a supervised binary classification problem: judging whether an input datum belongs to the labeled data set A (or the pseudo data set F) or to the extended data set G. During training, data from A and the extended data G produced by the generative network are fed into the discriminative model at random, and the discriminative model judges whether each input is genuine. Through this competitive machine learning mechanism, the performance of the generative network and the discriminative network improves continuously; when the parameters of both models are stable, training is complete.
After training, new extended data can be generated by the now-updated generative network, and these new extended data are closer to the data in the pseudo data set F and the labeled data set A.
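The adversarial loop described above can be caricatured without any deep-learning framework. In the sketch below, `ToyGenerator` and `ToyDiscriminator` are deliberately trivial stand-ins (a length preference and a length-matching score) for the generative and discriminative networks; only the control flow — generate, discriminate, feed the judgment back, then keep accepted samples — mirrors the method:

```python
import random

class ToyGenerator:
    """Stand-in generative network: emits digit strings and lengthens them
    whenever the discriminator scores them as unconvincing."""
    def __init__(self, rng):
        self.rng = rng
        self.pref_len = 1

    def sample(self):
        return "".join(self.rng.choice("0123456789") for _ in range(self.pref_len))

    def update(self, reward):
        if reward <= 0.5:                   # judged fake: adjust "parameters"
            self.pref_len = min(self.pref_len + 1, 8)

class ToyDiscriminator:
    """Stand-in discriminative network: scores by closeness to the average
    length of the genuine (pseudo-set) samples."""
    def __init__(self, real):
        self.avg = sum(map(len, real)) / len(real)

    def prob_real(self, s):
        return max(0.0, 1.0 - abs(len(s) - self.avg) / self.avg)

def adversarial_expand(pseudo_set, rounds=10, n_out=20, seed=0):
    rng = random.Random(seed)
    gen, disc = ToyGenerator(rng), ToyDiscriminator(pseudo_set)
    for _ in range(rounds):                 # the competitive training loop
        gen.update(disc.prob_real(gen.sample()))
    # after "training", keep generated samples the discriminator accepts (S350)
    return [s for s in (gen.sample() for _ in range(n_out))
            if disc.prob_real(s) > 0.5]

G = adversarial_expand(["2589", "5892", "8925"])
```

With the average genuine length at 4, the toy generator's preferred length grows until the toy discriminator stops rejecting its samples, after which every generated string passes the 0.5 threshold.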
S400: identify whether the data in the extended data set G need to be labeled, and screen the labeled data to form a training data set T;
To further ensure the fit between the extended data in G and the data in the labeled data set A, the training model obtained after GAN deep training is used to apply identification labels to the extended data, thereby cleaning it. Since the cleaning operates on the extended data set G, the data in the training data set T are a subset of the data of G; that is, T is a proper subset of G.
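This screening boils down to template verification against the labeled data (detailed later as steps S411-S412): a datum's label counts as recognized when more than half of the labeled templates, or all of them, agree. A sketch, where the `shares_unit` predicate is a hypothetical stand-in for the real per-template check:

```python
def verified(datum, templates, matches):
    """S411-S412 sketch: check the datum against every labeled template and
    recognize its label when more than half of the checks (or all) agree."""
    votes = sum(1 for t in templates if matches(datum, t))
    return votes == len(templates) or votes * 2 > len(templates)

def shares_unit(datum, template):
    # Hypothetical check: the datum reuses at least one of the template's digits.
    return bool(set(datum) & set(template))

templates = ["2589", "589", "17"]
kept = [d for d in ["25", "07", "34"] if verified(d, templates, shares_unit)]
# "25" matches 2 of 3 templates and is kept; "07" (1 of 3) and "34" (0) are deleted
```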
S500: perform neural network training on the training data set T to form a training model;
After cleaning, the data in T have a certain cleanliness and a certain degree of fit with the data in the labeled data set A. Besides all the data of A, T also contains the extended data that match A's data features, so its data volume exceeds that of A. Understandably, as the data volumes of the extended data set G and the training data set T grow, more extended data are manufactured and pass the screening in steps S300-S400. With more data, the training model formed by neural network training on T is richer: on one hand, it matches the data features of the data in A; on the other hand, it enriches the deformations of A's data under different backgrounds, fonts, spacings, etc., compensating for A's data scarcity. It is thus a training model based on A yet better than A.
S600: use the training model to clean the data in the full data set U outside the labeled data set A, label the data that fit the training model, and merge them into A.
With the training model in hand, the data are labeled according to the criteria it discriminates on — whether they match the data features, such as the digits, the inter-digit gaps, the digit background and the digit target, or variants of the digits and of the background. The other data in U (the data outside A) are analyzed and discriminated; according to the result, data that do not fit the training model can be washed away, while data that fit it are retained and merged into A. In this way, other data matching the original sample, that is, the data in A, are picked out of the full data set U with high accuracy, expanding the data volume of the original sample. Understandably, as the data volume of the original sample gradually expands, the amount of labeled data available for training also grows; after several iterations, the magnitude of the labeled data and the maturity of the training model grow exponentially, and subsequent discrimination of the same class of data can rely on big data and is accurate and efficient.
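The discrimination criteria of step S600 — digit features, inter-digit gaps, backgrounds — can be sketched as a tolerance check. The feature names and the ±15% band below are illustrative (the description only suggests a ±10%-15% error for gaps), not a prescribed rule:

```python
def fits_model(features, model, tol=0.15):
    """S600 sketch: a datum fits the training model when each numeric
    feature lies within +/- tol of the model's learned value."""
    return all(abs(features[k] - v) <= tol * v for k, v in model.items())

model = {"gap_px": 10.0, "height_px": 20.0}      # hypothetical learned features
candidates = [
    {"gap_px": 11.0, "height_px": 19.0},         # within the band: labeled
    {"gap_px": 14.0, "height_px": 20.0},         # gap off by 40%: washed away
]
labeled = [c for c in candidates if fits_model(c, model)]
```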
In different embodiments, if the data volume of the new labeled data set A' obtained after step S600 is insufficient and several iterations are needed to accumulate more data, different steps are executed.
Embodiment one
Referring to Fig. 3, in a preferred embodiment the labeled data generation method further comprises the following steps:
S700: judge whether the data volume in the labeled data set A' is greater than or equal to an expected data amount;
First, it must be judged whether the data volume of the currently obtained labeled data set A' and the quality of the training model satisfy the expected data amount or the expected quality; that is, whether the training model can accurately label the data when the data in the full data set U are finally analyzed and discriminated. In a verification mode, experimental judgment can be used to assess how well the data accurately labeled by the training model fit the labeled data.
S800: when the data volume in the labeled data set A is less than the expected data amount, take the union of the training data set T and the labeled data set A, and execute steps S500-S600 again.
When the data volume in A is judged to be less than the expected data amount, the data volume in A must be increased. Without re-running the GAN neural network to improve the quality of the extended data set G, it suffices to raise the data volume of the training data set T obtained in step S400. Therefore, the union of T and A is taken to obtain a new training data set T', i.e. T' = T ∪ A. Based on the new T', the iteration returns to step S500: neural network training is performed on the training data set again to form a new training model, and step S600 is executed again to obtain the post-iteration labeled data set A''. If the conditions for model training are still not satisfied, steps S500-S600 are iterated repeatedly until the final labeled data set A'' satisfies them.
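The loop of this embodiment — take T' = T ∪ A, retrain, relabel, repeat until the expected amount is reached — can be sketched as follows; `retrain` and `relabel` are hypothetical stand-ins (a digit vocabulary and a vocabulary filter over U) for steps S500 and S600:

```python
def iterate_until_enough(A, T, expected, retrain, relabel, max_iter=10):
    """Embodiment one sketch: while |A| is below the expected amount,
    set T' = T union A, retrain (S500) and re-label (S600), growing A."""
    for _ in range(max_iter):
        if len(A) >= expected:
            break
        T = T | A                       # T' = T union A
        model = retrain(T)              # S500: form a new training model
        A = A | relabel(model)          # S600: merge newly labeled data into A
    return A

U = {"25", "52", "89", "17"}
A_final = iterate_until_enough(
    A={"2589"}, T={"25"}, expected=4,
    retrain=lambda T: {c for d in T for c in d},
    relabel=lambda m: {d for d in U if all(c in m for c in d)})
```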
Embodiment two
In another embodiment, the labeled data generation method further comprises the following steps:
S700: judge whether the data volume in the labeled data set A' is greater than or equal to an expected data amount, exactly as in embodiment one;
S800': when the data volume in the labeled data set A is less than the expected data amount, replace the data in the pseudo data set F with the data in the labeled data set A, and execute steps S300-S600 again.
Unlike embodiment one, which trusts the result of the GAN neural network training, this embodiment holds that the GAN adversarial experiment needs to be run again. Therefore, when, after step S600 finishes, the data volume in the labeled data set A' is judged to be less than the expected data amount, the data in the pseudo data set F are replaced with the data in the labeled data set A, so that the original sample of the GAN network is reduced to the data in A. Although the data volume decreases, since the data in A are labeled data, the extended data set G obtained through the GAN network then also relies directly on A, and its quality is further improved.
After the data in F are replaced with the data in A, the foregoing steps S300-S600 are executed in sequence to obtain the post-iteration labeled data set A''. If A'' still cannot satisfy the conditions for model training, the iteration returns to S500 or S300 as needed, choosing embodiment one or embodiment two. Since the process may iterate more than once, the iteration manners of embodiments one and two can be freely combined, depending on the user's recognition of the quality of the pseudo data set F and of the training data set T.
Refering to Fig. 4, in this embodiment, S600, S800 or S800 are executed the step ' after, further includes:
S900: based on the labeled data collection A formed in step S600 and/or training dataset T training in addition to data complete or collected works
Other data sets
In this step, according to the user's needs, if the data universe U, when chosen, did not cover the data of all states of a certain application program, or only data within a certain time period were selected, then other data sets exist outside the data universe U. In that case, the labeled data set A and/or training data set T, which now hold a data volume of a certain scale, or the trained model, can be used to train and label those other data sets. It will be understood that, since training data set T is derived from labeled data set A, the data in labeled data set A may be used preferentially, with further data optionally joining the subsequent training according to data quality.
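A minimal sketch of S900 under these assumptions (the trained model is represented by a stand-in callable; all names are hypothetical):

```python
def label_other_datasets(model, other_datasets):
    """Sketch of S900: reuse the model trained from labeled set A /
    training set T to annotate data sets outside the data universe U,
    e.g. data from other time windows of the same application."""
    annotated = {}
    for name, data in other_datasets.items():
        # attach the model's label to every sample in the outside data set
        annotated[name] = {sample: model(sample) for sample in data}
    return annotated
```

Whatever model is handed in, the structure is the same: the model built inside U is simply applied to data that U never covered.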
Referring to Fig. 5, the expansion of dummy data set F based on the GAN neural network is specifically completed by the following steps:
S310: constructing a generation model and a discrimination model;
As described above, the module with the generating function is the generation model, which has a generating network; the module with the discriminating function is the discrimination model, which has a discriminating network. The generation model and the discrimination model carry out adversarial training based on the principle of the GAN neural network, so as to simultaneously improve the data generation precision of the generation model and the data discrimination accuracy of the discrimination model.
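In heavily simplified form, the two modules can be sketched as one generating network and one discriminating network; the layer shapes and names below are illustrative only, and a real GAN would train both adversarially:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    """Generating network: maps noise z to a synthetic sample (one tanh layer)."""
    return np.tanh(z @ w)

def discriminator(x, v):
    """Discriminating network: probability that x comes from the dummy set F."""
    return 1.0 / (1.0 + np.exp(-(x @ v)))

# illustrative wiring: 3-d noise -> 5-d samples -> scalar probability
w = rng.normal(size=(3, 5))   # generator weights
v = rng.normal(size=5)        # discriminator weights
samples = generator(rng.normal(size=(4, 3)), w)
probs = discriminator(samples, v)
```

During adversarial training, the generator's weights are pushed to raise these probabilities while the discriminator's weights are pushed to lower them for generated samples.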
S320: configuring the discrimination model to output a discrimination probability value greater than 0.5 for the data in the dummy data set, and, based on deep learning over the discrimination probability values of the data in the dummy data set, to output discrimination probability values for the data outside the dummy data set;
In the initial training, the objects fed into the discriminating network for judgment are the data in dummy data set F, and the data in dummy data set F are data manufactured by the user to expand the diversity of the data. The discriminating network judges each input as true (from dummy data set F) or false (expanded data produced by the generating network); the judgment is compared with the ground truth, the error is propagated back to the model parameters, and this continuously iterated process lets the discriminating network learn the difference between true and false data. By judging labeled data, the discrimination model is trained up from a blank state with respect to the data, so that when the user later inputs data whose authenticity is unknown, the model can verify the likelihood that those data are true or false on the basis of the preceding data. That is to say, deep learning based on the discrimination probability values of the data in dummy data set F realizes the output of discrimination probability values for the data outside dummy data set F, i.e. the expanded data, thereby discriminating the authenticity of the data at the same time as expanding them.
S330: the generation model generating a to-be-extended data set based on the data in dummy data set F;
On the generation model side, based on the data in dummy data set F, the numbers, spacing, fonts, combination forms and the like in the data are deformed and then recombined, so as to generate the to-be-extended data set.
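A toy illustration of this deform-and-recombine idea, working on digit strings (everything here is an assumed stand-in, not the patent's actual generator):

```python
import random

def deform(sample, rng):
    """Swap a pair of digits and re-space them: a crude stand-in for the
    number / spacing / font / combination deformations described above."""
    digits = list(sample.replace(" ", ""))
    i, j = rng.randrange(len(digits)), rng.randrange(len(digits))
    digits[i], digits[j] = digits[j], digits[i]   # reorder a digit pair
    gap = " " * rng.randint(0, 2)                 # vary the spacing
    return gap.join(digits)

def generate_candidates(dummy_set, per_sample=3, seed=0):
    """S330: manufacture the to-be-extended data set from dummy set F."""
    rng = random.Random(seed)
    return {deform(s, rng) for s in dummy_set for _ in range(per_sample)}
```

Each candidate keeps the content of its source sample while varying its presentation, which is what makes the expanded data diverse yet still close to F.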
S340: the generation model inputting the dummy data set and the to-be-extended data set into the discrimination model;
With the formation of the to-be-extended data set, labeled data set A has gradually been expanded into dummy data set F and the to-be-extended data set. Dummy data set F and the to-be-extended data set are then input into the discrimination model, which discriminates each datum and outputs its true/false probability value.
S350: collecting the data whose discrimination probability value output by the discrimination model is greater than 0.5, to form the extended data set.
Since the discrimination model trains its discrimination standard on the data in dummy data set F, collecting the data whose discrimination probability value output by the discrimination model is greater than 0.5 (i.e. all the data in dummy data set F, plus the portion of the to-be-extended data set that meets the discrimination standard) and taking the union yields the extended data set G.
The extended data set G obtained through the above steps both remains close to the characteristics of the data in labeled data set A and, at the same time, constitutes a random expansion and diversification of the data of dummy data set F; that is, it expands the data volume of labeled data set A in an orderly manner.
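Steps S340-S350 reduce to scoring the union of the two sets and thresholding at 0.5; a sketch with a stand-in discriminator (in this toy, "real-looking" simply means divisible by 3):

```python
def form_extended_set(dummy, candidates, discriminate):
    """S340-S350: feed dummy set F and the to-be-extended data to the
    discrimination model, keep everything scoring above 0.5, and take
    the union as extended data set G."""
    return {x for x in dummy | candidates if discriminate(x) > 0.5}

# stand-in discriminator trained on the dummy set: high score iff x % 3 == 0
score = lambda x: 0.9 if x % 3 == 0 else 0.2
G = form_extended_set({0, 3, 6}, {3, 4, 9, 10}, score)
```

All of the dummy set survives (the discriminator was trained to accept it), and only the candidates that match its standard are added, which is exactly the union described above.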
Referring to Fig. 6, in this embodiment, step S400 is specifically completed by the following steps:
S410: verifying the data in extended data set G according to the data characteristics of labeled data set A;
In this step, an existing model uses the data characteristics of labeled data set A to perform identification and marking on the data in the newly generated extended data set G, verifying whether the data in extended data set G should undergo the subsequent identification operation.
The verification may use a voting method, i.e. two rounds of identification based on the principle of majority rule or of unanimity, respecting the verification result reached by the majority (or the whole) of multiple arbiters built on the data characteristics of labeled data set A. That is, step S410, verifying the data in extended data set G according to the data characteristics of labeled data set A, includes:
S411: verifying the data in extended data set G using the data in labeled data set A as models; S412: when more than half, or all, of the models agree on the verification result, deciding the verification result to be an identification mark.
S420: extracting the data whose verification result is an identification mark, and deleting from the extended data set the data whose verification result is not an identification mark.
Verification results comprise identification marks and non-identification marks. Data bearing an identification mark, i.e. data matching the data characteristics of labeled data set A, are retained in extended data set G, while data whose verification result is a non-identification mark are deleted, thereby performing data cleansing.
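The voting check of S410-S420 can be sketched as follows, with the arbiters as stand-in predicates assumed to encode A's data characteristics:

```python
def vote_verify(sample, arbiters, unanimous=False):
    """S411-S412: each arbiter votes; the sample gets an identification
    mark when more than half (or, optionally, all) of them agree."""
    votes = sum(1 for arbiter in arbiters if arbiter(sample))
    needed = len(arbiters) if unanimous else len(arbiters) // 2 + 1
    return votes >= needed

def clean_extended_set(extended, arbiters):
    """S420: keep identified samples, delete the rest (data cleansing)."""
    return {x for x in extended if vote_verify(x, arbiters)}
```

Switching `unanimous` between `False` and `True` selects between the majority-rule and unanimity variants the description mentions.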
The labeled data generation method in any of the above embodiments can be applied in a labeled data generating device, such as an intelligent terminal, server, workstation, virtual server or virtual workstation. The labeled data generating device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the computer program implements the above labeled data generation method.
The above labeled data generation method may alternatively be integrated in a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the labeled data generation method described above. The computer-readable storage medium may be embodied in the form of software, a virtual file, etc.
Intelligent terminals can be implemented in a variety of forms. For example, the terminal described in the present invention may include mobile intelligent terminals such as a mobile phone, smart phone, laptop, PDA (personal digital assistant), PAD (tablet computer), PMP (portable multimedia player) or navigation device, as well as fixed terminals such as a digital TV or desktop computer. Hereinafter it is assumed that the terminal is an intelligent terminal. However, those skilled in the art will understand that, apart from elements used in particular for mobile purposes, constructions according to embodiments of the present invention can also be applied to terminals of the fixed type.
It should be noted that the embodiments of the present invention are preferable implementations and impose no limitation of any form on the present invention. Any equivalent effective embodiment that a person skilled in the art obtains by change or modification using the technical content disclosed above, and any modification or equivalent variation of the above embodiments according to the technical essence of the present invention, still falls within the scope of the technical solution of the present invention, as long as it does not depart from the content of the technical solution of the present invention.
Claims (10)
1. A labeled data generation method, characterized by comprising the following steps:
S100: obtaining a data universe and a labeled data set contained in the data universe that has been marked;
S200: analyzing the data characteristics of the labeled data set, and manufacturing, according to the data characteristics, a dummy data set that meets the data characteristics;
S300: expanding the dummy data set based on a GAN neural network to form an extended data set;
S400: identifying whether the data in the extended data set need to be marked, and screening the marked data to form a training data set;
S500: performing neural network training on the training data set to form a training model;
S600: cleansing, based on the training model, the data in the data universe other than the labeled data set, marking the data that conform to the training model, and including them in the labeled data set.
2. The labeled data generation method according to claim 1, characterized in that
the labeled data generation method further comprises the following steps:
S700: judging whether the data volume in the labeled data set is greater than or equal to an expected data volume;
S800: when the data volume in the labeled data set is less than the expected data volume, taking the union of the training data set and the labeled data set, and executing steps S500-S600 again.
3. The labeled data generation method according to claim 2, characterized in that
step S800 is replaced by:
S800': when the data volume in the labeled data set is less than the expected data volume, replacing the data in the dummy data set with the data in the labeled data set, and executing steps S300-S600 again.
4. The labeled data generation method according to claim 1, characterized in that
the labeled data generation method further comprises the following step:
S900: training, based on the labeled data set formed in step S600 and/or the training data set, other data sets outside the data universe.
5. The labeled data generation method according to claim 1, characterized in that
step S300, expanding the dummy data set based on the GAN neural network to form an extended data set, comprises:
S310: constructing a generation model and a discrimination model;
S320: configuring the discrimination model to output a discrimination probability value greater than 0.5 for the data in the dummy data set, and, based on deep learning over the discrimination probability values of the data in the dummy data set, to output discrimination probability values for the data outside the dummy data set;
S330: the generation model generating a to-be-extended data set based on the data in the dummy data set;
S340: the generation model inputting the dummy data set and the to-be-extended data set into the discrimination model;
S350: collecting the data whose discrimination probability value output by the discrimination model is greater than 0.5 to form the extended data set.
6. The labeled data generation method according to claim 1, characterized in that
step S400, identifying whether the data in the extended data set need to be marked and screening the marked data to form a training data set, comprises:
S410: verifying the data in the extended data set according to the labeled data set and the data characteristics;
S420: extracting the data whose verification result is an identification mark, and deleting from the extended data set the data whose verification result is not an identification mark.
7. The labeled data generation method according to claim 6, characterized in that
step S410, verifying the data in the extended data set according to the labeled data set and the data characteristics, comprises:
S411: verifying the data in the extended data set using the data in the labeled data set as models;
S412: when more than half, or all, of the models agree on the verification result for the data, deciding the verification result to be an identification mark.
8. The labeled data generation method according to claim 1, characterized in that
the data characteristics comprise one or more of: the background of the data, the number of digits of the data, the spacing of the digits of the data, the target of the data, and the noise of the data.
9. A labeled data generating device, the labeled data generating device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the labeled data generation method according to any one of claims 1-8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the labeled data generation method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810609646.8A CN108960409B (en) | 2018-06-13 | 2018-06-13 | Method and device for generating annotation data and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108960409A true CN108960409A (en) | 2018-12-07 |
CN108960409B CN108960409B (en) | 2021-08-03 |
Family
ID=64488602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810609646.8A Active CN108960409B (en) | 2018-06-13 | 2018-06-13 | Method and device for generating annotation data and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108960409B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107622056A (en) * | 2016-07-13 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | The generation method and device of training sample |
CN107392125A (en) * | 2017-07-11 | 2017-11-24 | 中国科学院上海高等研究院 | Training method/system, computer-readable recording medium and the terminal of model of mind |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | 广西师范大学 | Image automatic annotation method based on semi-supervised learning |
CN108009589A (en) * | 2017-12-12 | 2018-05-08 | 腾讯科技(深圳)有限公司 | Sample data processing method, device and computer-readable recording medium |
Non-Patent Citations (1)
Title |
---|
Wang Kunfeng: "Generative Adversarial Networks GAN: Research Progress and Prospects", Acta Automatica Sinica *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109816019A (en) * | 2019-01-25 | 2019-05-28 | 上海小萌科技有限公司 | A kind of image data automation auxiliary mask method |
CN109978029A (en) * | 2019-03-13 | 2019-07-05 | 北京邮电大学 | A kind of invalid image pattern screening technique based on convolutional neural networks |
CN109978029B (en) * | 2019-03-13 | 2021-02-09 | 北京邮电大学 | Invalid image sample screening method based on convolutional neural network |
CN110189351A (en) * | 2019-04-16 | 2019-08-30 | 浙江大学城市学院 | A kind of scratch image data amplification method based on production confrontation network |
CN112257731A (en) * | 2019-07-05 | 2021-01-22 | 杭州海康威视数字技术股份有限公司 | Virtual data set generation method and device |
CN110569379A (en) * | 2019-08-05 | 2019-12-13 | 广州市巴图鲁信息科技有限公司 | Method for manufacturing picture data set of automobile parts |
CN110874484A (en) * | 2019-10-16 | 2020-03-10 | 众安信息技术服务有限公司 | Data processing method and system based on neural network and federal learning |
JP7461699B2 (en) | 2019-10-31 | 2024-04-04 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Artificial Intelligence Transparency |
US11651276B2 (en) | 2019-10-31 | 2023-05-16 | International Business Machines Corporation | Artificial intelligence transparency |
WO2021084471A1 (en) * | 2019-10-31 | 2021-05-06 | International Business Machines Corporation | Artificial intelligence transparency |
CN111143617A (en) * | 2019-12-12 | 2020-05-12 | 浙江大学 | Automatic generation method and system for picture or video text description |
CN111177132A (en) * | 2019-12-20 | 2020-05-19 | 中国平安人寿保险股份有限公司 | Label cleaning method, device, equipment and storage medium for relational data |
CN111382785A (en) * | 2020-03-04 | 2020-07-07 | 武汉精立电子技术有限公司 | GAN network model and method for realizing automatic cleaning and auxiliary marking of sample |
CN111382785B (en) * | 2020-03-04 | 2023-09-01 | 武汉精立电子技术有限公司 | GAN network model and method for realizing automatic cleaning and auxiliary marking of samples |
CN111476324B (en) * | 2020-06-28 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Traffic data labeling method, device, equipment and medium based on artificial intelligence |
CN111476324A (en) * | 2020-06-28 | 2020-07-31 | 平安国际智慧城市科技股份有限公司 | Traffic data labeling method, device, equipment and medium based on artificial intelligence |
CN111741018B (en) * | 2020-07-24 | 2020-12-01 | 中国航空油料集团有限公司 | Industrial control data attack sample generation method and system, electronic device and storage medium |
CN111741018A (en) * | 2020-07-24 | 2020-10-02 | 中国航空油料集团有限公司 | Industrial control data attack sample generation method and system, electronic device and storage medium |
CN112308167A (en) * | 2020-11-09 | 2021-02-02 | 上海风秩科技有限公司 | Data generation method and device, storage medium and electronic equipment |
CN112508000A (en) * | 2020-11-26 | 2021-03-16 | 上海展湾信息科技有限公司 | Method and equipment for generating OCR image recognition model training data |
CN112508000B (en) * | 2020-11-26 | 2023-04-07 | 上海展湾信息科技有限公司 | Method and equipment for generating OCR image recognition model training data |
CN112580310A (en) * | 2020-12-28 | 2021-03-30 | 河北省讯飞人工智能研究院 | Missing character/word completion method and electronic equipment |
CN112580310B (en) * | 2020-12-28 | 2023-04-18 | 河北省讯飞人工智能研究院 | Missing character/word completion method and electronic equipment |
CN113239205B (en) * | 2021-06-10 | 2023-09-01 | 阳光保险集团股份有限公司 | Data labeling method, device, electronic equipment and computer readable storage medium |
CN113239205A (en) * | 2021-06-10 | 2021-08-10 | 阳光保险集团股份有限公司 | Data annotation method and device, electronic equipment and computer readable storage medium |
CN114926709A (en) * | 2022-05-26 | 2022-08-19 | 成都极米科技股份有限公司 | Data labeling method and device and electronic equipment |
CN116451087A (en) * | 2022-12-20 | 2023-07-18 | 石家庄七彩联创光电科技有限公司 | Character matching method, device, terminal and storage medium |
CN116451087B (en) * | 2022-12-20 | 2023-12-26 | 石家庄七彩联创光电科技有限公司 | Character matching method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108960409B (en) | 2021-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108960409A (en) | Labeled data generation method, equipment and computer readable storage medium | |
CN109145939B (en) | Semantic segmentation method for small-target sensitive dual-channel convolutional neural network | |
CN108074244B (en) | Safe city traffic flow statistical method integrating deep learning and background difference method | |
CN109948425A (en) | A kind of perception of structure is from paying attention to and online example polymerize matched pedestrian's searching method and device | |
CN112199608B (en) | Social media rumor detection method based on network information propagation graph modeling | |
CN107291688A (en) | Judgement document's similarity analysis method based on topic model | |
CN110532379B (en) | Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis | |
CN107945153A (en) | A kind of road surface crack detection method based on deep learning | |
CN108960499A (en) | A kind of Fashion trend predicting system merging vision and non-vision feature | |
CN102262642B (en) | Web image search engine and realizing method thereof | |
CN105893483A (en) | Construction method of general framework of big data mining process model | |
CN109086794B (en) | Driving behavior pattern recognition method based on T-LDA topic model | |
WO2022062419A1 (en) | Target re-identification method and system based on non-supervised pyramid similarity learning | |
CN111210111B (en) | Urban environment assessment method and system based on online learning and crowdsourcing data analysis | |
CN109002492A (en) | A kind of point prediction technique based on LightGBM | |
CN109063649A (en) | Pedestrian's recognition methods again of residual error network is aligned based on twin pedestrian | |
CN108764282A (en) | A kind of Class increment Activity recognition method and system | |
CN109413023A (en) | The training of machine recognition model and machine identification method, device, electronic equipment | |
CN106960017A (en) | E-book is classified and its training method, device and equipment | |
CN110737805B (en) | Method and device for processing graph model data and terminal equipment | |
CN108647800A (en) | A kind of online social network user missing attribute forecast method based on node insertion | |
CN112528934A (en) | Improved YOLOv3 traffic sign detection method based on multi-scale feature layer | |
CN109949174A (en) | A kind of isomery social network user entity anchor chain connects recognition methods | |
CN112733602B (en) | Relation-guided pedestrian attribute identification method | |
CN108229567A (en) | Driver identity recognition methods and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||