CN108960409A - Labeled data generation method, equipment and computer readable storage medium - Google Patents

Labeled data generation method, equipment and computer readable storage medium

Info

Publication number
CN108960409A
CN108960409A (Application No. CN201810609646.8A)
Authority
CN
China
Prior art keywords
data
labeled
collection
data set
labeled data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810609646.8A
Other languages
Chinese (zh)
Other versions
CN108960409B (en)
Inventor
郑斌
徐晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Black Shark Technology Co Ltd
Original Assignee
Nanchang Black Shark Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Black Shark Technology Co Ltd filed Critical Nanchang Black Shark Technology Co Ltd
Priority to CN201810609646.8A priority Critical patent/CN108960409B/en
Publication of CN108960409A publication Critical patent/CN108960409A/en
Application granted granted Critical
Publication of CN108960409B publication Critical patent/CN108960409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a labeled data generation method, a device, and a computer-readable storage medium. The labeled data generation method comprises the following steps: S100: obtaining a full data set and a labeled data set of already-labeled data contained in the full data set; S200: analyzing the data features of the labeled data set, and manufacturing, according to the data features, a pseudo data set that conforms to those data features; S300: expanding the pseudo data set based on a GAN neural network to form an extended data set; S400: identifying whether the data in the extended data set need to be labeled, and screening the labeled data to form a training data set; S500: performing neural network training on the training data set to form a training model; S600: cleaning the data in the full data set outside the labeled data set based on the training model, labeling the data that fit the training model, and adding them to the labeled data set. In this way, starting from a small amount of data, a training sample set that matches the sample data closely and has strong randomness can be generated quickly and efficiently, thereby expanding the amount of labeled data.

Description

Labeled data generation method, equipment and computer readable storage medium
Technical field
The present invention relates to the field of data models, and more particularly to a labeled data generation method, a device, and a computer-readable storage medium.
Background technique
With the rapid development of applications on intelligent terminals, artificial intelligence technologies built on top of those applications have entered people's lives ever more widely. Fields such as everyday use, gaming, and work all need to learn from original sample data to understand the usage habits of the field and make intelligent judgments.
Deep neural network technology can be used to learn from raw sample data. It has developed rapidly in recent years, achieving accuracy far beyond expectations in image recognition and gratifying applications in many fields. In practical engineering, however, many specialized image recognition tasks lack data sets for training, and the accuracy of a deep neural network model depends heavily on the size and quality of its data set. To cope with the lack of training data, the prior art usually performs random cropping, rotation, stretching, and flipping on the existing labeled data, but this has the following shortcomings:
1. For some models the original images are small in height and width, so the amount of data that random cropping can add is limited.
2. When the original sample data are few, the data obtained by these methods are not dispersed enough in feature space and easily cause the model to overfit.
3. Some models are sensitive to stretched data, and the recognition rate drops noticeably after stretching.
4. Collecting and labeling data manually consumes a great deal of manpower and effort.
Therefore, a new labeled data generation method is needed that can quickly generate a large training sample set with strong randomness when the labeled sample data are few, simplifying the collection and labeling of training data.
Summary of the invention
In order to overcome the above technical defects, the object of the present invention is to provide a labeled data generation method, a device, and a computer-readable storage medium that, based on a small amount of data, quickly and efficiently generate a training sample set that matches the sample data closely and has strong randomness, thereby expanding the amount of labeled data.
The invention discloses a labeled data generation method comprising the following steps:
S100: obtaining a full data set and a labeled data set of already-labeled data contained in the full data set;
S200: analyzing the data features of the labeled data set, and manufacturing, according to the data features, a pseudo data set that conforms to the data features;
S300: expanding the pseudo data set based on a GAN neural network to form an extended data set;
S400: identifying whether the data in the extended data set need to be labeled, and screening the labeled data to form a training data set;
S500: performing neural network training on the training data set to form a training model;
S600: cleaning the data in the full data set outside the labeled data set based on the training model, labeling the data that fit the training model, and adding them to the labeled data set.
Preferably, the labeled data generation method further comprises the following steps:
S700: judging whether the data volume of the labeled data set is greater than or equal to an expected data amount;
S800: when the data volume of the labeled data set is less than the expected data amount, taking the union of the training data set and the labeled data set, and executing steps S500-S600 again.
Preferably, step S800 is replaced by:
S800': when the data volume of the labeled data set is less than the expected data amount, replacing the data in the pseudo data set with the data in the labeled data set, and executing steps S300-S600 again.
Preferably, the labeled data generation method further comprises the following steps:
S900: training other data sets outside the full data set based on the labeled data set formed in step S600 and/or the training data set.
Preferably, expanding the pseudo data set based on the GAN neural network to form the extended data set (step S300) comprises:
S310: constructing a generative model and a discriminative model;
S320: configuring the discriminative model to output a discrimination probability value greater than 0.5 for the data in the pseudo data set, and, based on the discrimination probability values for the data in the pseudo data set, deep-learning the output of discrimination probability values for the data outside the pseudo data set;
S330: the generative model generating a candidate extension data set based on the data in the pseudo data set;
S340: inputting the pseudo data set and the candidate extension data set into the discriminative model;
S350: collecting the data for which the discriminative model outputs a discrimination probability value greater than 0.5 to form the extended data set.
Preferably, identifying whether the data in the extended data set need to be labeled and screening the labeled data to form the training data set (step S400) comprises:
S410: verifying the data in the extended data set according to the labeled data set and the data features;
S420: extracting the data whose verification result is "recognized and labeled", and deleting from the extended data set the data whose verification result is not "recognized and labeled".
Preferably, verifying the data in the extended data set according to the labeled data set and the data features (step S410) comprises:
S411: verifying the data in the extended data set using the data in the labeled data set as models;
S412: when more than half of or all of the models agree on the verification result for a data item, determining that the verification result is "recognized and labeled".
Preferably, the data features include one or more of: the background of the data, the number of units in the data, the gaps between digits in the data, the target of the data, and the noise of the data.
The invention also discloses a labeled data generation device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the labeled data generation method described above when executing the computer program.
The invention further discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the labeled data generation method described above.
Compared with the prior art, the above technical solution has the following beneficial effects:
1. Even when the amount of sample data is small, a data set containing a large amount of labeled data can be generated quickly.
2. The data have strong randomness and are not prone to overfitting, improving the quality of the labeled data set.
3. Other data are recognized and labeled by the model generated from the pseudo data set, expanding the size and richness of the labeled data set; the process can iterate forward in a loop, accelerating the training speed and accuracy of the deep neural network model.
Detailed description of the invention
Fig. 1 is a flow diagram of the labeled data generation method in an embodiment of the present invention;
Fig. 2 shows data of the pseudo data set in an embodiment of the present invention;
Fig. 3 is a flow diagram of the labeled data generation method in a further preferred embodiment of the present invention;
Fig. 4 is a flow diagram of the labeled data generation method in another further preferred embodiment of the present invention;
Fig. 5 is a flow diagram of step S300 of the labeled data generation method in an embodiment of the present invention;
Fig. 6 is a flow diagram of step S400 of the labeled data generation method in an embodiment of the present invention.
Specific embodiment
The advantages of the present invention are further explained below with reference to the accompanying drawings and specific embodiments.
Exemplary embodiments are described in detail here and illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this disclosure; rather, they are merely examples of devices and methods consistent with some aspects of this disclosure as detailed in the appended claims.
The terms used in this disclosure are for the purpose of describing particular embodiments only and are not intended to limit the disclosure. The singular forms "a", "said", and "the" used in this disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In the description of the present invention, it should be understood that orientation or positional terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings, are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the invention.
In the description of the present invention, unless otherwise specified and limited, it should be noted that the terms "installed", "connected", and "coupled" are to be understood broadly; for example, a connection may be mechanical or electrical, may be an internal connection between two elements, may be direct, or may be indirect through an intermediary. For those of ordinary skill in the art, the specific meaning of the above terms can be understood according to the specific situation.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are only intended to facilitate the explanation of the invention and have no specific meaning in themselves. Therefore, "module" and "component" may be used interchangeably.
Referring to Fig. 1, a flow diagram of the labeled data generation method in an embodiment of the present invention is shown. In this labeled data generation method, data of large volume in an application program, in daily life, in financial services, in social activity, and the like are modeled and analyzed; according to the collected data model, other data are labeled as to whether they conform to the rules of that data model; and, further, the method can extend from discriminating data to generating data, so as to achieve the purposes of look-ahead and data processing in artificial intelligence. All data in a certain field, including labeled and unlabeled data, form a full data set U. The sample data, i.e., the labeled data, are the data in the full data set U that have been acquired and for which the user has determined the labeling result; these data form a labeled data set A. In this embodiment, a recognition model is built based on the full data set U and the labeled data set A, and the unlabeled data in the full data set U are labeled accurately. Specifically, the method comprises the following steps:
S100: obtaining a full data set U and a labeled data set A of already-labeled data contained in the full data set U;
The full data set U contains all the data. It may be the data set formed by collecting all the data included in a certain application program, such as digits, letters, and characters, or the data set formed by collecting data such as buyers, users, activity level, active time, online time, and active duration. In some embodiments, because the data volume is extremely large, the data contained in the full data set U are essentially equivalent to the data set of all possible combinations of digits, letters, characters, pictures, and so on. Regardless of its data volume, the full data set U includes a part that has been labeled by the user, i.e., data whose authenticity has been confirmed; these labeled data serve as sample data and together form the labeled data set A, which is a proper subset of the full data set U. Obtaining the full data set U and the labeled data set A is the basis for the subsequent steps.
S200: analyzing the data features of the labeled data set A, and manufacturing, according to the data features, a pseudo data set F that conforms to the data features;
The labeled data set A is then analyzed. Specifically, since the only data the user currently possesses whose validity is confirmed are the data in the labeled data set A, the labeled data set A is used as the basis for expansion, and its data features must first be analyzed, including but not limited to one or more of: the background of the data, the number of units in the data, the gaps between digits in the data, the target of the data, and the noise of the data in the labeled data set A. As shown in Fig. 2, in one embodiment, for the digit string "2589" among the data of the labeled data set A collected from an application program, the data features include the grayscale and brightness of the background picture, the presence of the digits 2, 5, 8, and 9, and, for those digits, their fonts, the gaps between them, and so on. After this analysis, each data item in the labeled data set A has its own corresponding data features.
Based on the data features obtained from the above analysis, other data that conform to these data features but differ from the data in the labeled data set A are manufactured. When manufacturing, the digits used must be digits that appear in the data of the labeled data set A (e.g., any one or more of 0-9), and the gaps between digits must conform to the gaps in the data of the labeled data set A (preferably not exactly equal, e.g., within an error of ±10%-15%), and so on. As shown in Fig. 2, based on the data item "2589" in the labeled data set, pseudo data such as "25", "52", and "89" can be produced. Since the pseudo data set is manufactured to conform to the data features of the labeled data, the units of the data in the labeled data can also be recombined, and the manufacturing conditions can be set arbitrarily according to the amount of pseudo data required. It can be understood that if a large amount of pseudo data is needed, only the digit features of the labeled data may be extracted while the other data features are chosen arbitrarily, so as to form a pseudo data set of large volume, and vice versa.
Because the pseudo data in the pseudo data set F are obtained by recombining the units of the data in the labeled data, the pseudo data set F may contain pseudo data that are not equal to any data in the labeled data. For example, the distinct digits extracted from the labeled data set A include 1 and 6, but 1 and 6 never appear together to form the data 16 or 61; in the pseudo data set F, however, arbitrary combinations of the digits 1 and 6 such as 16, 61, 116, and 661 may be included. Therefore, when the data in the labeled data set contain the digit units 0-9, the pseudo data set can, on this basis and according to the data features, manufacture any number, which greatly expands the data volume of the pseudo data set F.
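By way of illustration only (not part of the patent), the following Python sketch shows one way step S200 could be realized for digit-string data like the "2589" example: digit units and digit gaps are extracted from the labeled data set A, and pseudo data are manufactured by recombining those digits while jittering the gap within ±15%. The feature representation (a text field plus a pixel gap) and all function and parameter names are assumptions made for this sketch.

    import random

    def extract_features(labeled_set):
        """Collect the digit units and typical digit gaps seen in the labeled data (S200, analysis)."""
        digits, gaps = set(), []
        for item in labeled_set:
            digits.update(item["text"])          # e.g. "2589" -> {'2', '5', '8', '9'}
            gaps.append(item["digit_gap_px"])    # spacing between digits, in pixels
        return {"digits": sorted(digits), "mean_gap": sum(gaps) / len(gaps)}

    def make_pseudo_set(features, count, max_len=4, gap_jitter=0.15):
        """Manufacture pseudo data that conform to the extracted features (S200, manufacture)."""
        pseudo = []
        for _ in range(count):
            length = random.randint(2, max_len)
            text = "".join(random.choice(features["digits"]) for _ in range(length))
            gap = features["mean_gap"] * random.uniform(1 - gap_jitter, 1 + gap_jitter)
            pseudo.append({"text": text, "digit_gap_px": gap})
        return pseudo

    labeled_set = [{"text": "2589", "digit_gap_px": 12.0}]
    pseudo_set = make_pseudo_set(extract_features(labeled_set), count=1000)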
S300: expanding the pseudo data set F based on a GAN neural network to form an extended data set G;
When the pseudo data set F is formed it still conforms to the original sample data, i.e., to the data features of the labeled data, so it is a lateral extension of the labeled data. However, when the display of data containing the same digits changes across devices - for example the background, the inter-character gap, or the font - the data in the pseudo data set F may not cover all cases. Therefore, on the basis of the pseudo data set F, the pseudo data set is expanded with a GAN neural network to guarantee the randomness and diversity of the data, and the expanded data form an extended data set G. The expansion may add new data features and recombine the digits on the basis of the added features.
Using the principle of the GAN neural network, the expansion of the pseudo data set F forms a fake sample set which, together with the labeled data set A as the real data set, is pitted against a module with a discriminative function.
The module with the discriminative function solves a supervised two-class classification problem: judging whether the input data belong to the labeled data set A (or the pseudo data set F) or to the extended data set G. During training, the labeled data set A and the extended data set G produced by the generative network are fed into the discriminative model in random order, and the discriminative model judges whether each item is real. Through this competitive machine-learning mechanism, the performance of the generative network and the discriminative network improves continuously, and training is complete when the parameters of the two models become stable.
After training, new extension data can be generated by the current generative network; these new extension data are closer to the data in the pseudo data set F and the labeled data set A.
S400: identifying whether the data in the extended data set G need to be labeled, and screening the labeled data to form a training data set T;
To further ensure the fit between the extension data in the extended data set G and the data in the labeled data set A, the training model obtained after GAN deep training is used to recognize and label the extension data, i.e., to clean the extension data. Since the cleaning operates on the extended data set G, the data in the resulting training data set T are a subset of the data of the extended data set G, i.e., the training data set T is a proper subset of the extended data set G.
S500: performing neural network training on the training data set T to form a training model;
The data in the cleaned training data set T have a certain cleanliness and a certain degree of fit with the data in the labeled data set A, and the training data set T contains, besides all the data in the labeled data set A, the extension data that conform to the data features of the labeled data set A, so its data volume exceeds that of the labeled data set A. It can be understood that as the data volumes of the extended data set G and the training data set T grow, more extension data are manufactured in steps S300-S400 and more data survive the screening. With more data, the training model formed by neural network training on the training data set T is richer: on one hand, the training model conforms to the data features of the data in the labeled data set A; on the other hand, it enriches the variations of the data under different backgrounds, fonts, spacings, and so on that the labeled data set A itself lacks. It is therefore a training model that is based on the labeled data set A yet superior to the labeled data set A.
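As an illustrative sketch of step S500 (the patent does not prescribe a network architecture or framework), a small PyTorch image classifier could be trained on the screened training data set T as follows; the layer sizes, the 1x32x32 input shape, the class count, and the hyperparameters are assumptions of this sketch.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # training_images: float tensor (N, 1, 32, 32); training_labels: long tensor (N,)
    # Both are assumed to come from the cleaned training data set T.
    def train_model(training_images, training_labels, num_classes=10, epochs=10):
        model = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 8 * 8, num_classes),
        )
        loader = DataLoader(TensorDataset(training_images, training_labels),
                            batch_size=64, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                optimizer.step()
        return model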
S600: cleaning the data in the full data set U outside the labeled data set A based on the training model, labeling the data that fit the training model, and adding them to the labeled data set A.
Once the training model has been obtained, the other data in the full data set U (the data outside the labeled data set A) are analyzed and discriminated according to the criteria by which the training model labels data, for example whether they conform to the digits, digit gaps, digit background, digit target, and other data features, or to variants of the digits, of the digit background, and so on. According to the discrimination result, the data that do not fit the training model are cleaned away, while the data that do fit the training model are retained and added to the labeled data set A. In this way, other data that match the original sample, i.e., the data in the labeled data set A, are picked out from the full data set U with high accuracy, which expands the data volume of the original sample. It can be understood that as the data volume of the original sample gradually grows, the amount of labeled data available for training also increases; after several iterations, the size of the labeled data and the maturity of the training model grow exponentially, so that subsequent discrimination of data of the same class can rely on big data and be accurate and efficient.
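Step S600 could then use the trained model to clean and label the remaining data, for example by keeping only items whose predicted class probability exceeds a confidence threshold. This is a minimal sketch; the threshold value is an assumption, since the patent does not fix one for this step.

    import torch

    @torch.no_grad()
    def clean_and_label(model, unlabeled_images, confidence=0.9):
        """S600 sketch: keep only items the training model is confident about, and label them."""
        model.eval()
        probs = torch.softmax(model(unlabeled_images), dim=1)
        conf, labels = probs.max(dim=1)
        keep = conf >= confidence            # data that "fit the training model"
        return unlabeled_images[keep], labels[keep]

    # The returned items and labels are then merged into the labeled data set A.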
In different embodiments, if the data volume of the new labeled data set A' obtained after step S600 is insufficient and more iterations are needed to accumulate a larger data volume, different steps are executed.
Embodiment one
Referring to Fig. 3, in a preferred embodiment, the labeled data generation method further comprises the following steps:
S700: judging whether the data volume of the labeled data set A' is greater than or equal to an expected data amount;
It is first necessary to judge whether the data volume of the currently obtained labeled data set A' and the quality of the training model satisfy the expected data amount or the expected quality, that is, whether the training model can label data accurately when the data in the full data set U are finally analyzed and discriminated. In a verification mode, experiments can be used to judge how accurately the training model labels data and how well the labeled data fit the training model.
S800: when the data volume of the labeled data set A is less than the expected data amount, taking the union of the training data set T and the labeled data set A, and executing steps S500-S600 again.
When it is determined that the data volume of the labeled data set A is less than the expected data amount, the data volume of the labeled data set A must be increased. In this case, without re-running the GAN neural network to improve the quality of the extended data set G, it suffices to increase the data volume of the training data set T obtained in step S400. Therefore, the union of the training data set T and the labeled data set A is taken to obtain a new training data set T', i.e., T' = T ∪ A. Based on the new training data set T', the iteration returns to step S500: neural network training is performed on the training data set again to form a new training model, and step S600 is then executed again to obtain a new labeled data set A'' after the iteration. If the conditions for model training are still not satisfied, steps S500-S600 are executed repeatedly until the finally obtained new labeled data set A'' satisfies the conditions for model training.
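The iteration of embodiment one (S700-S800) can be sketched as the loop below. This is only a structural illustration: the two helper functions are toy stand-ins for the S500 and S600 steps sketched earlier, and data items are represented as plain digit strings purely for brevity.

    def retrain_model(training_set):
        # Toy stand-in for S500: the "model" here is just the set of digits seen in training.
        return {ch for item in training_set for ch in item}

    def label_with_model(model_digits, candidates):
        # Toy stand-in for S600: keep candidates whose digits all fit the model.
        return {c for c in candidates if set(c) <= model_digits}

    def iterate_until_enough(labeled_set, training_set, unlabeled_rest, expected_amount):
        """Embodiment one (S700-S800): union T with A and redo S500-S600 until A is large enough."""
        while len(labeled_set) < expected_amount and unlabeled_rest:   # S700
            training_set = training_set | labeled_set                  # S800: T' = T ∪ A
            model = retrain_model(training_set)                        # S500
            newly_labeled = label_with_model(model, unlabeled_rest)    # S600
            if not newly_labeled:
                break                                                  # no progress: stop iterating
            labeled_set |= newly_labeled
            unlabeled_rest -= newly_labeled
        return labeled_set

    # e.g. iterate_until_enough({"2589"}, {"25", "89"}, {"58", "92", "13"}, expected_amount=4)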
Embodiment two
In another embodiment, the labeled data generation method further comprises the following steps:
S700: judging whether the data volume of the labeled data set A' is greater than or equal to an expected data amount;
It is first necessary to judge whether the data volume of the currently obtained labeled data set A' and the quality of the training model satisfy the expected data amount or the expected quality, that is, whether the training model can label data accurately when the data in the full data set U are finally analyzed and discriminated. In a verification mode, experiments can be used to judge how accurately the training model labels data and how well the labeled data fit the training model.
S800': when the data volume of the labeled data set A is less than the expected data amount, replacing the data in the pseudo data set F with the data in the labeled data set A, and executing steps S300-S600 again.
Unlike embodiment one, which trusts the result of the GAN neural network training, this embodiment considers it necessary to run the adversarial training of the GAN neural network again. Therefore, when, after step S600 has finished, the data volume of the labeled data set A' is judged to be less than the expected data amount, the data in the pseudo data set F are replaced with the data in the labeled data set A, so that the original sample of the GAN neural network is reduced to the data in the labeled data set A. Although the data volume decreases, the data in the labeled data set A are labeled data, so the extended data set G subsequently obtained from the GAN neural network also relies directly on the labeled data set A, and its quality is further improved.
After the data in the pseudo data set F have been replaced with the data in the labeled data set A, the aforementioned steps S300-S600 are executed in order to obtain the new labeled data set A'' after the iteration. If the new labeled data set A'' still does not satisfy the conditions for model training, the iteration again returns to S500 or S300, following embodiment one or embodiment two, as needed. Since the iteration may run more than once, the iteration modes of embodiment one and embodiment two can be combined freely, depending on how much the user trusts the quality of the pseudo data set F and the quality of the training data set T.
Referring to Fig. 4, in this embodiment, after steps S600 and S800 or S800' have been executed, the method further comprises:
S900: training other data sets outside the full data set based on the labeled data set A formed in step S600 and/or the training data set T.
In this step, according to the user's needs, if the chosen full data set U does not cover all states of an application program, or only covers the data of a certain period of time, other data sets exist besides the full data set U. In that case, the labeled data set A and/or the training data set T, which now have a data volume of a certain scale, or the training model, are used to train and label these other data sets. It can be understood that since the training data set T is derived from the labeled data set A, the data in the labeled data set A can be selected preferentially, and further data can optionally join the subsequent training according to their quality.
Referring to Fig. 5, the expansion of the pseudo data set F based on the GAN neural network is specifically carried out by the following steps:
S310: constructing a generative model and a discriminative model;
As described above, the module with a generative function is the generative model, which contains a generator network, and the module with a discriminative function is the discriminative model, which contains a discriminator network. The generative model and the discriminative model carry out adversarial training based on the principle of the GAN neural network, so as to simultaneously improve the data generation precision of the generative model and the data discrimination accuracy of the discriminative model.
S320: configuring the discriminative model to output a discrimination probability value greater than 0.5 for the data in the pseudo data set, and, based on the discrimination probability values for the data in the pseudo data set, deep-learning the output of discrimination probability values for the data outside the pseudo data set;
In the initial training, the objects fed to the discriminator are the data in the pseudo data set F, which the user has manufactured to expand the diversity of the data. The discriminator network judges each input, deciding whether it is real (from the pseudo data set F) or fake (extension data produced by the generator network), compares the judgment with the ground truth, and feeds the error back to the model parameters; this continuously iterated process lets the discriminator network learn the difference between real and fake data. By judging the labeled data, the discriminative model is trained on previously unseen data, so that even when the user does not know whether an input is real or fake, the model can estimate, based on the authenticity of earlier data, the probability that the input is real. That is, deep learning based on the discrimination probability values for the data in the pseudo data set F realizes the output of discrimination probability values for data outside the pseudo data set F, i.e., the expanded data, so that the data are discriminated as real or fake while being extended.
S330: the generative model generating a candidate extension data set based on the data in the pseudo data set F;
On the generative side, based on the data in the pseudo data set F, the digits, spacing, font, combination form, and so on of the data are deformed and recombined to generate the candidate extension data set.
S340: inputting the pseudo data set and the candidate extension data set into the discriminative model;
With the formation of the candidate extension data set, the labeled data set A has gradually grown into the pseudo data set F and the candidate extension data set; the pseudo data set F and the candidate extension data set are then input into the discriminative model, which, after discriminating each data item, outputs the probability value that the item is real.
S350: collecting the data for which the discriminative model outputs a discrimination probability value greater than 0.5 to form the extended data set.
Since the discriminative model trains its discrimination criterion on the data in the pseudo data set F, the data for which the discriminative model outputs a discrimination probability value greater than 0.5 (i.e., all the data in the pseudo data set F plus the part of the candidate extension data set that satisfies the discrimination criterion of the discriminative model) are collected, and their union is exactly the extended data set G.
The extended data set G obtained through the above steps both ensures similarity to the characteristics of the data in the labeled data set A and, at the same time, is a random expansion and diversification of the data of the pseudo data set F; that is, it expands the data volume of the labeled data set A in an orderly manner.
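For illustration, steps S310-S350 could be realized with a small PyTorch GAN along the following lines, assuming the pseudo data are grayscale images normalized to [-1, 1]; the network sizes, latent dimension, training schedule, and the number of generated candidates are assumptions of this sketch rather than requirements of the patent.

    import torch
    import torch.nn as nn

    latent_dim = 64

    # S310: construct a generative model and a discriminative model
    generator = nn.Sequential(
        nn.Linear(latent_dim, 256), nn.ReLU(),
        nn.Linear(256, 32 * 32), nn.Tanh(),      # outputs a flattened 32x32 image in [-1, 1]
    )
    discriminator = nn.Sequential(
        nn.Linear(32 * 32, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),         # discrimination probability value in (0, 1)
    )

    def expand_pseudo_set(pseudo_images, epochs=50, batch_size=64):
        """S320-S350: adversarial training on the pseudo data, then keep generated data with p > 0.5."""
        g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
        d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
        bce = nn.BCELoss()
        flat = pseudo_images.view(len(pseudo_images), -1)
        for _ in range(epochs):
            real = flat[torch.randint(len(flat), (batch_size,))]
            fake = generator(torch.randn(batch_size, latent_dim))
            # S320: push the discriminator's output above 0.5 for pseudo data, below 0.5 for generated data
            d_loss = bce(discriminator(real), torch.ones(batch_size, 1)) + \
                     bce(discriminator(fake.detach()), torch.zeros(batch_size, 1))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # the generator tries to make its output pass as pseudo data
            g_loss = bce(discriminator(fake), torch.ones(batch_size, 1))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        with torch.no_grad():
            # S330-S340: generate a candidate extension data set and feed it to the discriminator
            candidates = generator(torch.randn(4 * len(flat), latent_dim))
            keep = discriminator(candidates).squeeze(1) > 0.5   # S350: probability value > 0.5
        return torch.cat([flat, candidates[keep]])              # extended data set G (flattened images)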
Referring to Fig. 6, in this embodiment, step S400 is specifically carried out by the following steps:
S410: verifying the data in the extended data set G according to the labeled data set A and the data features;
With the data features of the labeled data set A as the existing model, the data in the newly generated extended data set G are recognized and labeled, to verify whether the data in the extended data set G should undergo the subsequent recognition operation.
A voting verification method can be used for the verification, i.e., a secondary recognition based on majority rule or unanimity, respecting the verification result that accounts for the majority or the whole among several arbiters based on the data features of the labeled data set A. That is, verifying the data in the extended data set G according to the data features of the labeled data set A (step S410) comprises: S411: verifying the data in the extended data set G using the data in the labeled data set A as models; S412: when more than half of or all of the models agree on the verification result for a data item, determining that the verification result is "recognized and labeled".
S420: extracting the data whose verification result is "recognized and labeled", and deleting from the extended data set the data whose verification result is not "recognized and labeled".
The verification results include "recognized and labeled" and "not recognized and labeled". Data whose result is "recognized and labeled" conform to the data features of the data in the labeled data set A and are retained in the extended data set G, while data whose verification result is "not recognized and labeled" are deleted, which completes the data cleaning.
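The voting verification of steps S410-S420 can be sketched as follows, with each item of the labeled data set A acting as one arbiter and a candidate kept when more than half of the arbiters accept it (S412); the per-arbiter acceptance test (digit subset plus a ±15% gap tolerance) is a toy criterion chosen for this sketch, not one specified by the patent.

    def arbiter_accepts(labeled_item, candidate, gap_tolerance=0.15):
        """One labeled item votes on whether a candidate conforms to its data features."""
        same_digits = set(candidate["text"]) <= set(labeled_item["text"])
        gap_ok = abs(candidate["digit_gap_px"] - labeled_item["digit_gap_px"]) \
                 <= gap_tolerance * labeled_item["digit_gap_px"]
        return same_digits and gap_ok

    def screen_extended_set(labeled_set, extended_set):
        """S410-S420: keep candidates that more than half of the arbiters recognize; delete the rest."""
        kept = []
        for candidate in extended_set:
            votes = sum(arbiter_accepts(item, candidate) for item in labeled_set)
            if votes * 2 > len(labeled_set):      # majority rule (S412)
                kept.append(candidate)
        return kept                               # this forms the training data set T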
The labeled data generation method in any of the above embodiments can be applied in a labeled data generation device, such as an intelligent terminal, a server, a workstation, a virtual server, or a virtual workstation. The labeled data generation device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the computer program implements the above labeled data generation method according to its pre-configured program.
The above labeled data generation method may alternatively be integrated in a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the labeled data generation method described above is implemented. The computer-readable storage medium may take the form of software, a virtual file, or the like.
Intelligent terminals may be implemented in various forms. For example, the terminals described in the present invention may include intelligent terminals such as mobile phones, smart phones, laptops, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and navigation devices, as well as fixed terminals such as digital TVs and desktop computers. In the following it is assumed that the terminal is an intelligent terminal. However, those skilled in the art will understand that, apart from elements used particularly for mobile purposes, the construction according to the embodiments of the present invention can also be applied to terminals of the fixed type.
It should be noted that the embodiments of the present invention are preferable implementations and do not limit the present invention in any form. Any person skilled in the art may use the technical content disclosed above to change or modify it into equivalent effective embodiments; however, any modification or equivalent variation and refinement of the above embodiments made in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (10)

1. A labeled data generation method, characterized by comprising the following steps:
S100: obtaining a full data set and a labeled data set of already-labeled data contained in the full data set;
S200: analyzing the data features of the labeled data set, and manufacturing, according to the data features, a pseudo data set that conforms to the data features;
S300: expanding the pseudo data set based on a GAN neural network to form an extended data set;
S400: identifying whether the data in the extended data set need to be labeled, and screening the labeled data to form a training data set;
S500: performing neural network training on the training data set to form a training model;
S600: cleaning the data in the full data set outside the labeled data set based on the training model, labeling the data that fit the training model, and adding them to the labeled data set.
2. The labeled data generation method according to claim 1, characterized in that
the labeled data generation method further comprises the following steps:
S700: judging whether the data volume of the labeled data set is greater than or equal to an expected data amount;
S800: when the data volume of the labeled data set is less than the expected data amount, taking the union of the training data set and the labeled data set, and executing steps S500-S600 again.
3. The labeled data generation method according to claim 2, characterized in that
step S800 is replaced by:
S800': when the data volume of the labeled data set is less than the expected data amount, replacing the data in the pseudo data set with the data in the labeled data set, and executing steps S300-S600 again.
4. The labeled data generation method according to claim 1, characterized in that
the labeled data generation method further comprises the following steps:
S900: training other data sets outside the full data set based on the labeled data set formed in step S600 and/or the training data set.
5. The labeled data generation method according to claim 1, characterized in that
expanding the pseudo data set based on the GAN neural network to form the extended data set (step S300) comprises:
S310: constructing a generative model and a discriminative model;
S320: configuring the discriminative model to output a discrimination probability value greater than 0.5 for the data in the pseudo data set, and, based on the discrimination probability values for the data in the pseudo data set, deep-learning the output of discrimination probability values for the data outside the pseudo data set;
S330: the generative model generating a candidate extension data set based on the data in the pseudo data set;
S340: inputting the pseudo data set and the candidate extension data set into the discriminative model;
S350: collecting the data for which the discriminative model outputs a discrimination probability value greater than 0.5 to form the extended data set.
6. The labeled data generation method according to claim 1, characterized in that
identifying whether the data in the extended data set need to be labeled and screening the labeled data to form the training data set (step S400) comprises:
S410: verifying the data in the extended data set according to the labeled data set and the data features;
S420: extracting the data whose verification result is "recognized and labeled", and deleting from the extended data set the data whose verification result is not "recognized and labeled".
7. The labeled data generation method according to claim 6, characterized in that
verifying the data in the extended data set according to the labeled data set and the data features (step S410) comprises:
S411: verifying the data in the extended data set using the data in the labeled data set as models;
S412: when more than half of or all of the models agree on the verification result for a data item, determining that the verification result is "recognized and labeled".
8. The labeled data generation method according to claim 1, characterized in that
the data features include one or more of: the background of the data, the number of units in the data, the gaps between digits in the data, the target of the data, and the noise of the data.
9. A labeled data generation device, the labeled data generation device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that when the processor executes the computer program, the labeled data generation method according to any one of claims 1-8 is implemented.
10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the labeled data generation method according to any one of claims 1-8 is implemented.
CN201810609646.8A 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium Active CN108960409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810609646.8A CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810609646.8A CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108960409A true CN108960409A (en) 2018-12-07
CN108960409B CN108960409B (en) 2021-08-03

Family

ID=64488602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810609646.8A Active CN108960409B (en) 2018-06-13 2018-06-13 Method and device for generating annotation data and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN108960409B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816019A (en) * 2019-01-25 2019-05-28 上海小萌科技有限公司 A kind of image data automation auxiliary mask method
CN109978029A (en) * 2019-03-13 2019-07-05 北京邮电大学 A kind of invalid image pattern screening technique based on convolutional neural networks
CN110189351A (en) * 2019-04-16 2019-08-30 浙江大学城市学院 A kind of scratch image data amplification method based on production confrontation network
CN110569379A (en) * 2019-08-05 2019-12-13 广州市巴图鲁信息科技有限公司 Method for manufacturing picture data set of automobile parts
CN110874484A (en) * 2019-10-16 2020-03-10 众安信息技术服务有限公司 Data processing method and system based on neural network and federal learning
CN111143617A (en) * 2019-12-12 2020-05-12 浙江大学 Automatic generation method and system for picture or video text description
CN111177132A (en) * 2019-12-20 2020-05-19 中国平安人寿保险股份有限公司 Label cleaning method, device, equipment and storage medium for relational data
CN111382785A (en) * 2020-03-04 2020-07-07 武汉精立电子技术有限公司 GAN network model and method for realizing automatic cleaning and auxiliary marking of sample
CN111476324A (en) * 2020-06-28 2020-07-31 平安国际智慧城市科技股份有限公司 Traffic data labeling method, device, equipment and medium based on artificial intelligence
CN111741018A (en) * 2020-07-24 2020-10-02 中国航空油料集团有限公司 Industrial control data attack sample generation method and system, electronic device and storage medium
CN112257731A (en) * 2019-07-05 2021-01-22 杭州海康威视数字技术股份有限公司 Virtual data set generation method and device
CN112308167A (en) * 2020-11-09 2021-02-02 上海风秩科技有限公司 Data generation method and device, storage medium and electronic equipment
CN112508000A (en) * 2020-11-26 2021-03-16 上海展湾信息科技有限公司 Method and equipment for generating OCR image recognition model training data
CN112580310A (en) * 2020-12-28 2021-03-30 河北省讯飞人工智能研究院 Missing character/word completion method and electronic equipment
WO2021084471A1 (en) * 2019-10-31 2021-05-06 International Business Machines Corporation Artificial intelligence transparency
CN113239205A (en) * 2021-06-10 2021-08-10 阳光保险集团股份有限公司 Data annotation method and device, electronic equipment and computer readable storage medium
CN114926709A (en) * 2022-05-26 2022-08-19 成都极米科技股份有限公司 Data labeling method and device and electronic equipment
CN116451087A (en) * 2022-12-20 2023-07-18 石家庄七彩联创光电科技有限公司 Character matching method, device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392125A (en) * 2017-07-11 2017-11-24 中国科学院上海高等研究院 Training method/system, computer-readable recording medium and the terminal of model of mind
CN107622056A (en) * 2016-07-13 2018-01-23 百度在线网络技术(北京)有限公司 The generation method and device of training sample
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622056A (en) * 2016-07-13 2018-01-23 百度在线网络技术(北京)有限公司 The generation method and device of training sample
CN107392125A (en) * 2017-07-11 2017-11-24 中国科学院上海高等研究院 Training method/system, computer-readable recording medium and the terminal of model of mind
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Kunfeng, "Research Progress and Prospects of Generative Adversarial Networks (GAN)", Acta Automatica Sinica *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816019A (en) * 2019-01-25 2019-05-28 上海小萌科技有限公司 A kind of image data automation auxiliary mask method
CN109978029A (en) * 2019-03-13 2019-07-05 北京邮电大学 A kind of invalid image pattern screening technique based on convolutional neural networks
CN109978029B (en) * 2019-03-13 2021-02-09 北京邮电大学 Invalid image sample screening method based on convolutional neural network
CN110189351A (en) * 2019-04-16 2019-08-30 浙江大学城市学院 A kind of scratch image data amplification method based on production confrontation network
CN112257731A (en) * 2019-07-05 2021-01-22 杭州海康威视数字技术股份有限公司 Virtual data set generation method and device
CN110569379A (en) * 2019-08-05 2019-12-13 广州市巴图鲁信息科技有限公司 Method for manufacturing picture data set of automobile parts
CN110874484A (en) * 2019-10-16 2020-03-10 众安信息技术服务有限公司 Data processing method and system based on neural network and federal learning
JP7461699B2 (en) 2019-10-31 2024-04-04 インターナショナル・ビジネス・マシーンズ・コーポレーション Artificial Intelligence Transparency
US11651276B2 (en) 2019-10-31 2023-05-16 International Business Machines Corporation Artificial intelligence transparency
WO2021084471A1 (en) * 2019-10-31 2021-05-06 International Business Machines Corporation Artificial intelligence transparency
CN111143617A (en) * 2019-12-12 2020-05-12 浙江大学 Automatic generation method and system for picture or video text description
CN111177132A (en) * 2019-12-20 2020-05-19 中国平安人寿保险股份有限公司 Label cleaning method, device, equipment and storage medium for relational data
CN111382785A (en) * 2020-03-04 2020-07-07 武汉精立电子技术有限公司 GAN network model and method for realizing automatic cleaning and auxiliary marking of sample
CN111382785B (en) * 2020-03-04 2023-09-01 武汉精立电子技术有限公司 GAN network model and method for realizing automatic cleaning and auxiliary marking of samples
CN111476324B (en) * 2020-06-28 2020-10-02 平安国际智慧城市科技股份有限公司 Traffic data labeling method, device, equipment and medium based on artificial intelligence
CN111476324A (en) * 2020-06-28 2020-07-31 平安国际智慧城市科技股份有限公司 Traffic data labeling method, device, equipment and medium based on artificial intelligence
CN111741018B (en) * 2020-07-24 2020-12-01 中国航空油料集团有限公司 Industrial control data attack sample generation method and system, electronic device and storage medium
CN111741018A (en) * 2020-07-24 2020-10-02 中国航空油料集团有限公司 Industrial control data attack sample generation method and system, electronic device and storage medium
CN112308167A (en) * 2020-11-09 2021-02-02 上海风秩科技有限公司 Data generation method and device, storage medium and electronic equipment
CN112508000A (en) * 2020-11-26 2021-03-16 上海展湾信息科技有限公司 Method and equipment for generating OCR image recognition model training data
CN112508000B (en) * 2020-11-26 2023-04-07 上海展湾信息科技有限公司 Method and equipment for generating OCR image recognition model training data
CN112580310A (en) * 2020-12-28 2021-03-30 河北省讯飞人工智能研究院 Missing character/word completion method and electronic equipment
CN112580310B (en) * 2020-12-28 2023-04-18 河北省讯飞人工智能研究院 Missing character/word completion method and electronic equipment
CN113239205B (en) * 2021-06-10 2023-09-01 阳光保险集团股份有限公司 Data labeling method, device, electronic equipment and computer readable storage medium
CN113239205A (en) * 2021-06-10 2021-08-10 阳光保险集团股份有限公司 Data annotation method and device, electronic equipment and computer readable storage medium
CN114926709A (en) * 2022-05-26 2022-08-19 成都极米科技股份有限公司 Data labeling method and device and electronic equipment
CN116451087A (en) * 2022-12-20 2023-07-18 石家庄七彩联创光电科技有限公司 Character matching method, device, terminal and storage medium
CN116451087B (en) * 2022-12-20 2023-12-26 石家庄七彩联创光电科技有限公司 Character matching method, device, terminal and storage medium

Also Published As

Publication number Publication date
CN108960409B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN108960409A (en) Labeled data generation method, equipment and computer readable storage medium
CN109145939B (en) Semantic segmentation method for small-target sensitive dual-channel convolutional neural network
CN108074244B (en) Safe city traffic flow statistical method integrating deep learning and background difference method
CN109948425A (en) A kind of perception of structure is from paying attention to and online example polymerize matched pedestrian's searching method and device
CN112199608B (en) Social media rumor detection method based on network information propagation graph modeling
CN107291688A (en) Judgement document's similarity analysis method based on topic model
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN107945153A (en) A kind of road surface crack detection method based on deep learning
CN108960499A (en) A kind of Fashion trend predicting system merging vision and non-vision feature
CN102262642B (en) Web image search engine and realizing method thereof
CN105893483A (en) Construction method of general framework of big data mining process model
CN109086794B (en) Driving behavior pattern recognition method based on T-LDA topic model
WO2022062419A1 (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN111210111B (en) Urban environment assessment method and system based on online learning and crowdsourcing data analysis
CN109002492A (en) A kind of point prediction technique based on LightGBM
CN109063649A (en) Pedestrian's recognition methods again of residual error network is aligned based on twin pedestrian
CN108764282A (en) A kind of Class increment Activity recognition method and system
CN109413023A (en) The training of machine recognition model and machine identification method, device, electronic equipment
CN106960017A (en) E-book is classified and its training method, device and equipment
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN108647800A (en) A kind of online social network user missing attribute forecast method based on node insertion
CN112528934A (en) Improved YOLOv3 traffic sign detection method based on multi-scale feature layer
CN109949174A (en) A kind of isomery social network user entity anchor chain connects recognition methods
CN112733602B (en) Relation-guided pedestrian attribute identification method
CN108229567A (en) Driver identity recognition methods and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant