CN108960409B - Method and device for generating annotation data and computer-readable storage medium - Google Patents
- Publication number
- Publication number: CN108960409B (application CN201810609646.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- labeled
- training
- model
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a method and device for generating annotation data, and a computer-readable storage medium. The method for generating the annotation data comprises the following steps: S100: acquiring a data corpus and a labeled data set that is contained in the corpus and has already been labeled; S200: analyzing the data characteristics of the labeled data set and fabricating a pseudo data set that conforms to those characteristics; S300: expanding the pseudo data set based on a GAN neural network to form an expanded data set; S400: identifying whether the data in the expanded data set needs to be labeled, and screening the labeled data to form a training data set; S500: performing neural network training on the training data set to form a training model; S600: cleaning the data in the corpus outside the labeled data set based on the training model, labeling the data that conforms to the model, and adding it to the labeled data set. In this way, a training data set that closely matches the sample data and has strong randomness can be generated quickly and efficiently from a small amount of data, enlarging the volume of labeled data.
Description
Technical Field
The present invention relates to the field of data models, and in particular, to a method and an apparatus for generating labeled data, and a computer-readable storage medium.
Background
With the rapid development of applications on intelligent terminals and of the artificial intelligence technologies built upon them, these technologies have entered people's lives ever more widely. Whether in daily use, games, work, or other fields, learning from original sample data is needed to understand the usage habits in those fields and make intelligent judgments.
Deep neural network techniques may be employed to learn from the original sample data. Deep neural networks have developed rapidly in recent years, achieving accuracy far exceeding expectations in image recognition and finding wide application in many fields. In practical engineering applications, however, many specialized image recognition tasks lack a data set available for training, while the accuracy of a deep neural network model depends heavily on the size and quality of its data set. To address the shortage of training data, the prior art generally applies random cropping, rotation, stretching, and flipping to the existing labeled data, but this approach has the following defects:
1. the original image data for some models is small in height and width, so the amount of data that random cropping can produce is limited;
2. when the original sample data is scarce, the data obtained by these methods tends to overfit the model because its features are not sufficiently dispersed;
3. some models are sensitive to stretched data, and their recognition rate drops markedly after stretching;
4. manually collecting and labeling data consumes a great deal of labor and effort.
Therefore, a new method for generating labeled data is needed, one that can quickly generate a large training data set with strong randomness from a small amount of labeled sample data, simplifying the work of collecting and labeling training data.
Disclosure of Invention
To overcome the above technical drawbacks, an object of the present invention is to provide a method, a device, and a computer-readable storage medium for generating labeled data that can quickly and efficiently generate, from a small amount of data, a training data set that closely matches the sample data and has strong randomness, thereby enlarging the volume of labeled data.
The invention discloses a method for generating labeled data, comprising the following steps:
S100: acquiring a data corpus and a labeled data set that is contained in the corpus and has already been labeled;
S200: analyzing the data characteristics of the labeled data set and fabricating a pseudo data set that conforms to those characteristics;
S300: expanding the pseudo data set based on a GAN neural network to form an expanded data set;
S400: identifying whether the data in the expanded data set needs to be labeled, and screening the labeled data to form a training data set;
S500: performing neural network training on the training data set to form a training model;
S600: cleaning the data in the corpus outside the labeled data set based on the training model, labeling the data that conforms to the training model, and adding it to the labeled data set.
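The steps S100-S600 above can be sketched end to end as follows. This is an illustrative sketch only, not the patent's implementation: every helper (the digit-based feature analysis, the trivial membership "model", and so on) is an invented stand-in chosen so the flow runs.

```python
def analyze_features(labeled):
    # S200 (assumed feature): the set of digit units seen in the samples
    return set("".join(labeled))

def make_pseudo_set(features):
    # S200: recombine observed units into new two-unit pseudo data
    return {a + b for a in features for b in features}

def expand_with_gan(pseudo):
    # S300: placeholder for the GAN-based expansion
    return set(pseudo)

def screen_for_labels(expanded, features):
    # S400: keep only data built entirely from known units
    return {s for s in expanded if set(s) <= features}

def train_model(training):
    # S500: trivial stand-in "model" accepting members of the training set
    return lambda item: item in training

def generate_labeled_data(corpus, labeled):
    features = analyze_features(labeled)
    model = train_model(screen_for_labels(
        expand_with_gan(make_pseudo_set(features)), features))
    for item in set(corpus) - set(labeled):   # S600: label the rest of U
        if model(item):
            labeled.add(item)
    return labeled
```

With labeled samples {"25", "89"}, this sketch labels "58" and "92" from the corpus while rejecting non-digit noise such as "xx".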
Preferably, the annotation data generation method further comprises the following steps:
S700: judging whether the data volume in the labeled data set is greater than or equal to an expected data volume;
S800: when the data volume in the labeled data set is smaller than the expected data volume, taking the union of the training data set and the labeled data set, and executing steps S500-S600 again.
Preferably, the step S800 is replaced by:
S800': when the data volume in the labeled data set is smaller than the expected data volume, replacing the data in the pseudo data set with the data in the labeled data set, and executing steps S300-S600 again.
Preferably, the annotation data generation method further comprises the following steps:
S900: training other data sets besides the data corpus based on the labeled data set and/or the training data set formed in step S600.
Preferably, the step S300 of expanding the pseudo data set based on the GAN neural network to form an expanded data set comprises:
S310: constructing a generation model and a discrimination model;
S320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and deep-learning, on the basis of those values, the output of discrimination probability values for data outside the pseudo data set;
S330: the generation model generating a data set to be expanded based on the data in the pseudo data set;
S340: the generation model inputting the pseudo data set and the data set to be expanded into the discrimination model;
S350: collecting the data for which the discrimination model outputs a discrimination probability value greater than 0.5, so as to form the expanded data set.
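The thresholding of steps S340-S350 can be sketched directly; the stub discriminator below is an invented stand-in for the trained discrimination network, kept trivial so the filter is runnable.

```python
def filter_by_discriminator(candidates, discriminator, threshold=0.5):
    # S350: collect the data whose output discrimination probability
    # value exceeds the 0.5 threshold
    return [x for x in candidates if discriminator(x) > threshold]

def stub_discriminator(s):
    # Invented stand-in: digit-only strings look "real" to this stub
    return 1.0 if s.isdigit() else 0.1

expanded_set = filter_by_discriminator(["25", "2x", "589"], stub_discriminator)
# expanded_set is ["25", "589"]
```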
Preferably, the step S400 of identifying whether the data in the expanded data set needs to be labeled, and screening the labeled data to form a training data set, comprises:
S410: verifying the data in the expanded data set according to the labeled data set and the data characteristics;
S420: extracting the data whose verification result is the identification label, and deleting from the expanded data set the data whose verification result is not the identification label.
Preferably, the step S410 of verifying the data in the expanded data set according to the labeled data set and the data characteristics comprises:
S411: verifying the data in the expanded data set against the data in the labeled data set taken as a model;
S412: when the data is verified as consistent by more than half of the levels, or by all levels, in the model, judging the verification result to be the identification label.
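Steps S411-S412 amount to a majority vote across verification levels. A hypothetical sketch follows; the three "levels" are invented for illustration, since the patent does not enumerate them.

```python
def verify(datum, known_digits):
    # S412: pass when more than half of the verification levels agree
    checks = [
        set(datum) <= known_digits,   # level 1: only units seen in set A
        1 <= len(datum) <= 6,         # level 2: plausible unit count
        datum.isdigit(),              # level 3: free of noise characters
    ]
    return sum(checks) * 2 > len(checks)
```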
Preferably, the data characteristics comprise one or more of: the background of the data, the number of units in the data, the gaps between digits in the data, the target of the data, and the noise in the data.
The invention also discloses a labeled data generating device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, the labeled data generation method described above is implemented.
The present invention also discloses a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the annotation data generation method described above.
After adopting the above technical scheme, the invention has the following beneficial effects compared with the prior art:
1. even if the amount of sample data is small, a data set containing a large amount of annotation data can be generated quickly;
2. the data has strong randomness, overfitting is unlikely to occur, and the quality of the labeled data set is improved;
3. the model generated from the pseudo data set is used to identify and label other data, expanding the size and richness of the labeled data set; this process can iterate forward in a loop, improving the training speed and accuracy of the deep neural network model.
Drawings
FIG. 1 is a flow chart illustrating a method for generating annotation data in accordance with a preferred embodiment of the present invention;
FIG. 2 is data of a pseudo data set in accordance with a preferred embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for generating annotation data in accordance with a further preferred embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating a method for generating annotation data in accordance with a further preferred embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S300 of the annotation data generation method according to a preferred embodiment of the invention;
FIG. 6 is a flowchart illustrating the step S400 of the annotation data generation method according to a preferred embodiment of the invention.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination".
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "connected," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
Fig. 1 is a schematic flow chart of a method for generating annotation data according to a preferred embodiment of the present invention. In this method, a data model is established and analyzed for data-rich phenomena within an application program, such as everyday behavior, financial services, and social representations. Other data is then labeled according to whether it conforms to the rules of the collected data model, and the method can further extend from judging data to generating data, serving artificial-intelligence purposes such as advance prediction and data processing. All the labeled and unlabeled data in a given field form a data corpus U. The sample data, i.e. the data already labeled, is the data the user has obtained from the corpus U whose labeling result is determined; this data forms a labeled data set A. In this embodiment, a recognition model is established based on the corpus U and the labeled data set A, and the unlabeled data of U is labeled accurately. Specifically, the method comprises the following steps:
S100: acquiring a data corpus U and a labeled data set A that is contained in U and has already been labeled;
the data corpus U containing all data may be a data set formed by collecting all data included in an application program, such as numbers, letters, characters, and the like, or a data set formed by a purchaser and a user after collecting data such as activity, active time, online time, active duration, and the like. In some embodiments, the data volume is too large, and the data contained in the data complete set U is substantially equivalent to a data set formed by combining all numbers, letters, characters, pictures and the like. Regardless of the data volume of the data corpus U, the data with parts marked by the user, that is, the data with the authenticity determined as true, is included in the data corpus U, and these marked data are used as sample data to form a marked data set a which is really contained in the data corpus U. The acquisition of the data full set U and the annotation data set A is taken as the basic operation of the subsequent steps.
S200: analyzing the data characteristics of the marked data set A, and manufacturing a pseudo data set F according with the data characteristics according to the data characteristics;
the annotated dataset a will then be analyzed. Specifically, since the user currently has only the data in the labeled data set a with the determined authenticity of the data, the data characteristics of the labeled data set, including but not limited to one or more of the background of the data in the labeled data set a, the unit number of the data, the number interval of the data, the target of the data, and the noise of the data, need to be analyzed first based on the labeled data set a as an extended basis. As shown in fig. 2, in an embodiment, the data in the tagged data set a obtained in an application, which is the number "2589", may be analyzed to include data characteristics including: the data including numbers 2, 5, 8 and 9, gray scale and brightness of the background picture when representing the numbers, font of the numbers, and interval of the number keys, etc. after analyzing one by one, each data in the labeled data set a will have its own data characteristics.
Based on the data characteristics obtained from the above analysis, other data that fit these data characteristics, but are different from the data in the annotation data set a, will then be produced. It is desirable that the numbers used are the numbers (e.g., any one or more of 0-9) of the data in the annotation data set a, that the gaps between the numbers satisfy the gaps (preferably, they may not be completely equal, and are established with a ± error of 10% -15%) of the numbers of the data in the annotation data set a, and so on. As shown in FIG. 2, dummy data such as "25", "52", "89" can be produced based on "2589" data within the annotation data set. Since the dummy data set is manufactured in accordance with the data characteristics of the data in the label data, it may be a recombination of the respective units of the data in the label data, and when the units are combined, the manufacturing conditions may be arbitrarily set in accordance with the required dummy data amount. It will be appreciated that if a large amount of dummy data is required, only the data features corresponding to the digits of the annotation data may be extracted, while the other data features are arbitrarily chosen to form a large amount of dummy data, and vice versa.
Since the dummy data in the dummy data set F is a combination of the elements of the data in the annotation data, the dummy data in the dummy data set F may be data that is not identical to any of the data in the annotation data. For example, the different data extracted in the annotation data set a includes the numbers 1 and 6, but the numbers 1 and 6 do not constitute the data 16 or 61 separately. However, the dummy data set F may include any arbitrary combination of the number 1 and the number 6, such as 16, 61, 116, 661, etc. Therefore, when the data in the labeled data set appears in a number unit of 0-9, the pseudo data set can be used for manufacturing any real number based on the data characteristics, and the data quantity of the pseudo data set F is greatly enlarged.
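The recombination described above (turning the labeled datum "2589" into pseudo data such as "25", "52", or repeated-unit strings) can be sketched as follows; the function name and the fixed combination length are illustrative choices, not part of the patent.

```python
from itertools import product

def make_pseudo_data(sample, length=2):
    units = sorted(set(sample))          # distinct units, e.g. 2, 5, 8, 9
    combos = {"".join(p) for p in product(units, repeat=length)}
    return combos - {sample}             # keep only data absent from set A
```

For the datum "2589" this yields sixteen two-digit strings, including "25", "52", "89", and repeated-unit data such as "22".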
S300: expanding the pseudo data set F based on a GAN neural network to form an expanded data set G;
the pseudo data set F is a lateral extension of the annotation data because it still conforms to the original sample data, i.e. conforms to the data characteristics of the annotation data, when it is formed, but the data in the pseudo data set F may not be completely covered when the display of the background, character gaps, and fonts of the data based on the same number in the application program changes on different devices, for example. Therefore, on the basis of the dummy data set F, the dummy data set is expanded based on a GAN neural network to ensure the randomness and diversity of data, and the expanded data set forms an expanded data set G. For data expansion, original data characteristics can be added, and the numbers are recombined on the basis of the added data characteristics.
By utilizing the principle of the GAN neural network, a false sample set is formed by expanding the false data set F, and meanwhile, the false sample set is confronted with a module with a distinguishing function on the basis of marking the true data set A.
The module with discrimination solves a supervised binary problem for determining whether the input data is a label data set a (pseudo data set F) or an extended data set G. In the training process, the label data set A and the expanded data set G generated by the generation network are randomly input into the discrimination model, and whether the discrimination model is true or not is judged by using the discrimination model. And the performance of generating the network and judging the network is continuously improved through a competitive machine learning mechanism, and when the parameters of the two models are stable, the training is finished.
After training, new extended data can be generated according to the current new generation network, and the part of new extended data is closer to the data in the pseudo data set F and the labeled data set A.
S400: identifying whether the data in the expanded data set G needs to be labeled, and screening the labeled data to form a training data set T;
To further ensure that the expanded data in G fits the data in the labeled data set A, the expanded data is identified and labeled by a training model obtained through deep training of the GAN neural network, thereby cleaning the expanded data. Since this cleansing acts on the expanded data set G, the data in the training data set T is part of G, i.e. the training data set T is truly contained in G.
S500: training the training data set T by a neural network to form a training model;
The cleaned data in the training data set T has a certain cleanliness and degree of fit with the data in the labeled data set A. Besides all the data of A, T contains data that conforms to A's data characteristics, so its data volume is larger than that of A. Understandably, the more data is fabricated and screened in steps S300-S400, the larger the expanded data set G and the training data set T become, and the fuller the training model formed when T undergoes neural network training. The resulting model on the one hand conforms to the data characteristics of the data in A, and on the other hand enriches the characteristics A lacks, covering the deformations of A's data under different backgrounds, fonts, spacings, and so on; it is based on A yet superior to a model trained on A alone.
S600: cleaning the data in the data corpus U outside the labeled data set A based on the training model, labeling the data that conforms to the training model, and adding it to A.
Once the training model is available, the other data in the corpus U (data outside the labeled data set A) is analyzed and judged against the labeling criteria of the training model, such as whether it matches the digits, digit gaps, digit backgrounds, and digit targets in the data characteristics, or variations of those digits and backgrounds. According to this judgment, data that does not fit the model is cleaned away, while data that fits is retained and added to A. In this way, other data matching the original sample, i.e. the data in A, is selected from the corpus U with high accuracy, enlarging the original sample. Understandably, as the data volume of the original sample gradually grows, the amount of labeled data available for training increases; after multiple iterations, the magnitude of the labeled data and the maturity of the training model grow exponentially, so subsequent discrimination of the same kind of data rests on big data and is accurate and efficient.
In different embodiments, if the data volume of the new labeled data set A' obtained after step S600 is insufficient and more data must be accumulated through repeated iteration, different steps are executed, as follows.
Example one
Referring to fig. 3, in a preferred embodiment, the method for generating the annotation data further includes the following steps:
S700: judging whether the data volume in the labeled data set A' is greater than or equal to an expected data volume;
First, it is necessary to judge whether the data volume of the currently obtained labeled data set A' and the quality of the training model meet the expected data volume or expected quality, that is, whether the training model can label data accurately when analyzing and judging the data in the corpus U. Verification may be experimental: testing, under the training model, how well the produced labels fit the labels the data should carry.
S800: when the data volume in the labeled data set A is smaller than the expected data volume, taking the union of the training data set T and the labeled data set A, and executing steps S500-S600 again.
When the data volume in the labeled data set A is judged smaller than the expected volume, it needs to be increased. For this it suffices to enlarge the training data set T obtained in step S400, without redoing the GAN-based quality improvement of the expanded data set G. Therefore the training data set T and the labeled data set A are merged into a new training data set T', i.e. T' = T ∪ A. The method then iterates back to step S500, performs neural network training on the new training data set to form a new training model, and executes step S600 again to obtain the iterated labeled data set A'. If the conditions for model training are still not met, steps S500-S600 are repeated until the resulting labeled data set A' satisfies them.
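The iteration of this embodiment can be sketched as a loop. The helper train_and_label is an illustrative stand-in for steps S500-S600, and the round cap is an invented safeguard.

```python
def iterate_until_enough(training, labeled, corpus, expected,
                         train_and_label, max_rounds=10):
    for _ in range(max_rounds):
        if len(labeled) >= expected:                 # S700: enough data?
            break
        training = training | labeled                # S800: T' = T u A
        labeled = train_and_label(training, labeled, corpus)  # S500-S600
    return labeled
```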
Example two
In another embodiment, the annotation data generation method further comprises the steps of:
S700: judging whether the data volume in the labeled data set A' is greater than or equal to an expected data volume;
First, it is necessary to judge whether the data volume of the currently obtained labeled data set A' and the quality of the training model meet the expected data volume or expected quality, that is, whether the training model can label data accurately when analyzing and judging the data in the corpus U. Verification may be experimental: testing, under the training model, how well the produced labels fit the labels the data should carry.
S800': when the data volume in the labeled data set A is smaller than the expected data volume, replacing the data in the pseudo data set F with the data in the labeled data set A, and executing steps S300-S600 again.
Unlike the first embodiment, which trusts the result of the GAN neural network training, this embodiment judges that the adversarial experiment of the GAN neural network must be run again. Therefore, after step S600 is completed, when the data volume in the labeled data set A' is considered smaller than the expected volume, the data in the pseudo data set F is replaced with the data in A, reducing the original sample of the GAN neural network to the data in A.
After replacing the data in the pseudo data set F with the data in A, steps S300-S600 are executed in order to obtain the iterated labeled data set A''. If the new labeled data set A'' still fails to satisfy the conditions for model training, the method can iterate back to S500 or S300 as needed, following the first or second embodiment. Since more than one iteration may be required, the iteration styles of the two embodiments can be combined freely according to the user's confidence in the quality of the pseudo data set F and of the training data set T.
Referring to fig. 4, in this embodiment, after step S600, S800, or S800' is executed, the method further comprises:
S900: training other data sets besides the data corpus U based on the labeled data set A and/or the training data set T formed in step S600.
In this step, other data sets may exist besides the corpus U, for example because U was selected according to the user's needs rather than covering the data of all states of the application program, or covers only a certain time period. In that case, the labeled data set A and/or the training data set T, which already holds data at a certain scale, or the training model itself, can be used to train and label those other data sets. Understandably, since the training data set T derives from the labeled data set A, the data in A can be chosen preferentially, with T optionally participating in subsequent training according to the quality of its data.
Referring to fig. 5, expanding the pseudo data set F based on the GAN neural network specifically comprises the following steps:
S310: constructing a generation model and a discrimination model;
as described above, the module having a generating function, which is a generating model, has a generating network, the module having a discriminating function, which is a discriminating model, has a discriminating network. The generative model and the discriminant model are subjected to antagonistic training based on the principle of the GAN neural network, so that the data generation precision of the generative model and the data discrimination accuracy of the discriminant model are improved simultaneously.
S320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and learning, based on the discrimination probability values of the data in the pseudo data set, to output discrimination probability values for data in non-pseudo data sets;
In the initial training, the discrimination target input is the data in the pseudo data set F, which is data manufactured by the user to expand the diversity of the data. The discrimination network judges whether an input is true (from the pseudo data set F) or false (expanded data generated by the generation network), compares the judgment with the ground truth, feeds the result back into the model parameters, and iterates this process continuously, so that the discrimination network learns the difference between true and false data. Through judging the labeled data, the discrimination model learns a prototype of the data, so that when an input of unknown authenticity arrives, the likelihood of its authenticity is assessed based on the authenticity of the preceding data. That is, based on deep learning of the discrimination probability values of the data in the pseudo data set F, the model outputs discrimination probability values for data outside the pseudo data set F (i.e., the expanded data), judging authenticity while the data is being expanded.
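The judgment loop described above (score the input, compare the judgment with the ground truth, feed the error back into the parameters) can be sketched with a minimal one-feature logistic discriminator. This is an assumption-laden illustration: a real discrimination network is far larger, and every name here is illustrative only, not taken from this disclosure.

```python
import math
import random

# Hedged sketch of the S320 judgment loop with a single-feature
# logistic unit. Assumption: real inputs would be feature vectors
# fed to a full discrimination network.
def train_discriminator(true_samples, fake_samples, epochs=200, lr=0.5):
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in true_samples] + [(x, 0.0) for x in fake_samples]
    for _ in range(epochs):
        random.shuffle(data)
        for x, label in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # discrimination probability
            w += lr * (label - p) * x                 # compare with ground truth and
            b += lr * (label - p)                     # feed back into the parameters
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```

After training, samples resembling the pseudo data set F score above 0.5 and generated outliers score below it, which is the behavior S320 configures.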
S330: the generation model generates a data set to be expanded based on the data in the pseudo data set F;
On the generation model side, based on the data in the pseudo data set F, the digits, the gaps, the fonts, the combination forms and the like in the data are deformed and recombined to generate the data set to be expanded.
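The deform-and-recombine idea of S330 can be sketched as re-sampling the digits and gap characters observed in the pseudo data set F into new candidate strings, so that every unit of a generated datum comes from an existing datum. All names below are hypothetical; this is a sketch, not the disclosed generation network.

```python
import random

# Hedged sketch of S330: recombine observed digits and gap characters
# from the pseudo data set F into new candidate data.
def recombine(pseudo_data, n_candidates, rng=None):
    rng = rng or random.Random(0)
    digits = sorted({ch for s in pseudo_data for ch in s if ch.isdigit()})
    lengths = [sum(ch.isdigit() for ch in s) for s in pseudo_data]
    gaps = sorted({ch for s in pseudo_data for ch in s if not ch.isdigit()}) or [" "]
    out = []
    for _ in range(n_candidates):
        k = rng.choice(lengths)          # reuse an observed digit count
        gap = rng.choice(gaps)           # reuse an observed gap character
        out.append(gap.join(rng.choice(digits) for _ in range(k)))
    return out
```

A real generation network learns these recombinations adversarially rather than sampling them uniformly, but the constraint is the same: digits and gaps are drawn from the labeled data.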
S340: the generation model inputs the pseudo data set and the data set to be expanded into the discrimination model;
As the data set to be expanded forms, the labeled data set A is gradually expanded into the pseudo data set F plus the data set to be expanded. The pseudo data set F and the data set to be expanded are then input into the discrimination model, which discriminates the data and outputs true/false probability values.
S350: collecting the data with discrimination probability values greater than 0.5 output by the discrimination model to form the expanded data set.
Because the discrimination model trains its discrimination standard on the data in the pseudo data set F, the data output by the discrimination model with discrimination probability values greater than 0.5 (namely, all the data in the pseudo data set F plus the part of the data set to be expanded that conforms to the discrimination standard) are collected, and the merged set is the expanded data set G.
The expanded data set G obtained through the above steps not only preserves the feature similarity with the data in the labeled data set A, but also randomly expands and diversifies the data in the pseudo data set F; that is, it expands the data volume of the labeled data set A in an orderly manner.
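The collection rule of S350 reduces to a simple threshold filter once a scoring function exists. The sketch below uses illustrative names; `score` stands in for the discrimination model's probability output.

```python
# Hedged sketch of S350: keep everything the discrimination model
# scores above 0.5 -- all of F plus the conforming candidates.
def collect_expanded_set(pseudo_set, candidates, score):
    """`score(x)` is assumed to return the discrimination probability."""
    kept = [x for x in candidates if score(x) > 0.5]
    return list(pseudo_set) + kept   # merged set = expanded data set G
```

Data from the pseudo data set F pass unconditionally because the model was configured in S320 to score them above 0.5.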
Referring to fig. 6, in this embodiment, step S400 is specifically performed through the following steps:
S410: verifying the data in the expanded data set G according to the data characteristics of the labeled data set A;
The data in the newly generated expanded data set G are identified and labeled through an existing model built on the data characteristics of the labeled data set A, verifying whether the data in the expanded data set G should undergo the subsequent identification operation.
The verification may adopt a voting method, i.e., secondary identification based on majority rule or unanimous pass, respecting the verification result reached by most or all of several discriminators based on the data characteristics of the labeled data set A. That is, the step S410 of verifying the data in the expanded data set G according to the data characteristics of the labeled data set A includes: S411: verifying the data in the expanded data set G with the data in the labeled data set A as a model; S412: when more than half of the stages, or all stages, in the model yield consistent verification results for a datum, judging the verification result to be an identified label.
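The voting rule of S411-S412 can be sketched directly. This is an assumption-laden illustration in which each verification "stage" or discriminator is modeled as a predicate; the names are hypothetical.

```python
# Hedged sketch of S411-S412: majority rule (or unanimity) over the
# verification stages decides whether a datum gets the identified label.
def vote_verify(datum, discriminators, unanimous=False):
    votes = sum(1 for d in discriminators if d(datum))
    needed = len(discriminators) if unanimous else len(discriminators) // 2 + 1
    return votes >= needed   # True -> verification result is the identified label
```

Setting `unanimous=True` gives the "uniformly passes" variant; the default implements "minority obeys majority".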
S420: extracting the data whose verification result is the identified label, and deleting from the expanded data set the data whose verification result is not the identified label.
The verification result is either an identified label or an unidentified label. Data belonging to the identified label, i.e., data conforming to the data characteristics of the labeled data set A, are retained in the expanded data set G; data whose verification result is the unidentified label are deleted as data cleaning.
The annotation data generation method of any of the above embodiments can be applied in an annotation data generation apparatus, such as an intelligent terminal, a server, a workstation, a virtual server, a virtual workstation, and the like. The annotation data generation apparatus comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, it implements the annotation data generation method according to a pre-configured source language.
The above annotation data generation method can also be integrated into a computer-readable storage medium on which a computer program is stored; the computer program can be executed by a processor to implement the above annotation data generation method. The computer-readable storage medium can take the form of software, a virtual file, and the like.
The intelligent terminal may be implemented in various forms. For example, the terminal described in the present invention may include intelligent terminals such as a mobile phone, a smart phone, a notebook computer, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a navigation device, as well as fixed terminals such as a digital TV and a desktop computer. In the following, it is assumed that the terminal is an intelligent terminal. However, those skilled in the art will understand that the configuration according to the embodiments of the present invention can also be applied to fixed-type terminals, apart from elements used particularly for mobile purposes.
It should be noted that the embodiments of the present invention are described by way of preferred embodiments and not by way of limitation, and those skilled in the art can make modifications and variations of the embodiments described above without departing from the spirit of the invention.
Claims (10)
1. A method for generating annotation data, characterized by comprising the following steps:
S100: acquiring a data corpus and a labeled data set which is contained in the data corpus and has been labeled;
S200: analyzing the data characteristics of the labeled data set and manufacturing, according to the data characteristics, a pseudo data set conforming to the data characteristics; based on the data characteristics obtained by the analysis, other data conforming to the data characteristics and different from the data in the labeled data set are manufactured, wherein the digits used in the manufacture are the digits of the data in the labeled data set, and the gaps between the digits satisfy the gaps between the digits of the data in the labeled data set, so that the pseudo data in the pseudo data set is a recombination of the units of the data in the labeled data set;
S300: expanding the pseudo data set based on a GAN neural network to form an expanded data set;
S400: identifying whether the data in the expanded data set needs to be labeled, and screening the labeled data to form a training data set;
S500: performing neural network training on the training data set to form a training model;
S600: cleaning the data in the data corpus other than the labeled data set based on the training model, labeling the data conforming to the training model, and putting the labeled data into the labeled data set.
2. The annotation data generation method of claim 1,
the method for generating the annotation data further comprises the following steps:
S700: judging whether the data volume in the labeled data set is greater than or equal to an expected data volume;
S800: when the data volume in the labeled data set is smaller than the expected data volume, taking the union of the training data set and the labeled data set, and executing steps S500-S600 again.
3. The annotation data generation method of claim 2,
step S800 is replaced with:
S800': when the data volume in the labeled data set is smaller than the expected data volume, replacing the data in the pseudo data set with the data in the labeled data set, and executing steps S300-S600 again.
4. The annotation data generation method of claim 1,
the method for generating the annotation data further comprises the following steps:
S900: training other data sets besides the data corpus based on the labeled data set and/or the training data set formed in step S600.
5. The annotation data generation method of claim 1,
the step S300 of expanding the pseudo data set based on the GAN neural network to form the expanded data set comprises:
S310: constructing a generation model and a discrimination model;
S320: configuring the discrimination model to output discrimination probability values greater than 0.5 for data in the pseudo data set, and learning, based on the discrimination probability values of the data in the pseudo data set, to output discrimination probability values for data in non-pseudo data sets;
S330: the generation model generates a data set to be expanded based on the data in the pseudo data set;
S340: the generation model inputs the pseudo data set and the data set to be expanded into the discrimination model;
S350: collecting the data with discrimination probability values greater than 0.5 output by the discrimination model to form the expanded data set.
6. The annotation data generation method of claim 1,
the step S400 of identifying whether the data in the expanded data set needs to be labeled and screening the labeled data to form the training data set comprises:
S410: verifying the data in the expanded data set according to the data characteristics of the labeled data set;
S420: extracting the data whose verification result is an identified label, and deleting from the expanded data set the data whose verification result is not the identified label.
7. The annotation data generation method of claim 6,
the step S410 of verifying the data in the expanded data set according to the data characteristics of the labeled data set comprises:
S411: verifying the data in the expanded data set with the data in the labeled data set as a model;
S412: when more than half of the stages, or all stages, in the model yield consistent verification results for a datum, judging the verification result to be the identified label.
8. The annotation data generation method of claim 1,
the data characteristics include one or more of: the background of the data, the unit number of the data, the digital gaps of the data, the object of the data, and the noise of the data.
9. An annotation data generation apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the annotation data generation method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the annotation data generation method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810609646.8A CN108960409B (en) | 2018-06-13 | 2018-06-13 | Method and device for generating annotation data and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810609646.8A CN108960409B (en) | 2018-06-13 | 2018-06-13 | Method and device for generating annotation data and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108960409A CN108960409A (en) | 2018-12-07 |
CN108960409B true CN108960409B (en) | 2021-08-03 |
Family
ID=64488602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810609646.8A Active CN108960409B (en) | 2018-06-13 | 2018-06-13 | Method and device for generating annotation data and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108960409B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109816019A (en) * | 2019-01-25 | 2019-05-28 | 上海小萌科技有限公司 | A kind of image data automation auxiliary mask method |
CN109978029B (en) * | 2019-03-13 | 2021-02-09 | 北京邮电大学 | Invalid image sample screening method based on convolutional neural network |
CN110189351A (en) * | 2019-04-16 | 2019-08-30 | 浙江大学城市学院 | A kind of scratch image data amplification method based on production confrontation network |
CN112257731B (en) * | 2019-07-05 | 2024-06-28 | 杭州海康威视数字技术股份有限公司 | Virtual data set generation method and device |
CN110569379A (en) * | 2019-08-05 | 2019-12-13 | 广州市巴图鲁信息科技有限公司 | Method for manufacturing picture data set of automobile parts |
CN110874484A (en) * | 2019-10-16 | 2020-03-10 | 众安信息技术服务有限公司 | Data processing method and system based on neural network and federal learning |
US11651276B2 (en) | 2019-10-31 | 2023-05-16 | International Business Machines Corporation | Artificial intelligence transparency |
CN111143617A (en) * | 2019-12-12 | 2020-05-12 | 浙江大学 | Automatic generation method and system for picture or video text description |
CN111177132B (en) * | 2019-12-20 | 2024-07-30 | 中国平安人寿保险股份有限公司 | Method, device, equipment and storage medium for cleaning label of relational data |
CN111382785B (en) * | 2020-03-04 | 2023-09-01 | 武汉精立电子技术有限公司 | GAN network model and method for realizing automatic cleaning and auxiliary marking of samples |
CN111476324B (en) * | 2020-06-28 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Traffic data labeling method, device, equipment and medium based on artificial intelligence |
CN111741018B (en) * | 2020-07-24 | 2020-12-01 | 中国航空油料集团有限公司 | Industrial control data attack sample generation method and system, electronic device and storage medium |
CN112308167A (en) * | 2020-11-09 | 2021-02-02 | 上海风秩科技有限公司 | Data generation method and device, storage medium and electronic equipment |
CN112508000B (en) * | 2020-11-26 | 2023-04-07 | 上海展湾信息科技有限公司 | Method and equipment for generating OCR image recognition model training data |
CN112580310B (en) * | 2020-12-28 | 2023-04-18 | 河北省讯飞人工智能研究院 | Missing character/word completion method and electronic equipment |
CN113239205B (en) * | 2021-06-10 | 2023-09-01 | 阳光保险集团股份有限公司 | Data labeling method, device, electronic equipment and computer readable storage medium |
CN114926709A (en) * | 2022-05-26 | 2022-08-19 | 成都极米科技股份有限公司 | Data labeling method and device and electronic equipment |
CN116451087B (en) * | 2022-12-20 | 2023-12-26 | 石家庄七彩联创光电科技有限公司 | Character matching method, device, terminal and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107392125A (en) * | 2017-07-11 | 2017-11-24 | 中国科学院上海高等研究院 | Training method/system, computer-readable recording medium and the terminal of model of mind |
CN107622056A (en) * | 2016-07-13 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | The generation method and device of training sample |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | 广西师范大学 | Image automatic annotation method based on semi-supervised learning |
CN108009589A (en) * | 2017-12-12 | 2018-05-08 | 腾讯科技(深圳)有限公司 | Sample data processing method, device and computer-readable recording medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107622056A (en) * | 2016-07-13 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | The generation method and device of training sample |
CN107392125A (en) * | 2017-07-11 | 2017-11-24 | 中国科学院上海高等研究院 | Training method/system, computer-readable recording medium and the terminal of model of mind |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | 广西师范大学 | Image automatic annotation method based on semi-supervised learning |
CN108009589A (en) * | 2017-12-12 | 2018-05-08 | 腾讯科技(深圳)有限公司 | Sample data processing method, device and computer-readable recording medium |
Non-Patent Citations (1)
Title |
---|
"Generative Adversarial Networks (GAN): Research Progress and Prospects"; Wang Kunfeng; Acta Automatica Sinica; 2017-03-31; Vol. 43, No. 3; Sections 1-4 *
Also Published As
Publication number | Publication date |
---|---|
CN108960409A (en) | 2018-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108960409B (en) | Method and device for generating annotation data and computer-readable storage medium | |
CN110097094B (en) | Multiple semantic fusion few-sample classification method for character interaction | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN109993102B (en) | Similar face retrieval method, device and storage medium | |
CN112434721A (en) | Image classification method, system, storage medium and terminal based on small sample learning | |
CN108288051B (en) | Pedestrian re-recognition model training method and device, electronic equipment and storage medium | |
CN112464865A (en) | Facial expression recognition method based on pixel and geometric mixed features | |
CN109299258A (en) | A kind of public sentiment event detecting method, device and equipment | |
CN108229588B (en) | Machine learning identification method based on deep learning | |
CN109948735B (en) | Multi-label classification method, system, device and storage medium | |
WO2018196718A1 (en) | Image disambiguation method and device, storage medium, and electronic device | |
CN109993057A (en) | Method for recognizing semantics, device, equipment and computer readable storage medium | |
CN112016601B (en) | Network model construction method based on knowledge graph enhanced small sample visual classification | |
CN110610193A (en) | Method and device for processing labeled data | |
CN108537119A (en) | A kind of small sample video frequency identifying method | |
CN111325237B (en) | Image recognition method based on attention interaction mechanism | |
CN107871314A (en) | A kind of sensitive image discrimination method and device | |
CN112148994B (en) | Information push effect evaluation method and device, electronic equipment and storage medium | |
CN112733602B (en) | Relation-guided pedestrian attribute identification method | |
CN112784921A (en) | Task attention guided small sample image complementary learning classification algorithm | |
CN113704623A (en) | Data recommendation method, device, equipment and storage medium | |
CN117726884B (en) | Training method of object class identification model, object class identification method and device | |
CN115272692A (en) | Small sample image classification method and system based on feature pyramid and feature fusion | |
CN108830302B (en) | Image classification method, training method, classification prediction method and related device | |
CN108229692B (en) | Machine learning identification method based on dual contrast learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||