WO2023113294A1

WO2023113294A1 - Training data management device and method

Info

Publication number: WO2023113294A1
Application number: PCT/KR2022/018893
Authority: WO
Inventors: 이춘식; 전혜경; 하광림; 강인호
Original assignee: 주식회사 씨에스리
Priority date: 2021-12-16
Filing date: 2022-11-25
Publication date: 2023-06-22

Abstract

The present invention relates to managing training data, and more specifically relates to: collecting, labeling, and adjusting imbalances in training data for training an artificial intelligence model; updating the artificial intelligence model through the training data; and correcting for performance improvement. According to an embodiment of the present invention, data similar to learned data can be excluded from training data to reduce the cost of labeling.

Description

Learning data management device and method

The present invention relates to the management of learning data, and more particularly, to the collection, labeling, and imbalance adjustment of learning data for training an artificial intelligence model, to updating the artificial intelligence model through the learning data, and to correction for performance improvement.

Training data including appropriate labels are required for training of artificial intelligence models. Labels may be directly matched by a person for each data or automatically set corresponding to each data according to a preset algorithm.

In order to provide training data, a lot of time and money are consumed for tasks such as data collection and labeling. In particular, the process of manually labeling is time consuming and costly. If there are many similar data, it is inefficient for a person to directly label each data.

In addition, the performance of AI models improves when appropriate learning is performed. However, processing new input data requires a new training data set for that processing, and it is difficult to secure a training data set that corresponds to the pattern of the changed input data.

On the other hand, since the requirements of the artificial intelligence model vary, an update process through separate learning data is required. However, since the renewal process of an AI model is expensive, the appropriate renewal timing is important.

The present invention provides a learning data management apparatus and method capable of reducing the cost of data collection for model generation and renewal, and improving the performance of a model by constructing learning data through data generated from actual services.

According to one aspect of the present invention, an apparatus for managing learning data is provided.

The learning data management apparatus according to an embodiment of the present invention may include: a clustering unit that, when receiving unlabeled data, performs clustering on the unlabeled data according to a predetermined pattern to form one or more clusters; a similarity calculation unit that calculates the similarity between an unlabeled cluster, which is a cluster of unlabeled data, and a learning cluster, which is a cluster of pre-learning data; and a labeling unit that, according to the similarity, excludes the unlabeled data included in the unlabeled cluster from the training data or performs labeling on the unlabeled data included in the unlabeled cluster.

According to another aspect of the present invention, a learning data management method and a computer program executing the same are provided.

A learning data management method and a computer program executing the method according to an embodiment of the present invention may include: when receiving unlabeled data, performing clustering on the unlabeled data according to a predetermined pattern to form one or more clusters; calculating the similarity between an unlabeled cluster, which is a cluster of unlabeled data, and a learning cluster, which is a cluster of pre-learning data; and, according to the similarity, excluding the unlabeled data included in the unlabeled cluster from the training data or performing labeling on the unlabeled data included in the unlabeled cluster.

According to an embodiment of the present invention, the cost required for labeling can be reduced by excluding data similar to pre-learned data from the training data.

According to an embodiment of the present invention, it is possible to efficiently determine the renewal time of an artificial intelligence model in operation, thereby reducing the cost required for renewal while maintaining the performance of the artificial intelligence model.

In addition, according to an embodiment of the present invention, the update time of the model can be determined while managing the input data of the artificial intelligence model in service as learning data.

According to an embodiment of the present invention, imbalance between classes can be alleviated by sampling the excessive class so that its ratio to the small class is adjusted.

In addition, according to an embodiment of the present invention, a data set that has low information loss and is robust against data noise can be constructed.

According to an embodiment of the present invention, the performance of a model can be improved through data generated from a service without newly learning an artificial intelligence model through new data.

According to an embodiment of the present invention, it is possible to reduce the cost of data collection for model generation and renewal, and it is possible to improve the performance of the model by constructing learning data through data generated in actual services.

FIG. 1 is a block diagram illustrating a learning data management device according to an embodiment of the present invention.

FIG. 2 is a block diagram illustrating a labeler of a learning data management device according to an embodiment of the present invention.

FIG. 3 is a block diagram illustrating a model learning unit of a learning data management device according to an embodiment of the present invention.

FIG. 4 is a block diagram illustrating the structure of a learning data providing unit of a learning data management device according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating a method of performing labeling by an apparatus for managing learning data according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating a process of labeling learning data by an apparatus for managing learning data according to an embodiment of the present invention.

FIG. 7 is a flowchart illustrating a process of sampling learning data by an apparatus for managing learning data according to an embodiment of the present invention.

Since the present invention can make various changes and have various embodiments, specific embodiments are illustrated in the drawings and will be described in detail through detailed description. However, this is not intended to limit the present invention to specific embodiments, and should be understood to include all modifications, equivalents, and substitutes included in the spirit and scope of the present invention. In describing the present invention, if it is determined that a detailed description of related known technologies may unnecessarily obscure the subject matter of the present invention, the detailed description will be omitted. Also, as used in this specification and claims, the terms "a" and "an" are generally to be construed to mean "one or more" unless stated otherwise.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating an apparatus for managing learning data according to an embodiment of the present invention.

Referring to FIG. 1, the learning data management device 10 according to an embodiment of the present invention includes a labeler 100, a model learning unit 200, a learning data providing unit 300, a learning data collecting unit 400, and a model correction unit 500.

The labeler 100 receives unlabeled data from the outside and excludes from the learning data any data whose similarity with the pre-learned learning data stored in the learning data providing unit 300 (hereinafter referred to as pre-learning data) is equal to or greater than a predetermined first threshold. The labeler 100 performs labeling on the remaining data and transmits it to the learning data providing unit 300. A specific operation of the labeler 100 is described in detail below with reference to FIG. 2.

Referring to FIG. 2, the labeler 100 includes a clustering unit 110, a similarity calculating unit 120, and a labeling unit 130.

When receiving data for which no label has been set (hereinafter referred to as unlabeled data), the clustering unit 110 performs clustering on the unlabeled data according to a predetermined pattern to form one or more clusters. At this time, the clustering unit 110 may form clusters through techniques such as K-means, hierarchical clustering, spectral clustering, and DBSCAN. For example, when receiving unlabeled data including {swimsuit, seawater, echo}, the clustering unit 110 may cluster the data to form a first cluster {swimsuit, seawater} and a second cluster {echo}.
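As a rough illustration of this clustering step, the sketch below runs a minimal K-means over feature vectors. The source does not say how data items are turned into vectors, so the 2-D embeddings, the function name `kmeans`, and the use of the first k points as initial centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal K-means; returns a cluster index for each point.

    Assumption: the first k points serve as initial centroids so that
    the result is deterministic. A production system would normally
    rely on a library implementation instead.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical 2-D embeddings for the items in the example above.
items = ["swimsuit", "echo", "seawater"]
vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2)]
labels = kmeans(vectors, k=2)

clusters = {}
for item, label in zip(items, labels):
    clusters.setdefault(label, []).append(item)
```

With these made-up vectors, "swimsuit" and "seawater" end up in one cluster and "echo" in another, mirroring the example in the text.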

The similarity calculation unit 120 calculates the similarity between each cluster and the pre-learning data used for model learning. For example, the pre-learning data may be stored in the learning data providing unit 300. The similarity calculation unit 120 may calculate a similarity between the first cluster {swimsuit, seawater} and the clusters of pre-learning data classified into each class. Likewise, the similarity calculation unit 120 may calculate a similarity between the second cluster {echo} and the clusters of pre-learning data classified into each class. The similarity calculation unit 120 may calculate the similarity through techniques such as cosine similarity, Jaccard similarity, and Euclidean similarity. The similarity calculation unit 120 transmits each similarity to the labeling unit 130.
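The three similarity measures named above can be sketched as follows. This is only an illustration: the source does not define how a Euclidean distance is converted into a similarity, so the mapping 1 / (1 + d) used here is an assumption, as are the function names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def euclidean_similarity(a, b):
    """Assumed mapping of Euclidean distance d to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))
```

For set-valued data such as {water, wave} and {water, swimming}, `jaccard_similarity` gives 1/3, since one of three distinct items is shared.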

The labeling unit 130 excludes an unlabeled data cluster from the training data when the similarity between that cluster and one or more clusters of the pre-learning data is greater than or equal to a first threshold. That is, the labeling unit 130 performs labeling on clusters of unlabeled data whose similarity with every cluster of the pre-learning data is less than the first threshold, and stores the labeled data in the learning data providing unit 300 as new training data.

For example, when the similarity between the first cluster of unlabeled data, {swimsuit, seawater}, and the data classified as "sea", {water, wave, swimming, etc.}, (e.g., the average of the similarities between the individual data items) is greater than or equal to the first threshold, the labeling unit 130 may exclude the data of the first cluster from the training data. When the similarity between the second cluster {echo} and the clusters of pre-learning data is less than the first threshold, the labeling unit 130 performs labeling on the second cluster {echo} and stores it in the learning data providing unit 300 as training data. At this time, when the similarity between the second cluster {echo} and the pre-learning data is less than the first threshold and greater than or equal to a second threshold (where the first threshold is a natural number greater than the second threshold), the labeling unit 130 may automatically set the label for the data of the second cluster to the same label as the cluster of the class most similar to each data item of the second cluster. In addition, when the similarity between the second cluster {echo} and the clusters of the pre-learning data is less than the second threshold, the labeling unit 130 may perform labeling by receiving a new label for each data item of the second cluster directly from the user.
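The two-threshold rule described above can be summarized in a short sketch. The function name `decide_labeling`, the dictionary of per-class similarities, and the example threshold values are hypothetical. The sketch also simplifies by deciding once per cluster, whereas the text allows the automatic label to be chosen per data item.

```python
def decide_labeling(similarities, first_threshold, second_threshold):
    """Return (action, label) for one unlabeled cluster.

    `similarities` maps each class label of the pre-learning data to the
    cluster's similarity with that class (a hypothetical structure).
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    best_label, best_score = max(similarities.items(), key=lambda kv: kv[1])
    if best_score >= first_threshold:
        # Too similar to already learned data: not used as training data.
        return ("exclude", None)
    if best_score >= second_threshold:
        # Moderately similar: reuse the label of the most similar class.
        return ("auto_label", best_label)
    # Dissimilar to everything known: a person supplies a new label.
    return ("manual_label", None)
```

For instance, with assumed thresholds of 0.8 and 0.5, a cluster whose best match is "sea" at 0.6 would be labeled "sea" automatically, while one whose best match is 0.3 would be sent to a person.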

Therefore, the labeler 100 prevents overfitting to a specific class by keeping unlabeled data that is very similar to the pre-learning data from being used as training data. In addition, by performing automatic labeling on part of the unlabeled data, the labeler 100 reduces the number of unlabeled data items that need a new label, thereby reducing the load of the labeling task.

Referring back to FIG. 1 , the model learning unit 200 learns a model constituting artificial intelligence through learning data stored in the learning data providing unit 300 . In this case, the model may be a model implemented by a system such as a neural network constituting artificial intelligence. When receiving additional training data from the training data providing unit 300, the model learning unit 200 compares patterns of the pre-learning data (data and labels) with patterns of the additional training data. The model learning unit 200 updates the model through the additional training data when the similarity between the pattern of the pre-learning data and the pattern of the additional training data is less than a predetermined threshold. Conversely, the model learning unit 200 does not update the model through the additional training data when the similarity between the pattern of the pre-learning data and the pattern of the additional training data is greater than or equal to a predetermined threshold. Hereinafter, a detailed structure of the model learning unit 200 will be described with reference to FIG. 3 .

Referring to FIG. 3 , the model learning unit 200 includes a pattern storage unit 210 , an update determination unit 220 and a learning unit 230 .

The pattern storage unit 210 stores patterns of learning data. For example, when a model for predicting a season from clothing is generated with a training data set consisting of clothing training data and season labels, the pattern storage unit 210 may store patterns that associate clothing with seasons, such as <short sleeve - summer> and <long sleeve - winter>.

When receiving a pattern of additional training data from the learning data providing unit 300, the update determination unit 220 updates the model with the additional training data through the learning unit 230 if the pattern similarity between the additional training data and the pre-learning data is less than or equal to a designated threshold. For example, when the pattern of the additional training data is <Short Sleeve - Summer>, the pattern of the pre-learning data and the pattern of the additional training data are the same, so the update determination unit 220 does not perform an update according to <Short Sleeve - Summer>. On the other hand, if the pattern of the additional training data is <Shorts - Summer> and the similarity between the pattern of the pre-learning data and the pattern of the additional training data is less than or equal to the designated threshold, the update determination unit 220 may perform a model update according to <Shorts - Summer> through the learning unit 230. Conversely, if the similarity between the pattern of the pre-learning data and the pattern of the additional training data exceeds the designated threshold, the update determination unit 220 withholds the update according to the additional training data, thereby avoiding an update of little benefit that would be driven by additional training data similar to the pre-learning data.

In addition, the update determination unit 220 may perform a model update through the learning unit 230 when the ratio of the number of additional training data to the total training data received from the learning data providing unit 300 is greater than or equal to a predetermined threshold (e.g., 30%). Alternatively, when the ratio of the number of additional training data to the total training data is greater than or equal to the predetermined threshold and the pattern similarity between the additional training data and the pre-learning data is less than or equal to the designated threshold, the update determination unit 220 may update the model with the additional training data through the learning unit 230.
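The update conditions above can be sketched as a single predicate. The 30% ratio comes from the example in the text. The similarity threshold of 0.5, the function name `should_update`, and the `require_both` flag (which selects the "alternatively" variant, where both conditions must hold) are assumptions for illustration.

```python
def should_update(pattern_similarity, additional_count, total_count,
                  similarity_threshold=0.5, ratio_threshold=0.30,
                  require_both=False):
    """Decide whether the model should be updated with additional data."""
    # Condition 1: the new patterns differ enough from the learned ones.
    low_similarity = pattern_similarity <= similarity_threshold
    # Condition 2: additional data makes up a large enough share of all data.
    enough_new_data = (
        total_count > 0 and additional_count / total_count >= ratio_threshold
    )
    if require_both:
        return low_similarity and enough_new_data
    return low_similarity or enough_new_data
```

Under these assumptions, an identical pattern such as <Short Sleeve - Summer> (similarity 1.0) with only 10% additional data triggers no update, while the same data at a 30% share does.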

The learning unit 230 performs model learning through training data and additional training data.

Therefore, the learning data management device performs a model update only when the pattern of the additional training data differs from the pattern of the pre-learning data by a certain amount or more, which prevents the inefficient situation in which the model is updated every time additional training data is added.

In addition, the learning data management device can determine the renewal time of the artificial intelligence model while managing the input value of the artificial intelligence model in service as learning data using the previously added label.

Referring back to FIG. 1, the learning data providing unit 300 stores learning data and provides the learning data to the model learning unit 200. When the learning data is classified into a plurality of classes, the learning data providing unit 300 detects an excessive class, which is the class with the largest number of learning data, and a small class, which is the class with the smallest number of learning data. The learning data providing unit 300 adjusts the number of training data of the excessive class by sampling the excessive class so that the ratio between the numbers of training data of the excessive class and the small class falls within a specified range. In addition, the learning data providing unit 300 may generate noise data by using the training data of the excessive class, input the noise data to an unsupervised learning generative adversarial network (GAN) to generate noise learning data, add the noise learning data to the existing training data, and provide the result to the model learning unit 200. Hereinafter, a detailed structure of the learning data providing unit 300 will be described with reference to FIG. 4.

Referring to FIG. 4 , the learning data providing unit 300 includes an imbalance distribution checking unit 310, a sampling unit 320, and a noise adding unit 330.

When the learning data is classified into a plurality of classes, the imbalance distribution checking unit 310 determines whether the ratio between the numbers of training data of the excessive class and the small class falls within a specified range. If the ratio of the number of training data of the excessive class to that of the small class exceeds the specified range, the imbalance distribution checking unit 310 transmits a sampling request signal, which requests sampling of the excessive class, to the sampling unit 320.
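A minimal sketch of this imbalance check follows. The function name `find_imbalance` and the upper bound of 3.0 on the ratio are hypothetical, since the source only speaks of a "specified range".

```python
from collections import Counter

def find_imbalance(labels, max_ratio=3.0):
    """Return (excessive_class, small_class, ratio, sampling_needed)."""
    counts = Counter(labels)
    excessive, n_max = max(counts.items(), key=lambda kv: kv[1])
    small, n_min = min(counts.items(), key=lambda kv: kv[1])
    ratio = n_max / n_min
    # Sampling is requested only when the ratio leaves the allowed range.
    return excessive, small, ratio, ratio > max_ratio
```

When the last element of the result is true, the sampling unit would be asked to reduce the excessive class.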

When receiving the sampling request signal, the sampling unit 320 performs sampling on the excessive class. The sampling unit 320 may calculate a center point of the feature vectors corresponding to the training data of the excessive class. The sampling unit 320 may sample the learning data whose feature vector lies within a certain distance of the calculated center point, or may apply a probabilistic sampling technique such as simple random sampling, two-stage sampling, stratified sampling, cluster sampling, or systematic sampling.
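The center-point approach can be sketched as below, assuming each training item of the excessive class is already represented by a numeric feature vector. The function name `sample_near_center`, the `radius` parameter, and the optional `limit` are illustrative choices, not details given in the source.

```python
import math

def sample_near_center(vectors, radius, limit=None):
    """Keep the vectors lying within `radius` of the class center point.

    The center point is the per-dimension mean of all vectors. Results
    are ordered from nearest to farthest; `limit` optionally caps the
    number of samples kept.
    """
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    near = sorted(
        (v for v in vectors if math.dist(v, center) <= radius),
        key=lambda v: math.dist(v, center),
    )
    return near if limit is None else near[:limit]
```

Keeping only points near the center discards outliers of the excessive class first, which is one plausible reading of why the text bases sampling on the center point.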

The noise adding unit 330 generates noise data using the training data and inputs the noise data to an unsupervised learning GAN to generate noise training data. The noise adding unit 330 may add the noise learning data to the sampling-completed training data and provide it to the model learning unit 200.
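The source does not state how the noise data is derived from the training data. One simple possibility, shown purely as an assumption, is to perturb each feature vector with Gaussian noise. The GAN that turns this noise data into noise learning data is outside the scope of the sketch, and the name `make_noise_data` and the `sigma` and `seed` parameters are hypothetical.

```python
import random

def make_noise_data(vectors, sigma=0.05, seed=0):
    """Return a noisy copy of each feature vector (assumed Gaussian noise).

    A seeded generator keeps the output reproducible. Feeding the result
    to a GAN, as the text describes, is not shown here.
    """
    rng = random.Random(seed)
    return [tuple(x + rng.gauss(0.0, sigma) for x in v) for v in vectors]
```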

Therefore, the learning data management apparatus according to an embodiment of the present invention prevents the model from being trained in a biased direction, in which it predicts only the result of the excessive class because a specific class has grown too large, and allows the model to be trained robustly against noise through noise learning data generated in consideration of noise.

Referring back to FIG. 1, the learning data collection unit 400 collects learning data from input data and output data generated in a service through a learned model. For example, the learning data collection unit 400 may collect input data and output data generated in a service through a learned model. The learning data collection unit 400 may configure an additional learning data set in which, for output data falling within the label range of the pre-learning data (the set of labels set for the pre-learning data), the output data is set as the label and the corresponding input data is set as learning data. At this time, the learning data collection unit 400 may include the input data and output data in the additional training data set only when the input data corresponding to output data within the label range of the pre-learning data is different from the pre-learning data. The learning data collection unit 400 transmits the additional learning data set to the learning data providing unit 300.
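The collection rule can be sketched as a filter over logged input/output pairs. The function name `collect_additional_data`, the shape of `service_log`, and the reading of "different from the pre-learning data" as "not already present among the pre-learning inputs" are assumptions.

```python
def collect_additional_data(service_log, label_set, pre_learning_inputs):
    """Build an additional training set from (input, output) service pairs.

    A pair is kept only if the model output is one of the known labels
    and the input is not already part of the pre-learning data.
    """
    seen = set(pre_learning_inputs)
    additional = []
    for x, y in service_log:
        if y in label_set and x not in seen:
            # The model output y is used as the label of input x.
            additional.append((x, y))
    return additional
```

With the clothing example from the text, a logged pair ("shorts", "summer") would be collected, while ("short sleeve", "summer") would be skipped if "short sleeve" is already in the pre-learning data.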

Therefore, the learning data management device according to an embodiment of the present invention can reduce the cost of data collection for model creation and renewal, and can improve the performance of the model by configuring learning data from data generated in actual services.

The model correction unit 500 improves the performance of the model by correcting data output by the model. For example, the model correction unit 500 may monitor, through the learning data collection unit 400, input data and output data generated in a service through the learned model. The model correction unit 500 may calculate the deviation between a first output, which the model outputs when the pre-learning data used for model learning is taken as input, and a second output, which the model outputs when input data similar to the pre-learning data is taken as input (that is, input data whose similarity with the pre-learning data, calculated through techniques such as cosine similarity, Jaccard similarity, and Euclidean similarity, is equal to or greater than a specified threshold). The model correction unit 500 may set the center value of the class to which the deviation between the first output and the second output belongs as an error correction value. At this time, the model correction unit 500 may preset classes for classifying deviations between the first output and the second output, and may set the center value of the deviations corresponding to each class as an error correction value. The model correction unit 500 may correct the output data corresponding to input data by applying the error correction value to the input data or the output data. At this time, the model correction unit 500 may add the error correction value to or subtract it from the output data, or may add the error correction value to or subtract it from the input data and then input the result to the model so that the output data of the model is corrected.
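One way to read this correction scheme is sketched below for scalar outputs. Several points are assumptions rather than details from the source: the deviation is taken as second output minus first output, the preset classes are modeled as half-open intervals defined by `class_edges`, "center value" is interpreted as the median, and the correction is applied by subtraction from the output. The function names are hypothetical.

```python
from statistics import median

def correction_values(first_outputs, second_outputs, class_edges):
    """Group deviations into preset classes and return each class's center.

    class_edges: sorted boundaries, e.g. [-1.0, 0.0, 1.0] defines the
    classes [-1, 0) and [0, 1). The median stands in for "center value".
    """
    deviations = [s - f for f, s in zip(first_outputs, second_outputs)]
    centers = {}
    for low, high in zip(class_edges, class_edges[1:]):
        members = [d for d in deviations if low <= d < high]
        if members:
            centers[(low, high)] = median(members)
    return centers

def correct_output(output, correction):
    """Remove the estimated error from a model output."""
    return output - correction
```

Because this only adjusts outputs with stored correction values, it fits the stated aim of improving results without retraining the model.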

Therefore, the learning data management device according to an embodiment of the present invention can improve the performance of a model through data generated from a service without newly learning an artificial intelligence model through new data.

FIG. 5 is a flowchart illustrating a method of performing labeling by an apparatus for managing learning data according to an embodiment of the present invention. Each step described below is a process performed by each functional unit constituting the learning data management device described above with reference to FIG. 1.

Referring to FIG. 5 , in step 510, the learning data management device labels learning data for which no label has been set. At this time, a specific process of performing labeling will be described in detail with reference to FIG. 6 later.

In step 515, the learning data management device collects additional learning data from the data input to and output from the model in a service that uses the model. For example, the learning data management device may collect input data and output data generated in a service through a learned model. The learning data management device may configure an additional training data set in which, for output data falling within the label range of the pre-learning data, the output data is set as the label and the corresponding input data is set as training data. In this case, the learning data management device may include the input data and output data in the additional training data set only when the input data corresponding to output data within the label range of the pre-learning data is different from the pre-learning data.

In step 520, the learning data management device performs sampling on the learning data and adjusts the number of learning data belonging to the excessive class. At this time, a specific process of performing the sampling will be described in detail with reference to FIG. 7 later.

In step 525, the learning data management device determines whether the model needs to be updated. In this case, the learning data management device may determine whether the model needs to be updated according to the similarity between the previous learning data and the additional learning data pattern or the ratio of the additional learning data to the total learning data. For example, the learning data management device may store and manage patterns of additional learning data and pre-learning data. The learning data management device may determine that the model needs to be updated through the additional learning data when the similarity between the patterns of the additional learning data and the pre-learning data is less than or equal to a specified threshold. Alternatively, the learning data management device may determine that the model needs to be updated through the additional training data when the ratio of the additional training data to the total training data is greater than or equal to a specified threshold (eg, 30%).

If the model needs to be updated in step 525, in step 530, the learning data management device updates the model using the additional learning data.

If the model does not need to be updated in step 525, in step 535 the learning data management device corrects the output data of the model according to the deviation between a first output, which the model outputs when the pre-learning data is used as input, and a second output, which the model outputs when, among the input data received during service provision through the model, input data similar to the pre-learning data is used as input. In this case, the learning data management device may preset classes for classifying deviations between the first output and the second output, and may set the center value of the deviations corresponding to each class as an error correction value. The learning data management device may correct the output data corresponding to the input data by applying the error correction value to the output data. Alternatively, the learning data management device may apply the error correction value to the input data and then input the corrected input data to the model so that the value of the output data is corrected.

FIG. 6 is a flowchart illustrating a process of labeling learning data by a learning data management device according to an embodiment of the present invention. Each process described below may be a process corresponding to step 510 of FIG. 5.

Referring to FIG. 6 , in step 610, the learning data management apparatus performs clustering on each unlabeled data according to a predetermined pattern to form one or more clusters.

In step 620, the learning data management apparatus calculates a similarity between each cluster of unlabeled data and a cluster of pre-learning data used for model learning. The learning data management device may calculate the similarity through techniques such as cosine similarity, Jaccard similarity, and Euclidean similarity.

In step 630, the learning data management apparatus excludes an unlabeled data cluster from the training data when the similarity between that cluster and one or more clusters of the pre-learning data is greater than or equal to a first threshold. Alternatively, when the similarity of a cluster of unlabeled data with every cluster of the pre-learning data is less than the first threshold and greater than or equal to a second threshold (where the first threshold is greater than the second threshold), the learning data management device may automatically set the label of each data item to the same label as the cluster of the class most similar to that data item. Alternatively, when the similarity between the cluster of unlabeled data and every cluster of the pre-learning data is less than the second threshold, the learning data management apparatus may perform labeling by receiving a new label for each data item directly from the user.

FIG. 7 is a flowchart illustrating a process of sampling learning data by an apparatus for managing learning data according to an embodiment of the present invention. Each process described below is a process corresponding to step 520 of FIG. 5.

Referring to FIG. 7, in step 710, when learning data is classified into a plurality of classes, the learning data management apparatus detects an excessive class, which is the class with the largest number of learning data, and a small class, which is the class with the smallest number of learning data.

In step 720, the learning data management device adjusts the number of training data of the excessive class by sampling the excessive class so that the ratio between the numbers of training data of the excessive class and the small class falls within a specified range. For example, the learning data management device may calculate a center point of the feature vectors corresponding to the learning data of the excessive class, and perform sampling on the excessive class based on the center point.

In step 730, the learning data management device generates noise data using the training data of the excessive class, inputs the noise data into an unsupervised learning generative adversarial network (GAN) to generate noise learning data, and adds the noise learning data to the existing training data.

The learning data management method according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer readable medium. Computer readable media may include program instructions, data files, data structures, etc. alone or in combination. Program instructions recorded on a computer readable medium may be specially designed and configured for the present invention or may be known and usable to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. In addition, the above-described medium may be a transmission medium, such as light including a carrier wave, a metal wire, or a waveguide, that transmits a signal designating a program command, data structure, or the like. Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes such as those produced by a compiler. The hardware device described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

So far, the present invention has been looked at mainly by its embodiments. Those skilled in the art to which the present invention pertains will be able to understand that the present invention can be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments should be considered from a descriptive point of view rather than a limiting point of view. The scope of the present invention is shown in the claims rather than the foregoing description, and all differences within the equivalent scope will be construed as being included in the present invention.

Modes for carrying out the invention have been described together in the best mode for carrying out the invention above.

The present invention can reduce the cost of data collection for model generation and renewal, and can improve the performance of the model by configuring learning data through data generated in actual services, so it has industrial applicability.

Claims

A learning data management device comprising:

a clustering unit configured to, upon receiving unlabeled data, form one or more clusters by clustering the unlabeled data according to a predetermined pattern;

a similarity calculation unit configured to calculate a similarity between an unlabeled cluster, which is a cluster of the unlabeled data, and a learning cluster, which is a cluster of previously learned data; and

a labeling unit configured to, according to the similarity, either exclude the unlabeled data included in the unlabeled cluster from the learning data or label the unlabeled data included in the unlabeled cluster.
The learning data management device of claim 1,

wherein the labeling unit

excludes the unlabeled data included in the unlabeled cluster from the learning data when the similarity is equal to or greater than a predetermined first threshold.
The learning data management device of claim 2, wherein the labeling unit,

when the similarity is less than the first threshold and equal to or greater than a predetermined second threshold, sets the label of the unlabeled data included in the unlabeled cluster to be the same as the label corresponding to the learning cluster, and

wherein the first threshold is a natural number greater than the second threshold.
The learning data management device of claim 3,

wherein the labeling unit,

when the similarity is less than the second threshold,

receives and sets a new label for the unlabeled data included in the unlabeled cluster.
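The two-threshold decision recited in the device claims above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `route_cluster`, the `Action` enum, the 0 to 100 similarity scale, and the default threshold values 90 and 60 are all assumptions chosen only to satisfy the stated condition that the first threshold is a natural number greater than the second.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical names for the three outcomes described in the claims."""
    EXCLUDE = "exclude"              # similar enough to learned data: drop it
    COPY_LABEL = "copy_label"        # reuse the label of the learning cluster
    REQUEST_LABEL = "request_label"  # ask for a new label (e.g. from a person)


def route_cluster(similarity: float,
                  first_threshold: int = 90,
                  second_threshold: int = 60) -> Action:
    """Decide what to do with an unlabeled cluster given its similarity.

    The threshold defaults are illustrative assumptions; the claims only
    require first_threshold > second_threshold.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    if similarity >= first_threshold:
        return Action.EXCLUDE
    if similarity >= second_threshold:
        return Action.COPY_LABEL
    return Action.REQUEST_LABEL
```

Under these assumed thresholds, a cluster with similarity 95 would be excluded, one with similarity 70 would inherit the learning cluster's label, and one with similarity 10 would be sent for new labeling.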
A learning data management device comprising:

a learning data collection unit that collects input data and output data generated from a service through a model, and generates, as learning data, the input data with the output data set as its label; and

a learning data providing unit that stores the learning data.
The learning data management device of claim 5,

wherein the learning data collection unit

sets the input data as the learning data by using, as a label, only the output data that falls within the label range of previously learned data among the output data.
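The collection step recited above, in which a model's service outputs become labels for the corresponding inputs, might look like the sketch below. It assumes that the "label range" is simply the set of labels already present in the previously learned data; the claims do not define the range further, and the name `build_learning_data` and the shape of `service_log` are hypothetical.

```python
from typing import Any, Hashable, Iterable, List, Set, Tuple


def build_learning_data(
    service_log: Iterable[Tuple[Any, Hashable]],
    known_labels: Set[Hashable],
) -> List[Tuple[Any, Hashable]]:
    """Turn (input, model output) pairs from a live service into learning data.

    Assumption: an output is usable as a label only if it is one of the
    labels seen in the previously learned data (known_labels).
    """
    learning_data = []
    for input_data, output_data in service_log:
        if output_data in known_labels:
            # The model output is used directly as the label of the input.
            learning_data.append((input_data, output_data))
    return learning_data
```

Outputs outside the known label set are dropped here; a real system could instead route them to manual review, which the claims leave open.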
A learning data management device comprising:

a learning data collection unit that collects learning data; and

a learning data providing unit that performs sampling on an over-represented class among the learning data.
The learning data management device of claim 7,

wherein the learning data providing unit

detects the over-represented class and an under-represented class in the learning data, and

performs sampling on the over-represented class when the ratio between the number of learning data items in the over-represented class and in the under-represented class is outside a specified range.
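The imbalance check and sampling recited above can be illustrated with random undersampling. This is a sketch under stated assumptions: the claims say only that "sampling" is performed on the over-represented class when the class ratio leaves a specified range, so the choice of random undersampling, the `max_ratio` parameter, and the function name `rebalance` are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Hashable, List, Tuple

Sample = Tuple[Any, Hashable]  # (features, label)


def rebalance(samples: List[Sample], max_ratio: float = 3.0,
              seed: int = 0) -> List[Sample]:
    """Undersample the largest class if it exceeds max_ratio times the smallest.

    max_ratio stands in for the "specified range" of the claims (assumption).
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    if len(by_label) < 2:
        return list(samples)  # nothing to balance against

    # Detect the over-represented and under-represented classes by count.
    over = max(by_label, key=lambda label: len(by_label[label]))
    under = min(by_label, key=lambda label: len(by_label[label]))
    n_over, n_under = len(by_label[over]), len(by_label[under])
    if n_over / n_under <= max_ratio:
        return list(samples)  # ratio is within the allowed range

    # Randomly keep only as many over-class samples as the ratio allows.
    keep = int(n_under * max_ratio)
    rng = random.Random(seed)
    by_label[over] = rng.sample(by_label[over], keep)
    return [s for group in by_label.values() for s in group]
```

Only the single largest class is reduced in this sketch; a dataset with several oversized classes would need the check repeated per class.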
The learning data management device of claim 7,

wherein the learning data providing unit

generates noise learning data by inputting noise data into a generative adversarial network (GAN) trained by unsupervised learning, and

includes the noise learning data in the learning data set.
A method for managing learning data by a learning data management device, the method comprising:

forming one or more clusters by clustering unlabeled data according to a predetermined pattern when the unlabeled data is received;

calculating a similarity between an unlabeled cluster, which is a cluster of the unlabeled data, and a learning cluster, which is a cluster of previously learned data; and

excluding the unlabeled data included in the unlabeled cluster from the learning data, or labeling the unlabeled data included in the unlabeled cluster, according to the similarity.
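The similarity calculation between an unlabeled cluster and a learning cluster is not tied to a particular measure in the claims. As one possible illustration, the sketch below compares cluster centroids by cosine similarity and scales the result to 100; the centroid-based approach, the scaling, and the names `centroid` and `cluster_similarity` are assumptions, and the clusters are taken to be non-empty lists of equal-length numeric feature vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Mean of the feature vectors in a (non-empty) cluster."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def cluster_similarity(unlabeled_cluster: List[Vector],
                       learning_cluster: List[Vector]) -> float:
    """Cosine similarity of the two cluster centroids, scaled by 100.

    Returns 0.0 if either centroid is the zero vector.
    """
    a = centroid(unlabeled_cluster)
    b = centroid(learning_cluster)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)  # multi-argument hypot: Python 3.8+
    if norm == 0:
        return 0.0
    return 100.0 * dot / norm
```

A score produced this way could then be compared against thresholds to decide whether the cluster is excluded or labeled.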
The learning data management method of claim 10,

wherein the step of excluding the unlabeled data included in the unlabeled cluster from the learning data, or labeling the unlabeled data included in the unlabeled cluster, according to the similarity comprises

excluding the unlabeled data included in the unlabeled cluster from the learning data when the similarity is equal to or greater than a predetermined first threshold.
The learning data management method of claim 11,

wherein the step of excluding the unlabeled data included in the unlabeled cluster from the learning data, or labeling the unlabeled data included in the unlabeled cluster, according to the similarity comprises

setting the label of the unlabeled data included in the unlabeled cluster to be the same as the label corresponding to the learning cluster when the similarity is less than the first threshold and equal to or greater than a predetermined second threshold, and

wherein the first threshold is a natural number greater than the second threshold.
The learning data management method of claim 12,

wherein the step of excluding the unlabeled data included in the unlabeled cluster from the learning data, or labeling the unlabeled data included in the unlabeled cluster, according to the similarity

further comprises receiving and setting a new label for the unlabeled data included in the unlabeled cluster when the similarity is less than the second threshold.
A method for managing learning data by a learning data management device, the method comprising:

collecting input data and output data generated from a service through a model;

generating, as learning data, the input data with the output data set as its label; and

storing the learning data.
The learning data management method of claim 14,

wherein the step of generating, as learning data, the input data with the output data set as its label comprises setting the input data as the learning data by using, as a label, only the output data that falls within the label range of previously learned data among the output data.
A method for managing learning data by a learning data management device, the method comprising:

collecting learning data; and

performing sampling on an over-represented class among the learning data.
The learning data management method of claim 16,

wherein the step of performing sampling on the over-represented class among the learning data comprises:

detecting the over-represented class and an under-represented class in the learning data; and

performing sampling on the over-represented class when the ratio between the number of learning data items in the over-represented class and in the under-represented class is outside a specified range.
The learning data management method of claim 17, further comprising:

generating noise learning data by inputting noise data into a generative adversarial network (GAN) trained by unsupervised learning; and

including the noise learning data in a learning data set.
A computer program recorded on a computer-readable recording medium for executing the learning data management method according to any one of claims 10 to 18.