WO2022107262A1

WO2022107262A1 - Determination device, determination method, and determination program

Info

Publication number: WO2022107262A1
Application number: PCT/JP2020/043087
Authority: WO
Inventors: 俊樹芝原; 大紀千葉; 満昭秋山
Original assignee: 日本電信電話株式会社
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2022-05-27

Abstract

A determination device (10) uses second training data, which is obtained by removing a part of first training data, to construct a model. The determination device (10) then makes a comparison between a first model constructed with use of the first training data and a second model constructed with use of the second training data, and outputs a difference between the first model and the second model. If the difference between the first model and the second model is large, the determination device (10) determines that the first training data is insufficient for model construction. Conversely, if the difference between the first model and the second model is small, the determination device (10) determines that the first training data is sufficient for model construction.

Description

Determination device, determination method, and determination program

The present invention relates to a determination device, a determination method, and a determination program for determining whether or not the teacher data used for model construction is sufficient.

In machine learning, it is necessary to collect a large amount of data in advance in order to build a highly accurate model. However, in fields where the cost of data collection is high, such as security and medical care, it is not easy to collect a large amount of data in advance for machine learning. In such fields, in order to reduce the data collection cost, it is desirable to end data collection once enough data has been collected that the accuracy of the model no longer improves even when more data is used for training.

Conventionally, as an evaluation method for a model learned by machine learning, for example, evaluation based on cross-validation and evaluation based on an information criterion (Non-Patent Documents 1 and 2) have been proposed.

However, although each of the conventional techniques can judge whether the accuracy of a model is good or bad, none of them can judge whether the teacher data used for constructing the model is sufficient. Therefore, an object of the present invention is to solve the above-mentioned problem and to make it possible to determine whether or not the teacher data used for model construction is sufficient.

In order to solve the above-mentioned problem, the present invention is characterized by including: a model construction unit that constructs a second model using second teacher data, which is data obtained by removing a part of the data from first teacher data; a model comparison unit that specifies, among the feature amounts contained in the first teacher data and the second teacher data, a feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and that outputs the number of data including the specified feature amount, or the number of specified feature amounts, as the difference between the first model and the second model; and a determination unit that determines, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.

According to the present invention, it is possible to determine whether or not the teacher data used for constructing the model is sufficient.

FIG. 1 is a diagram showing a configuration example of a determination device.
FIG. 2 is a diagram for explaining an analysis example of the difference between the first model and the second model.
FIG. 3 is a diagram showing an analysis example of the difference between the first model and the second model.
FIG. 4 is a flowchart showing an example of the processing procedure of the determination device of the first embodiment.
FIG. 5 is a flowchart showing a specific example of the processing procedure of the determination device of the first embodiment.
FIG. 6 is a diagram for explaining the processing of the comparison result analysis unit of FIG. 1.
FIG. 7 is a flowchart showing an example of the processing procedure of the determination device of the second embodiment.
FIG. 8 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
FIG. 9 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
FIG. 10 is a diagram showing a configuration example of a computer that executes a determination program.

Hereinafter, modes (embodiments) for carrying out the present invention will be described with reference to the drawings, divided into a first embodiment and a second embodiment. The present invention is not limited to the following embodiments.

[First Embodiment]
[Overview]
The outline of the determination device of the first embodiment will be described. The determination device constructs a model (second model) using, as teacher data, the data obtained by removing a part of the teacher data of a certain model (first model). Then, the determination device compares the first model and the second model, and if the difference between the first model and the second model is large, it determines that the teacher data used for constructing the first model is insufficient for model construction. On the other hand, when the difference between the first model and the second model is small, the determination device determines that the teacher data is sufficient for model construction.

[Configuration example]
Next, a configuration example of the determination device 10 will be described with reference to FIG. 1. As shown in FIG. 1, the determination device 10 includes, for example, a teacher data storage unit 11, a data removal unit 12, a model construction unit 13, a model comparison unit 14, and a determination unit 15. The comparison result analysis unit 16 shown by the broken line may or may not be provided; the case where it is provided will be described in the second embodiment.

[Teacher data storage unit]
The teacher data storage unit 11 stores teacher data used when the model construction unit 13 constructs a model. The teacher data storage unit 11 stores, for example, the first teacher data used for constructing the first model described above.

[Data removal unit]
The data removal unit 12 removes a part of the data from the first teacher data. For example, the data removal unit 12 may remove data from the first teacher data at random, or may remove data based on the time at which the data was acquired, the data label (malignant or benign, etc.), or the data type (malware family, etc.).

For example, based on the label of the data, the data removal unit 12 may remove a part of the malicious data or a part of the data of a specific malware family from the first teacher data, or may remove the same number of data from all malware families. Further, the data removal unit 12 may remove data in order, starting from the most recently acquired data.
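The removal strategies described above can be sketched as follows. This is only a minimal illustration and not the implementation of the data removal unit 12; the record layout (`Sample` with `label`, `family`, and `acquired_at` fields) and all function names are assumptions introduced for this example.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    """Hypothetical teacher-data record (field names are assumptions)."""
    features: tuple
    label: str            # e.g. "malicious" or "benign"
    family: str           # malware family name, "" for benign data
    acquired_at: datetime


def remove_random(data, n, seed=0):
    """Remove n randomly chosen samples."""
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(data)), n))
    return [s for i, s in enumerate(data) if i not in removed]


def remove_family(data, family):
    """Remove every sample of one malware family."""
    return [s for s in data if s.family != family]


def remove_per_family(data, n, seed=0):
    """Remove up to n randomly chosen samples from each malware family."""
    rng = random.Random(seed)
    by_family = {}
    for i, s in enumerate(data):
        if s.family:
            by_family.setdefault(s.family, []).append(i)
    removed = set()
    for indices in by_family.values():
        removed.update(rng.sample(indices, min(n, len(indices))))
    return [s for i, s in enumerate(data) if i not in removed]


def remove_latest(data, n):
    """Remove the n most recently acquired samples (result is time-ordered)."""
    ordered = sorted(data, key=lambda s: s.acquired_at)
    return ordered[:max(len(ordered) - n, 0)]
```

Each function returns the second teacher data and leaves the first teacher data unchanged, so both models can be constructed from the same source list.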

The number of data to be removed by the data removal unit 12 is arbitrary, but it is assumed to be a number such that, when the first teacher data is insufficient for model construction, the removal affects the classification results of the model constructed from the teacher data after data removal (second teacher data). The number of data to be removed can be determined, for example, by using teacher data that is clearly insufficient for model construction and checking whether that teacher data is judged to be insufficient for model construction.

[Model construction unit]
The model building unit 13 builds a model using the teacher data. For example, the model building unit 13 builds the first model using the first teacher data. Further, the model building unit 13 constructs a model (second model) using the teacher data (second teacher data) from which a part of the first teacher data is removed by the data removing unit 12.

The type of learning algorithm used by the model building unit 13 to build the first model and the second model (for example, SVM (Support Vector Machine), random forest, neural network, etc.) is arbitrary.

The machine learning task will be described by taking the case of data classification as an example, but it may also be regression. Furthermore, the method of determining the parameters at the time of training of both models (for example, using a predetermined one, optimizing by cross validation, etc.) is arbitrary.

[Model comparison unit]
The model comparison unit 14 analyzes the difference between the first model and the second model constructed by the model construction unit 13.

For example, the model comparison unit 14 analyzes, for the case where the data used in the training of both models are classified by each model, which feature amounts contained in the data affect the classification by the model and to what extent.

The analysis method is, for example, a method based on game theory (SHAP), a method based on the gradient of the prediction result (sensitivity map), or a method based on approximation with a simple model (LIME). However, a method is adopted that yields matching analysis results for models that produce the same outputs for various inputs.
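To make the notion of a per-feature degree of influence concrete, the following is a minimal sketch under a strong simplifying assumption: the model is a linear scoring function, so the contribution of each feature of one data item relative to a baseline input is the weight times the deviation from the baseline. For a linear model with features treated as independent, this coincides with the SHAP attribution. The function names and the choice of the feature-wise mean as the baseline are assumptions for illustration, not part of the described device.

```python
def mean_baseline(samples):
    """Feature-wise mean of the teacher data, used as the baseline input."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def linear_influence(weights, sample, baseline):
    """Per-feature influence of one sample under a linear scoring model.

    Assumption: score = sum(w_i * x_i) + bias, so the contribution of
    feature i relative to the baseline is w_i * (x_i - baseline_i).
    """
    return [w * (x - b) for w, x, b in zip(weights, sample, baseline)]
```

By construction, the influences of one data item sum to the difference between its score and the score of the baseline, so two linear models with identical outputs receive identical attributions.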

For example, the model comparison unit 14 specifies, among the feature amounts contained in the first teacher data and the second teacher data, a feature amount for which the difference between its degree of influence on the data classification result in the first model and its degree of influence on the data classification result in the second model is equal to or greater than a predetermined threshold.

For example, letting x_a be the degree of influence of a certain feature amount in the first model and x_p be the degree of influence of that feature amount in the second model, the model comparison unit 14 calculates the difference between the degree of influence x_a and the degree of influence x_p by calculation method 1 shown in equation (1) or by calculation method 2 shown in equation (2).

Note that the model comparison unit 14 may calculate the difference in the degree of influence of the same feature amount between the two models by calculation method 3 shown in equation (3). According to calculation method 3, the model comparison unit 14 can capture the change that occurs when a feature amount having a large influence in the first model has a smaller influence in the second model.

FIG. 2 shows an example of analysis of the difference between the first model and the second model by the model comparison unit 14. Here, the model comparison unit 14 calculates the difference in the degree of influence on the classification results of both models for each of feature amounts 1, 2, and 3 included in data 1 and 2. The calculation methods are calculation methods 1, 2, and 3 described above. The constant c in calculation method 3 was set to 0.010.

As shown in FIG. 2, in the case of calculation method 1, when the degree of influence in the first model and the degree of influence in the second model are both large, the difference in the degree of influence becomes a large value (see reference numeral 201). Further, in the case of calculation method 2, when the degree of influence in the first model and the degree of influence in the second model are both small, the difference in the degree of influence becomes a large value (see reference numeral 202).

On the other hand, in the case of calculation method 3, the difference in the degree of influence does not become a large value when the degrees of influence in the first model and the second model are both large, nor when they are both small. The difference in the degree of influence becomes a large value when the degree of influence is large in the first model and small in the second model (see reference numeral 203).

The model comparison unit 14 specifies a feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is equal to or greater than a predetermined threshold value. Then, the model comparison unit 14 outputs, as the difference between the first model and the second model, for example, the number of data in the first teacher data or the second teacher data having the specified feature amount, or the number of such feature amounts (number of types).

FIG. 3 shows an output example of the difference between the first model and the second model by the model comparison unit 14. Here, a case is shown in which the model comparison unit 14 calculates the difference between the degree of influence of the feature amount in the first model and the degree of influence of the feature amount in the second model by calculation method 3. The threshold value here is "5".

In this case, the model comparison unit 14 outputs "2" as the number of data including a feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is 5 or more. Alternatively, since "feature amount 1" is the only feature amount for which this difference is 5 or more, the model comparison unit 14 outputs "1" as the number of types of feature amounts whose difference in the degree of influence is equal to or greater than the threshold value.
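The two outputs described above can be sketched as follows. This is a minimal illustration that assumes the differences in the degree of influence have already been calculated for each data item and feature amount; the nested-dictionary layout, the function names, and the numeric values are all invented for the example and are not the values of FIG. 3.

```python
def flagged_features(diffs, threshold):
    """Map each data id to the set of features whose difference >= threshold."""
    return {
        data_id: {f for f, d in per_feature.items() if d >= threshold}
        for data_id, per_feature in diffs.items()
    }


def count_flagged_data(diffs, threshold):
    """Number of data containing at least one flagged feature amount."""
    return sum(1 for fs in flagged_features(diffs, threshold).values() if fs)


def count_flagged_feature_types(diffs, threshold):
    """Number of distinct feature types flagged in at least one data item."""
    return len(set().union(*flagged_features(diffs, threshold).values()))


# Hypothetical differences in the degree of influence (illustrative values).
diffs = {
    "data 1": {"feature 1": 6.0, "feature 2": 1.2, "feature 3": 0.8},
    "data 2": {"feature 1": 7.5, "feature 2": 0.9, "feature 3": 2.0},
}
print(count_flagged_data(diffs, threshold=5))           # 2
print(count_flagged_feature_types(diffs, threshold=5))  # 1
```

With these illustrative values and a threshold of 5, both data items contain the flagged "feature 1", so the number of data is 2 and the number of feature types is 1.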

As the difference between the first model and the second model, the model comparison unit 14 may specify the types of feature amounts for which the difference between the degree of influence in the first model and the degree of influence in the second model is equal to or greater than the threshold value, and output the number of data having k or more such types of feature amounts (k being a predetermined value). At this time, the model comparison unit 14 may output the corresponding number of data for each data label or each data type.

Further, the model comparison unit 14 may output, as the difference between the first model and the second model, the number of types of feature amounts for which the difference in the degree of influence is equal to or greater than the threshold value in k or more data. At this time, the model comparison unit 14 may output the number of types of the corresponding feature amounts for each data label or for each data type.
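The two k-based outputs described above can be sketched as follows. The sketch assumes precomputed differences in the degree of influence stored as a mapping from data id to per-feature differences, plus a hypothetical mapping from data id to label; these layouts and the function names are assumptions for illustration.

```python
from collections import Counter


def count_data_with_k_flagged_features(diffs, threshold, k):
    """Number of data having k or more feature types with difference >= threshold."""
    return sum(
        1
        for per_feature in diffs.values()
        if sum(1 for d in per_feature.values() if d >= threshold) >= k
    )


def count_data_with_k_flagged_features_by_label(diffs, labels, threshold, k):
    """Same count as above, broken down by data label."""
    counts = Counter()
    for data_id, per_feature in diffs.items():
        if sum(1 for d in per_feature.values() if d >= threshold) >= k:
            counts[labels[data_id]] += 1
    return dict(counts)


def count_features_flagged_in_k_data(diffs, threshold, k):
    """Number of feature types whose difference >= threshold in k or more data."""
    hits = Counter(
        f
        for per_feature in diffs.values()
        for f, d in per_feature.items()
        if d >= threshold
    )
    return sum(1 for n in hits.values() if n >= k)
```

Raising k makes both counts stricter: a data item must contain several flagged feature types, or a feature type must be flagged across several data items, before it contributes to the output.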

[Determination unit]
Returning to the description of FIG. 1, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction based on the difference between the first model and the second model output by the model comparison unit 14.

For example, when the model comparison unit 14 outputs the number of data having a feature amount in which the difference in the degree of influence between the two models is equal to or greater than a predetermined threshold in the first teacher data or the second teacher data, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction as follows.

For example, the determination unit 15 calculates the ratio of the number of output data to the number of data of the first teacher data or the second teacher data. Then, when the ratio is equal to or higher than a predetermined threshold value, the determination unit 15 determines that the first teacher data is not sufficient data for model construction. On the other hand, when the above ratio is less than the predetermined threshold value, the determination unit 15 determines that the first teacher data is sufficient data for model construction.

For example, when the model comparison unit 14 outputs the number of types of feature amounts for which the difference in the degree of influence between the two models in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold value, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction as follows.

For example, the determination unit 15 calculates the ratio of the number of types of the output feature amount to the number of types of the feature amount included in the first teacher data or the second teacher data. Then, when the ratio is equal to or higher than a predetermined threshold value, the determination unit 15 determines that the first teacher data is not sufficient data for model construction. On the other hand, when the above ratio is less than a predetermined threshold value, the determination unit 15 determines that the first teacher data is sufficient data for model construction.

As described above, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction, and outputs the determination result.
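The ratio-based determination described above can be sketched as follows. This is a minimal illustration; the function name and parameter names are assumptions, and the same function applies whether the counts are numbers of data or numbers of feature types.

```python
def is_teacher_data_sufficient(flagged_count, total_count, ratio_threshold):
    """Return True when the first teacher data is judged sufficient.

    flagged_count: number of data (or feature types) output by the comparison.
    total_count: number of data (or feature types) in the teacher data.
    ratio_threshold: predetermined threshold for the ratio.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    ratio = flagged_count / total_count
    # A ratio at or above the threshold means the two models differ too much,
    # so the first teacher data is judged NOT sufficient.
    return ratio < ratio_threshold
```

Note that a ratio exactly equal to the threshold is treated as insufficient, matching the "equal to or higher than" condition in the text.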

[Example of processing procedure]
Next, an example of the processing procedure of the determination device 10 will be described with reference to FIG. 4. First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S1). Next, the data removal unit 12 generates data (second teacher data) obtained by removing a part of the data from the first teacher data (S2). After that, the model building unit 13 builds the second model using the second teacher data (S3).

After S3, the model comparison unit 14 compares the first model and the second model, and outputs the difference between the first model and the second model (S4). After that, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction based on the difference between the first model and the second model output by the model comparison unit 14 (S5). Then, the determination unit 15 outputs the determination result of S5 (S6).
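The flow of S1 to S6 can be sketched as a small driver function. This is a hedged outline rather than the device's implementation: the four callables (`remove`, `build_model`, `compare`, `decide`) are hypothetical placeholders for the data removal unit 12, the model building unit 13, the model comparison unit 14, and the determination unit 15, and their signatures are assumptions.

```python
def run_determination(first_teacher_data, remove, build_model, compare, decide):
    """Run steps S1 to S6 and return True if the data is judged sufficient."""
    first_model = build_model(first_teacher_data)        # S1
    second_teacher_data = remove(first_teacher_data)     # S2
    second_model = build_model(second_teacher_data)      # S3
    difference = compare(                                # S4
        first_model, second_model, first_teacher_data, second_teacher_data
    )
    sufficient = decide(difference, first_teacher_data)  # S5
    return sufficient                                    # S6
```

Passing the units as callables keeps the sketch independent of the learning algorithm and of the analysis method, both of which the text leaves arbitrary.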

[Specific example of processing procedure]
Next, a specific example of the processing procedure of the determination device 10 will be described with reference to FIG. 5. Here, taking a model for detecting malware as an example, a case will be described in which the determination device 10 determines whether or not the data of a specific malware family in the teacher data used for constructing the model is sufficient for constructing the model. Here, the model for detecting malware is referred to as the first model, and the teacher data used for constructing the model is referred to as the first teacher data.

First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S11). Next, the data removal unit 12 generates data obtained by removing the data of the target malware family from the first teacher data as the second teacher data (S12). After that, the model building unit 13 builds the second model using the second teacher data (S13).

After S13, the model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model by using, for example, the above equation (3), and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S14). Then, the model comparison unit 14 outputs the number of data in the first teacher data or the second teacher data that include a feature amount specified in S14 (S15). Next, the determination unit 15 calculates the following ratio using the number of data output in S15, and determines whether or not the calculated ratio is equal to or greater than a predetermined threshold value (S16).

Number of data output in S15 / (number of data of the first teacher data or the second teacher data)

Here, when the determination unit 15 determines that the calculated ratio is equal to or higher than the predetermined threshold value (Yes in S16), the determination unit 15 determines that the data of the target malware family in the first teacher data is insufficient for model construction (S17). Then, the determination unit 15 outputs this determination result (S18).

On the other hand, when the determination unit 15 determines that the calculated ratio is less than the predetermined threshold value (No in S16), the determination unit 15 determines that the data of the target malware family in the first teacher data is sufficient for model construction (S19). Then, the determination unit 15 outputs this determination result (S20).

By doing so, for a model that detects malware, the determination device 10 can determine whether or not the data of a specific malware family in the teacher data used for constructing the model is sufficient for constructing the model.

[Second Embodiment]
The determination device 10 may further include the comparison result analysis unit 16 shown by a broken line in FIG. 1. The embodiment in this case will be described as the second embodiment of the present invention. The comparison result analysis unit 16 carries out a detailed analysis of the comparison result between the first model and the second model obtained by the model comparison unit 14.

For example, based on the comparison result between the first model and the second model output by the model comparison unit 14, the comparison result analysis unit 16 specifies, among the data removed by the data removal unit 12, data having the same feature amount as a feature amount whose difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 outputs, as the analysis result, information such as the label of the specified data, the type of the data, and the feature amount whose difference in the degree of influence is equal to or greater than the predetermined threshold value.

For example, consider the case where the data used for constructing both models are data 1 and 2 shown in FIG. 6 and the removed data are data 3 and 4. Further, assume that, among the feature amounts included in data 1 and 2, the feature amount having a large difference in the degree of influence between the two models (a difference equal to or greater than the predetermined threshold value) is "feature amount 1", and that the value of this feature amount is "1".

In this case, the comparison result analysis unit 16 specifies data 3 (in which the value of feature amount 1 is "1") as the data, among the removed data 3 and 4, having the same feature amount as the feature amount with a large difference in the degree of influence between the two models. Then, the comparison result analysis unit 16 outputs information regarding data 3. For example, the comparison result analysis unit 16 outputs information ("ransomware") on the malware family to which data 3 belongs.
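The matching step in this example can be sketched as follows. This is a minimal illustration; the dictionary layout and the function name are assumptions, and the values and family given for data 4 are invented, since the text only states the relevant value and family for data 3.

```python
def find_matching_removed_data(removed, flagged):
    """Return removed records that share a flagged feature value.

    removed: list of dicts with "id", "family", and "features" (name -> value).
    flagged: dict mapping each flagged feature name to the value observed in
             the data used for model construction.
    """
    return [
        record
        for record in removed
        if any(
            record["features"].get(name) == value
            for name, value in flagged.items()
        )
    ]


removed = [
    {"id": "data 3", "family": "ransomware", "features": {"feature 1": 1}},
    # Hypothetical contents for data 4 (not given in the text).
    {"id": "data 4", "family": "trojan", "features": {"feature 1": 0}},
]
flagged = {"feature 1": 1}

for record in find_matching_removed_data(removed, flagged):
    print(record["id"], record["family"])  # data 3 ransomware
```

Only data 3 carries the flagged value of feature amount 1, so it is the only record reported, together with its malware family.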

[Example of processing procedure]
Next, an example of the processing procedure of the determination device 10 will be described with reference to FIG. 7. Since the processes of S21 to S25 in FIG. 7 are the same as the processes of S1 to S5 in FIG. 4, their description is omitted, and the description starts from S26 in FIG. 7.

When the determination unit 15 determines in S25 of FIG. 7 that the first teacher data is sufficient data for model construction (No in S26), the determination unit 15 outputs that determination result (S27). On the other hand, when the determination unit 15 determines that the first teacher data is not sufficient data for model construction (Yes in S26), the comparison result analysis unit 16 acquires, from the model comparison unit 14, the feature amounts for which the difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 analyzes which of the data removed in S22 contain those feature amounts (S28). Then, the comparison result analysis unit 16 outputs the analysis result (S29).

[Specific example of processing procedure]
Next, a specific example of the processing procedure of the determination device 10 of the second embodiment will be described with reference to FIG. 8. Here, taking a model for detecting malware as an example, a case will be described in which the determination device 10 determines whether or not the data of each of a plurality of malware families in the teacher data used for constructing the model is sufficient. Here, too, the model for detecting malware is referred to as the first model, and the teacher data used for constructing the model is referred to as the first teacher data.

First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S31). Next, the data removal unit 12 generates data obtained by removing the data of each target malware family from the first teacher data as the second teacher data (S32). After that, the model building unit 13 builds the second model using the second teacher data (S33).

After S33, the model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model using the above equation (3), and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S34). Then, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction by using the feature amounts specified in S34 (S35).

When the determination unit 15 determines in S35 that the first teacher data is sufficient data for model construction (No in S36), the determination unit 15 outputs the above determination result (S37).

On the other hand, when the determination unit 15 determines that the first teacher data is not sufficient data for model construction (Yes in S36), the comparison result analysis unit 16 acquires, from the model comparison unit 14, the feature amounts for which the difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 analyzes which malware family's data, among the data removed in S32, contain those feature amounts (S38).

Then, based on the analysis result, the comparison result analysis unit 16 aggregates, for each malware family, the number of types of the above feature amounts included in the data of that malware family, and specifies the malware families for which the number of types is equal to or greater than a predetermined threshold value (S39). Next, the comparison result analysis unit 16 outputs an analysis result indicating that, within the first teacher data, the data of the malware families specified in S39 is insufficient for model construction (S40).
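The per-family aggregation of S38 to S40 can be sketched as follows. This is a minimal illustration under assumed data layouts: each removed record is a dictionary with a `family` name and a set of feature names, and the flagged feature amounts from S34 are given as a set. The function name and the parameter `min_types` are hypothetical.

```python
def insufficient_families(removed, flagged_features, min_types):
    """Families whose removed data contain >= min_types flagged feature types.

    removed: list of dicts with "family" and "features" (set of feature names).
    flagged_features: set of feature names whose difference in the degree of
                      influence is at or above the threshold (result of S34).
    min_types: predetermined threshold on the number of feature types (S39).
    """
    types_per_family = {}
    for record in removed:
        hits = record["features"] & flagged_features             # S38
        types_per_family.setdefault(record["family"], set()).update(hits)
    return sorted(                                               # S39
        family
        for family, types in types_per_family.items()
        if len(types) >= min_types
    )
```

Feature types are collected in a set per family, so a feature amount that appears in several records of the same family is counted once.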

By doing so, the determination device 10 can analyze which malware family data is insufficient in the teacher data used for constructing the model for the model that detects malware.

[Modified example of the second embodiment]
The determination device 10 of the second embodiment may analyze what kind of change has occurred in the model when the newly collected (acquired) data is added to the construction of the model. The model to be analyzed by the determination device 10 will be described by taking the case of a model for detecting malware as an example, as in the above case. Further, the model for detecting the above malware is referred to as the first model, and the teacher data including the newly collected data used for constructing the model is referred to as the first teacher data.

An example of the processing procedure of the determination device 10 in this case will be described with reference to FIG. 9. First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S51). Next, the data removal unit 12 generates, as the second teacher data, data obtained by removing the newly collected data (for example, data whose time stamp value is after a predetermined date and time) from the first teacher data (S52).

After S52, the model building unit 13 builds the second model using the second teacher data (S53). After S53, the model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model by using, for example, the following equation (4), and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S54). Then, the model comparison unit 14 specifies the data including the feature amounts specified in S54 (S55). For example, the model comparison unit 14 identifies data including the feature amounts specified in S54 from the first teacher data or the second teacher data.

After S55, the determination unit 15 determines whether or not the difference in the degree of influence of the feature amount between the first model and the second model is one that does not seriously affect false detection or oversight of the data specified in S55 (S56). Then, the determination unit 15 outputs the determination result (S57).

For example, when the determination unit 15 determines that the difference in the degree of influence of the feature amount between the first model and the second model does not seriously affect false detection or oversight of the data specified in S55, it determines that the change from the first model to the second model is a reasonable change and outputs that determination result. On the other hand, when the determination unit 15 determines that the difference in the degree of influence of the feature amount between the first model and the second model has a serious effect on false detection or oversight of the data specified in S55, it outputs a determination result to that effect.

Further, the comparison result analysis unit 16 analyzes which of the data removed in S52 contain the feature amounts, specified by the model comparison unit 14 in S54, for which the difference in the degree of influence between the first model and the second model is equal to or greater than a predetermined threshold value (S58). Then, if there is no discrepancy between the analysis result of S58 and the characteristics of the data, the comparison result analysis unit 16 determines that the change from the first model to the second model is a reasonable change, and outputs that fact as the analysis result (S59). On the other hand, if there is a discrepancy between the analysis result of S58 and the characteristics of the data, the comparison result analysis unit 16 determines that the change from the first model to the second model is not a reasonable change, and outputs that fact as the analysis result (S59).

By doing so, the determination device 10 can analyze what kind of change occurs in the model when newly collected (acquired) data is added to the teacher data for constructing the model, and whether or not that change is a reasonable change.

[System configuration, etc.]
Further, each component of each illustrated device is functionally conceptual and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized by a CPU and a program executed by the CPU, or may be realized as hardware by wired logic.

Further, among the processes described in the above embodiment, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
The determination device 10 described above can be implemented by installing a program on a desired computer as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can function as the determination device 10 of each embodiment. The information processing devices referred to here include desktop and notebook personal computers. They also include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and terminals such as PDAs (Personal Digital Assistants).

Further, the determination device 10 can be implemented as a server device that treats a terminal device used by a user as a client and provides the client with services related to the above processing. In this case, the server device may be implemented as a Web server, or as a cloud that provides the services related to the above processing by outsourcing.

FIG. 10 is a diagram showing an example of a computer that executes a determination program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process executed by the determination device 10 is implemented as the program module 1093, in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration of the determination device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

Further, each data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.

The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

10 Determination device
11 Teacher data storage unit
12 Data removal unit
13 Model construction unit
14 Model comparison unit
15 Determination unit
16 Comparison result analysis unit

Claims

A determination device comprising:
a model construction unit that constructs a second model using second teacher data, which is data obtained by removing some data from first teacher data;
a model comparison unit that identifies, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference between a first model constructed using the first teacher data and the second model in the degree of influence on the data classification result is equal to or greater than a predetermined threshold value, and that outputs the number of data including the identified feature amount, or the number of the identified feature amounts, as the difference between the first model and the second model; and
a determination unit that determines, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
The determination device according to claim 1, wherein the model comparison unit outputs, as the difference between the first model and the second model, the number of data having the identified feature amount that exist in the first teacher data or the second teacher data, or the number of the identified feature amounts.
The determination device according to claim 2, wherein,
when the model comparison unit outputs, as the difference between the first model and the second model, the number of data having the identified feature amount that exist in the first teacher data or the second teacher data,
the determination unit determines that the first teacher data is not sufficient data for model construction when the ratio of the number of data having the identified feature amount to the number of data in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold value.
The determination device according to claim 2, wherein,
when the model comparison unit outputs, as the difference between the first model and the second model, the number of the identified feature amounts that exist in the first teacher data or the second teacher data,
the determination unit determines that the first teacher data is not sufficient data for model construction when the ratio of the number of the identified feature amounts to the number of feature amounts existing in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold value.
A determination method executed by a determination device, the method comprising:
a model construction step of constructing a second model using second teacher data, which is data obtained by removing some data from first teacher data;
a model comparison step of identifying, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference between a first model constructed using the first teacher data and the second model in the degree of influence on the data classification result is equal to or greater than a predetermined threshold value, and outputting the number of data including the identified feature amount, or the number of the identified feature amounts, as the difference between the first model and the second model; and
a determination step of determining, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
A determination program that causes a computer to execute:
a model construction step of constructing a second model using second teacher data, which is data obtained by removing some data from first teacher data;
a model comparison step of identifying, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference between a first model constructed using the first teacher data and the second model in the degree of influence on the data classification result is equal to or greater than a predetermined threshold value, and outputting the number of data including the identified feature amount, or the number of the identified feature amounts, as the difference between the first model and the second model; and
a determination step of determining, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
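For reference, the ratio-based determination recited in the claims above can be sketched in Python as follows. This is only an illustrative sketch: representing each datum as the set of feature names it includes, the function names, and the example threshold values are assumptions and are not taken from the claims.

```python
from typing import List, Set


def count_data_with_features(data: List[Set[str]], identified: Set[str]) -> int:
    """Number of data that include at least one identified feature amount."""
    return sum(1 for feats in data if feats & identified)


def teacher_data_is_sufficient(
    difference_count: int, total_count: int, ratio_threshold: float
) -> bool:
    """Sufficient only while the ratio stays below the threshold.

    difference_count : number output as the difference between the models
                       (data count or feature-amount count).
    total_count      : number of data, or number of feature amounts, in the
                       teacher data used as the denominator.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    # Ratio >= threshold means the model changed a lot, i.e. data is insufficient.
    return difference_count / total_count < ratio_threshold


teacher_data = [{"a"}, {"b"}, {"a", "c"}, {"d"}]
identified = {"a"}

n = count_data_with_features(teacher_data, identified)  # 2 of 4 data
print(teacher_data_is_sufficient(n, len(teacher_data), ratio_threshold=0.6))  # True
```

The same `teacher_data_is_sufficient` function covers the feature-count variant by passing the number of identified feature amounts and the total number of feature amounts instead of data counts.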