WO2022107262A1 - Determination device, determination method, and determination program

Info

Publication number
WO2022107262A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
data
teacher data
feature amount
difference
Application number
PCT/JP2020/043087
Other languages
French (fr)
Japanese (ja)
Inventor
俊樹 芝原
大紀 千葉
満昭 秋山
Original Assignee
日本電信電話株式会社
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/043087
Publication of WO2022107262A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • the present invention relates to a determination device, a determination method, and a determination program for determining whether or not the teacher data used for model construction is sufficient.
  • Conventionally, as evaluation methods for a model learned by machine learning, evaluation based on cross-validation and evaluation based on an information criterion (Non-Patent Documents 1 and 2), for example, have been proposed.
  • However, although any of the conventional techniques can judge whether the accuracy of each model is good or bad, none of them can judge whether the teacher data used for constructing the model is sufficient. Therefore, an object of the present invention is to solve the above problem and make it possible to determine whether or not the teacher data used for model construction is sufficient.
  • The present invention includes: a model construction unit that constructs a second model using second teacher data, which is data obtained by removing a part of the data from first teacher data; a model comparison unit that specifies, among the feature amounts contained in the first teacher data and the second teacher data, a feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and outputs the number of data including the specified feature amount or the number of specified feature amounts as the difference between the first model and the second model; and a determination unit that determines, based on that difference, whether or not the first teacher data is sufficient data for model construction.
  • According to the present invention, it is possible to determine whether or not the teacher data used for constructing the model is sufficient.
  • FIG. 1 is a diagram showing a configuration example of a determination device.
  • FIG. 2 is a diagram for explaining an analysis example of the difference between the first model and the second model.
  • FIG. 3 is a diagram showing an analysis example of the difference between the first model and the second model.
  • FIG. 4 is a flowchart showing an example of the processing procedure of the determination device of the first embodiment.
  • FIG. 5 is a flowchart showing a specific example of the processing procedure of the determination device of the first embodiment.
  • FIG. 6 is a diagram for explaining the processing of the comparison result analysis unit of FIG.
  • FIG. 7 is a flowchart showing an example of the processing procedure of the determination device of the second embodiment.
  • FIG. 8 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
  • FIG. 9 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
  • FIG. 10 is a diagram showing a configuration example of a computer that executes a determination program.
  • The determination device constructs a model (second model) using, as teacher data, the data that remains after removing a part of the teacher data of a certain model (first model). The determination device then compares the first model and the second model. If the difference between the two models is large, it determines that the teacher data used for constructing the first model is insufficient for model construction. On the other hand, when the difference between the first model and the second model is small, the determination device determines that the teacher data is sufficient for model construction.
  • the determination device 10 includes, for example, a teacher data storage unit 11, a data removal unit 12, a model construction unit 13, a model comparison unit 14, and a determination unit 15.
  • the comparison result analysis unit 16 shown by the broken line may or may not be equipped, and the case where it is equipped will be described in the second embodiment.
  • the teacher data storage unit 11 stores teacher data used when the model construction unit 13 constructs a model.
  • the teacher data storage unit 11 stores, for example, the first teacher data used for constructing the first model described above.
  • the data removing unit 12 removes a part of the data from the first teacher data.
  • The data removal unit 12 may remove data from the first teacher data at random, or may remove data based on the time when the data was acquired, the data label (malignant or benign, etc.), or the data type (malware family, etc.).
  • For example, based on the label of the data, the data removal unit 12 may remove a part of the malicious data or a part of the data of a specific malware family from the first teacher data, or it may remove the same number of data from all malware families. Further, the data removal unit 12 may remove data in order, starting from the data with the latest acquisition time.
  • The number of data to be removed by the data removal unit 12 is arbitrary, but it is assumed to be a number such that, when the first teacher data is insufficient for model construction, the removal affects the classification result of the model constructed from the teacher data after removal (second teacher data).
  • The number of data to be removed can be determined, for example, by taking teacher data that is clearly insufficient for model construction and checking whether that teacher data is judged to be insufficient for model construction.
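The removal strategies described above (random removal, removal by malware family, and removal starting from the latest acquisition time) could be sketched as follows. This is only an illustrative sketch: the record layout (a dict with `family` and `acquired_at` keys) and the function names are assumptions made for this example and do not appear in the source.

```python
import random
from typing import Dict, List, Optional

# Hypothetical record layout: {"family": str, "acquired_at": int, ...}
Record = Dict[str, object]


def remove_random(data: List[Record], n: int, seed: int = 0) -> List[Record]:
    """Return a copy of data with n randomly chosen records removed."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(data)), min(n, len(data))))
    return [r for i, r in enumerate(data) if i not in drop]


def remove_family(data: List[Record], family: str,
                  n: Optional[int] = None) -> List[Record]:
    """Remove up to n records of the given family (all of them if n is None)."""
    kept, removed = [], 0
    for r in data:
        if r["family"] == family and (n is None or removed < n):
            removed += 1
            continue
        kept.append(r)
    return kept


def remove_latest(data: List[Record], n: int) -> List[Record]:
    """Remove the n most recently acquired records."""
    order = sorted(range(len(data)),
                   key=lambda i: data[i]["acquired_at"], reverse=True)
    drop = set(order[:max(n, 0)])
    return [r for i, r in enumerate(data) if i not in drop]
```

Each function returns the second teacher data and leaves the first teacher data untouched, so both models can be built from the same in-memory list.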
  • the model building unit 13 builds a model using the teacher data. For example, the model building unit 13 builds the first model using the first teacher data. Further, the model building unit 13 constructs a model (second model) using the teacher data (second teacher data) from which a part of the first teacher data is removed by the data removing unit 12.
  • The type of learning algorithm used by the model building unit 13 to build the first model and the second model is arbitrary; examples include an SVM (Support Vector Machine), a random forest, and a neural network.
  • the machine learning task will be described by taking the case of data classification as an example, but it may also be regression. Furthermore, the method of determining the parameters at the time of training of both models (for example, using a predetermined one, optimizing by cross validation, etc.) is arbitrary.
  • the model comparison unit 14 analyzes the difference between the first model and the second model constructed by the model construction unit 13.
  • the model comparison unit 14 analyzes which features contained in the data affect the classification of the model to what extent.
  • The analysis method is, for example, a method based on game theory (SHAP), a method based on the gradient of the prediction result (sensitivity map), or a method based on approximation with a simple model (LIME).
  • The model comparison unit 14 specifies, among the feature amounts contained in the first teacher data and the second teacher data, a feature amount for which the difference between its degree of influence on the data classification result in the first model and its degree of influence on the data classification result in the second model is equal to or greater than a predetermined threshold.
  • For example, the model comparison unit 14 calculates the difference between the influence degree x_a of a feature amount and the influence degree x_p of the same feature amount by calculation method 1 shown in equation (1), or by calculation method 2 shown in equation (2).
  • Alternatively, the model comparison unit 14 may calculate the difference in the degree of influence of the same feature amount between the two models by calculation method 3 shown in equation (3). With calculation method 3, the model comparison unit 14 can capture the change that occurs when a feature amount with a large influence in the first model has a smaller influence in the second model.
  • FIG. 2 shows an example of analysis of the difference between the first model and the second model by the model comparison unit 14.
  • In this example, the model comparison unit 14 calculates, for each of feature amounts 1, 2, and 3 included in data 1 and 2, the difference in the degree of influence on the classification results of the two models.
  • The calculation methods used are calculation methods 1, 2, and 3 above.
  • The value of c in calculation method 3 was set to 0.010.
  • The model comparison unit 14 specifies a feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is equal to or greater than a predetermined threshold value. Then, the model comparison unit 14 outputs, for example, the number of data in the first teacher data or the second teacher data that have the specified feature amount, or the number of such feature amounts (number of types), as the difference between the first model and the second model.
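The comparison step above might look roughly like the sketch below. Equations (1) to (3) are not reproduced in this text, so the sketch uses a plain absolute difference as a stand-in and accepts any other difference function through the `diff` parameter. The nested-dict layout (data id, then feature name, then degree of influence) and all names and values are assumptions made for illustration; in practice the influence values would come from a method such as SHAP or LIME.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical layout: data id -> feature name -> degree of influence
Influence = Dict[str, Dict[str, float]]


def abs_diff(x_a: float, x_p: float) -> float:
    """Stand-in difference; NOT one of the source's equations (1)-(3)."""
    return abs(x_a - x_p)


def compare_models(first: Influence, second: Influence, threshold: float,
                   diff: Callable[[float, float], float] = abs_diff
                   ) -> Tuple[int, int]:
    """Return (number of data, number of feature types) whose influence
    difference between the two models is at or above the threshold."""
    flagged_data: Set[str] = set()
    flagged_features: Set[str] = set()
    for data_id, feats in first.items():
        for feat, x_a in feats.items():
            # Treat a feature missing from the second model as influence 0.
            x_p = second.get(data_id, {}).get(feat, 0.0)
            if diff(x_a, x_p) >= threshold:
                flagged_data.add(data_id)
                flagged_features.add(feat)
    return len(flagged_data), len(flagged_features)


# Made-up influence values, not the values of FIG. 2 or FIG. 3.
first = {"data1": {"f1": 0.9, "f2": 0.1}, "data2": {"f1": 0.8, "f2": 0.1}}
second = {"data1": {"f1": 0.1, "f2": 0.1}, "data2": {"f1": 0.2, "f2": 0.15}}
n_data, n_types = compare_models(first, second, threshold=0.5)
```

With these made-up values, only `f1` changes by at least the threshold, and it does so in both data items, so the function reports two data items and one feature type.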
  • FIG. 3 shows an output example of the difference between the first model and the second model by the model comparison unit 14.
  • The example shows a case where the model comparison unit 14 calculates the difference between the degree of influence of a feature amount in the first model and the degree of influence of that feature amount in the second model by calculation method 3.
  • The threshold value here is "5".
  • In this case, the model comparison unit 14 outputs "2" as the number of data including a feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is 5 or more.
  • Further, since the only feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is 5 or more is "feature amount 1", the model comparison unit 14 outputs "1" as the number of types of feature amounts whose difference in the degree of influence is equal to or greater than the threshold value.
  • As the difference between the first model and the second model, the model comparison unit 14 may also specify the types of feature amounts for which the difference between the degree of influence in the first model and the degree of influence in the second model is equal to or greater than the threshold value, and output the number of data having k (a predetermined value) or more types of such feature amounts. At this time, the model comparison unit 14 may output the corresponding number of data for each data label or each data type.
  • The model comparison unit 14 may also output, as the difference between the first model and the second model, the number of types of feature amounts for which the difference in the degree of influence is equal to or greater than the threshold value in k or more data. At this time, the model comparison unit 14 may output the number of types of the corresponding feature amounts for each data label or for each data type.
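The two k-based variants above can be illustrated with a small sketch. It assumes the flagged features have already been computed per data item; the `flags` and `labels` dictionaries and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def data_with_k_flagged(flags: Dict[str, Set[str]],
                        labels: Dict[str, str], k: int) -> Dict[str, int]:
    """Per label, count the data that have k or more flagged feature types."""
    counts: Counter = Counter()
    for data_id, feats in flags.items():
        if len(feats) >= k:
            counts[labels[data_id]] += 1
    return dict(counts)


def features_flagged_in_k_data(flags: Dict[str, Set[str]], k: int) -> int:
    """Count the feature types that are flagged in k or more data."""
    per_feature = Counter(f for feats in flags.values() for f in feats)
    return sum(1 for c in per_feature.values() if c >= k)
```

Grouping by label here corresponds to the per-label output option; grouping by data type (for example, malware family) would only require passing a different mapping in place of `labels`.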
  • the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction based on the difference between the first model and the second model output by the model comparison unit 14.
  • the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction as follows.
  • The determination unit 15 calculates the ratio of the number of data output by the model comparison unit 14 to the number of data in the first teacher data or the second teacher data. When the ratio is equal to or higher than a predetermined threshold value, the determination unit 15 determines that the first teacher data is not sufficient data for model construction. On the other hand, when the ratio is less than the predetermined threshold value, the determination unit 15 determines that the first teacher data is sufficient data for model construction.
  • When the model comparison unit 14 outputs the number of types of feature amounts for which the difference in the degree of influence between the two models in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold value, the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction as follows.
  • the determination unit 15 calculates the ratio of the number of types of the output feature amount to the number of types of the feature amount included in the first teacher data or the second teacher data. Then, when the ratio is equal to or higher than a predetermined threshold value, the determination unit 15 determines that the first teacher data is not sufficient data for model construction. On the other hand, when the above ratio is less than a predetermined threshold value, the determination unit 15 determines that the first teacher data is sufficient data for model construction.
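The ratio test used by the determination unit 15 is the same in both cases (number of data or number of feature types), so a single helper is enough to illustrate it. The function name and the example threshold are assumptions; the source only says the threshold is predetermined.

```python
def is_sufficient(flagged_count: int, total_count: int,
                  ratio_threshold: float) -> bool:
    """Return True when the teacher data is judged sufficient.

    flagged_count: number of data (or feature types) output by the comparison.
    total_count:   number of data (or feature types) in the teacher data.
    A ratio at or above the threshold means "not sufficient".
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return flagged_count / total_count < ratio_threshold
```

For instance, with a hypothetical threshold of 0.1, 2 flagged data out of 100 would be judged sufficient, while 10 out of 100 would not, because a ratio equal to the threshold counts as insufficient.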
  • the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction, and outputs the determination result.
  • the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S1).
  • the data removing unit 12 generates data (second teacher data) obtained by removing a part of the data from the first teacher data (S2).
  • the model building unit 13 builds the second model using the second teacher data (S3).
  • the model comparison unit 14 compares the first model and the second model, and outputs the difference between the first model and the second model (S4).
  • the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction based on the difference between the first model and the second model output by the model comparison unit 14 (S5). Then, the determination unit 15 outputs the determination result of S5 (S6).
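Steps S1 to S6 can be strung together as in the following sketch. The model builder, the removal strategy, and the influence calculation are passed in as callables because the source leaves all three open; the absolute difference used in S4 is a stand-in for equations (1) to (3), which are not reproduced here. All names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def judge_training_data(
    first_data: List[Record],
    remove: Callable[[List[Record]], List[Record]],
    build_model: Callable[[List[Record]], object],
    influence: Callable[[object, Record], Dict[str, float]],
    diff_threshold: float,
    ratio_threshold: float,
) -> str:
    """Return "sufficient" or "insufficient" for the first teacher data."""
    first_model = build_model(first_data)        # S1: build the first model
    second_data = remove(first_data)             # S2: remove part of the data
    if not second_data:
        raise ValueError("removal left no data for the second model")
    second_model = build_model(second_data)      # S3: build the second model

    flagged = 0                                  # S4: compare the two models
    for rec in second_data:
        a = influence(first_model, rec)
        p = influence(second_model, rec)
        # Assumed stand-in: absolute difference of per-feature influence.
        if any(abs(a[f] - p.get(f, 0.0)) >= diff_threshold for f in a):
            flagged += 1

    ratio = flagged / len(second_data)           # S5: ratio test
    return "insufficient" if ratio >= ratio_threshold else "sufficient"  # S6
```

The sketch evaluates influence on the second teacher data; the source allows either the first or the second teacher data to be used for this count.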
  • A specific example is described in which, for a model that detects malware, the determination device 10 determines whether or not the data of a specific malware family in the teacher data used for constructing the model is sufficient for constructing the model. In the following, the model for detecting malware is referred to as the first model, and the teacher data used for constructing that model is referred to as the first teacher data.
  • the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S11).
  • the data removal unit 12 generates data obtained by removing the data of the target malware family from the first teacher data as the second teacher data (S12).
  • the model building unit 13 builds the second model using the second teacher data (S13).
  • The model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model by using, for example, equation (3) above, and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S14). Then, the model comparison unit 14 outputs the number of data in the first teacher data or the second teacher data that include a feature amount specified in S14 (S15: output the number of data including the specified feature amount). Next, the determination unit 15 calculates a ratio using the number of data output in S15, and determines whether or not the calculated ratio is equal to or greater than a predetermined threshold value (S16).
  • When the determination unit 15 determines that the calculated ratio is equal to or higher than the predetermined threshold value (Yes in S16), it determines that the data of the target malware family in the first teacher data is insufficient for model construction (S17). Then, the determination unit 15 outputs the determination result (S18).
  • On the other hand, when the determination unit 15 determines that the calculated ratio is less than the predetermined threshold value (No in S16), it determines that the data of the target malware family in the first teacher data is sufficient for model construction (S19). Then, the determination unit 15 outputs the determination result (S20).
  • In this way, for a model that detects malware, the determination device 10 can determine whether or not the data of a specific malware family in the teacher data used for constructing the model is sufficient for constructing the model.
  • the determination device 10 may further include a comparison result analysis unit 16 shown by a broken line in FIG.
  • the embodiment in this case will be described as the second embodiment of the present invention.
  • the comparison result analysis unit 16 carries out a detailed analysis of the comparison result between the first model and the second model by the model comparison unit 14.
  • Based on the comparison result between the first model and the second model output by the model comparison unit 14, the comparison result analysis unit 16 specifies, among the data removed by the data removal unit 12, data having the same feature amount as a feature amount whose difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 outputs, as the analysis result, information such as the label of the specified data, the type of the data, and the feature amount whose difference in the degree of influence is equal to or greater than the predetermined threshold value.
  • For example, suppose that the data used for constructing both models are data 1 and 2 shown in FIG. 6, that the excluded data are data 3 and 4, that the feature amount having a large difference in the degree of influence between the two models is "feature amount 1", and that the value of that feature amount is "1".
  • In this case, the comparison result analysis unit 16 identifies, among the excluded data 3 and 4, data 3 (in which the value of feature amount 1 is "1") as data having the same feature amount as the feature amount with a large difference in the degree of influence between the two models. Then, the comparison result analysis unit 16 outputs information regarding data 3. For example, the comparison result analysis unit 16 outputs information on the malware family to which data 3 belongs ("ransomware").
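The lookup performed by the comparison result analysis unit 16 in this example could be sketched as below. The record layout and names are assumptions, and the value of feature amount 1 in data 4 is not stated in the text, so the `0` used in the usage example is purely illustrative.

```python
from typing import Dict, List


def analyze_removed(removed: List[dict],
                    flagged_values: Dict[str, object]) -> List[dict]:
    """Find removed data sharing a flagged feature value.

    flagged_values maps a feature name to the value it takes in the data
    where the influence difference was large, e.g. {"feature_1": 1}.
    """
    hits = []
    for rec in removed:
        matched = [f for f, v in flagged_values.items()
                   if rec["features"].get(f) == v]
        if matched:
            hits.append({
                "id": rec["id"],
                "label": rec.get("label"),
                "family": rec.get("family"),
                "features": matched,
            })
    return hits


# Hypothetical removed data modeled on the example (data 3 and data 4).
removed = [
    {"id": "data3", "features": {"feature_1": 1},
     "label": "malicious", "family": "ransomware"},
    {"id": "data4", "features": {"feature_1": 0},  # assumed value
     "label": "malicious", "family": "trojan"},    # assumed family
]
result = analyze_removed(removed, {"feature_1": 1})
```

Here `result` contains a single entry for `data3`, whose `family` field corresponds to the "ransomware" information output in the example.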
  • the determination unit 15 determines in S25 of FIG. 7 that the first teacher data is sufficient data for model construction (No in S26).
  • The comparison result analysis unit 16 acquires, from the model comparison unit 14, the feature amounts whose difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 analyzes which of the data removed in S22 include a feature amount whose difference in the degree of influence between the two models is equal to or greater than the predetermined threshold value (S28). Then, the comparison result analysis unit 16 outputs the analysis result (S29).
  • A specific example is described in which, for a model that detects malware, the determination device 10 determines whether or not the data of each of a plurality of malware families in the teacher data used for constructing the model is sufficient. As before, the model for detecting malware is referred to as the first model, and the teacher data used for constructing that model is referred to as the first teacher data.
  • the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S31).
  • the data removal unit 12 generates data obtained by removing the data of each target malware family from the first teacher data as the second teacher data (S32).
  • the model building unit 13 builds the second model using the second teacher data (S33).
  • The model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model using equation (3) above, and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S34).
  • the determination unit 15 determines whether or not the first teacher data is sufficient data for model construction by using the feature amount specified in S34 (S35).
  • When the determination unit 15 determines in S35 that the first teacher data is sufficient data for model construction (No in S36), the determination unit 15 outputs the determination result (S37).
  • The comparison result analysis unit 16 acquires, from the model comparison unit 14, the feature amounts whose difference in the degree of influence between the two models is equal to or greater than a predetermined threshold value. Then, the comparison result analysis unit 16 analyzes which malware family's data, among the data removed in S32, contains a feature amount whose difference in the degree of influence between the two models is equal to or greater than the predetermined threshold value (S38).
  • Next, based on the analysis result, the comparison result analysis unit 16 aggregates, for each malware family, the number of types of the above feature amounts included in the data of that malware family, and identifies the malware families for which the number of types is equal to or greater than a predetermined threshold value (S39). Next, the comparison result analysis unit 16 outputs an analysis result indicating that the data of the malware families specified in S39, among the first teacher data, is insufficient for model construction (S40).
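The per-family aggregation in S38 to S40 could be sketched as follows. The sketch assumes binary-style features, where a record "contains" a feature when its value is truthy; that interpretation, the record layout, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def insufficient_families(removed: List[dict], flagged_features: Set[str],
                          type_threshold: int) -> List[str]:
    """Return the malware families whose removed data contain at least
    type_threshold distinct flagged feature types."""
    types_per_family: Dict[str, Set[str]] = defaultdict(set)
    for rec in removed:                                   # S38
        present = {f for f in flagged_features if rec["features"].get(f)}
        types_per_family[rec["family"]] |= present
    return sorted(fam for fam, feats in types_per_family.items()  # S39
                  if len(feats) >= type_threshold)
```

The returned list corresponds to the families reported in S40 as having insufficient data in the first teacher data.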
  • the determination device 10 can analyze which malware family data is insufficient in the teacher data used for constructing the model for the model that detects malware.
  • the determination device 10 of the second embodiment may analyze what kind of change has occurred in the model when the newly collected (acquired) data is added to the construction of the model.
  • the model to be analyzed by the determination device 10 will be described by taking the case of a model for detecting malware as an example, as in the above case. Further, the model for detecting the above malware is referred to as the first model, and the teacher data including the newly collected data used for constructing the model is referred to as the first teacher data.
  • the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S51).
  • the data removal unit 12 generates data obtained by removing newly collected data (for example, data whose time stamp value is after a predetermined date and time) from the first teacher data as the second teacher data (S52).
  • the model building unit 13 builds the second model using the second teacher data (S53).
  • The model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model by using, for example, the following equation (4), and specifies the feature amounts for which the difference is equal to or greater than a predetermined value (S54). Then, the model comparison unit 14 specifies the data including a feature amount specified in S54 (S55). For example, the model comparison unit 14 identifies data including a feature amount specified in S54 from the first teacher data or the second teacher data.
  • After S55, the determination unit 15 determines whether or not the difference in the degree of influence of the feature amount between the first model and the second model has a serious effect on false detection or oversight of the data specified in S55 (S56). Then, the determination unit 15 outputs the determination result (S57).
  • For example, when the determination unit 15 determines that the difference in the degree of influence of the feature amount between the first model and the second model does not seriously affect false detection or oversight of the data specified in S55, it determines that the change from the first model to the second model is a reasonable change and outputs the determination result. On the other hand, when the determination unit 15 determines that the difference in the degree of influence of the feature amount between the first model and the second model has a serious effect on false detection or oversight of the data specified in S55, it outputs that determination result.
  • If there is no discrepancy between the analysis result of S58 and the characteristics of the data, the comparison result analysis unit 16 determines that the change from the first model to the second model is a reasonable change, and outputs that fact as the analysis result (S59: output of analysis result).
  • Otherwise, the comparison result analysis unit 16 determines that the change from the first model to the second model is not a reasonable change, and outputs that fact as the analysis result (S59: output of analysis result).
  • In this way, the determination device 10 can analyze what kind of change has occurred in the model when newly collected (acquired) data is added to the teacher data for constructing the model, and whether or not that change is a reasonable change.
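For this embodiment, the removal step S52 amounts to splitting the first teacher data at a cutoff date and time. A minimal sketch, assuming each record carries a `timestamp` field (a hypothetical name) and that "after" means strictly later than the cutoff:

```python
from datetime import datetime
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_cutoff(data: List[Record],
                    cutoff: datetime) -> Tuple[List[Record], List[Record]]:
    """Split the first teacher data into (second teacher data, new data).

    Records with a timestamp strictly after the cutoff are treated as
    newly collected and are removed to form the second teacher data.
    """
    older = [r for r in data if r["timestamp"] <= cutoff]
    newer = [r for r in data if r["timestamp"] > cutoff]
    return older, newer
```

Keeping the removed records alongside the second teacher data is convenient here, because the later analysis compares the model change against the characteristics of the newly collected data.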
  • Each component of each of the illustrated parts is a functional concept and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figure, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized by a CPU and a program executed by the CPU, or may be realized as hardware by wired logic.
  • the determination device 10 described above can be implemented by installing a program as package software or online software on a desired computer. For example, by causing the information processing device to execute the above program, the information processing device can function as the determination device 10 of each embodiment.
  • the information processing device referred to here includes a desktop type or notebook type personal computer.
  • the information processing device includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System), and terminals such as PDAs (Personal Digital Assistants).
  • the determination device 10 can be implemented as a server device in which the terminal device used by the user is a client and the service related to the above processing is provided to the client.
  • the server device may be implemented as a Web server, or may be implemented as a cloud that provides services related to the above processing by outsourcing.
  • FIG. 10 is a diagram showing an example of a computer that executes a determination program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, OS1091, application program 1092, program module 1093, and program data 1094. That is, the program that defines each process executed by the determination device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • the program module 1093 for executing the same processing as the functional configuration in the determination device 10 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD.
  • each data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

A determination device (10) uses second training data, which is obtained by removing a part of first training data, to construct a model. The determination device (10) then makes a comparison between a first model constructed with use of the first training data and a second model constructed with use of the second training data, and outputs a difference between the first model and the second model. If the difference between the first model and the second model is large, the determination device (10) determines that the first training data is insufficient for model construction. Conversely, if the difference between the first model and the second model is small, the determination device (10) determines that the first training data is sufficient for model construction.

Description

Determination device, determination method, and determination program
The present invention relates to a determination device, a determination method, and a determination program for determining whether the teacher data used for model construction is sufficient.
Machine learning requires collecting a large amount of data in advance in order to build a highly accurate model. However, in fields where data collection is costly, such as security and medical care, collecting a large amount of data in advance is not easy. In such fields, to reduce the data collection cost, it is desirable to stop collecting data once enough has been collected that training on further data no longer improves the accuracy of the model.
Conventionally, methods proposed for evaluating a model trained by machine learning include evaluation based on cross-validation and evaluation based on an information criterion (Non-Patent Documents 1 and 2).
However, while each of these conventional techniques can judge whether the accuracy of a given model is good or bad, none of them can determine whether the teacher data used to construct the model is sufficient. An object of the present invention is therefore to solve this problem and make it possible to determine whether the teacher data used for model construction is sufficient.
To solve the above problem, the present invention comprises: a model construction unit that constructs a second model using second teacher data, which is data obtained by removing a part of the data from first teacher data; a model comparison unit that identifies, among the feature amounts included in the first teacher data and the second teacher data, each feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and outputs the number of data including the identified feature amounts, or the number of identified feature amounts, as the difference between the first model and the second model; and a determination unit that determines, based on the difference between the first model and the second model, whether the first teacher data is sufficient for model construction.
According to the present invention, it is possible to determine whether the teacher data used to construct a model is sufficient.
FIG. 1 is a diagram showing a configuration example of the determination device.
FIG. 2 is a diagram for explaining an example of analyzing the difference between the first model and the second model.
FIG. 3 is a diagram showing an example of analyzing the difference between the first model and the second model.
FIG. 4 is a flowchart showing an example of the processing procedure of the determination device of the first embodiment.
FIG. 5 is a flowchart showing a specific example of the processing procedure of the determination device of the first embodiment.
FIG. 6 is a diagram for explaining the processing of the comparison result analysis unit of FIG. 1.
FIG. 7 is a flowchart showing an example of the processing procedure of the determination device of the second embodiment.
FIG. 8 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
FIG. 9 is a flowchart showing a specific example of the processing procedure of the determination device of the second embodiment.
FIG. 10 is a diagram showing a configuration example of a computer that executes the determination program.
Hereinafter, modes for carrying out the present invention (embodiments) will be described with reference to the drawings, divided into a first embodiment and a second embodiment. The present invention is not limited to the embodiments described below.
[First Embodiment]
[Overview]
An overview of the determination device of the first embodiment is as follows. The determination device constructs a model (the second model) using, as teacher data, the teacher data of a given model (the first model) with a part of the data removed. The determination device then compares the first model with the second model. If the difference between the two models is large, the determination device determines that the teacher data used to construct the first model is insufficient for model construction. If the difference is small, the determination device determines that the teacher data is sufficient for model construction.
[Configuration example]
Next, a configuration example of the determination device 10 will be described with reference to FIG. 1. As shown in FIG. 1, the determination device 10 includes, for example, a teacher data storage unit 11, a data removal unit 12, a model construction unit 13, a model comparison unit 14, and a determination unit 15. The comparison result analysis unit 16, shown by a broken line, is optional; the case where it is provided is described in the second embodiment.
[Teacher data storage unit]
The teacher data storage unit 11 stores the teacher data that the model construction unit 13 uses to construct models. For example, the teacher data storage unit 11 stores the first teacher data used to construct the first model described above.
[Data removal unit]
The data removal unit 12 removes a part of the data from the first teacher data. For example, the data removal unit 12 may remove data from the first teacher data at random, or may remove data based on the time at which the data was acquired, the label of the data (malicious or benign, etc.), or the type of the data (malware family, etc.).
For example, based on the labels of the data, the data removal unit 12 may remove a part of the malicious data from the first teacher data, may remove a part of the data of a specific malware family, or may remove the same number of data from every malware family. The data removal unit 12 may also remove data in order starting from the most recently acquired data.
The number of data removed by the data removal unit 12 is arbitrary, but it should be a number that would be expected to affect the classification results of the model constructed from the teacher data after removal (the second teacher data) if the first teacher data were insufficient for model construction. This number can be determined, for example, by taking teacher data that is clearly insufficient for model construction and checking whether that teacher data is in fact determined to be insufficient.
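The removal strategies described for the data removal unit 12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described device: the `Sample` record and the function names `remove_random`, `remove_latest`, and `remove_family` are hypothetical, and the acquisition time is assumed to be an integer timestamp.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical teacher-data record (field names are assumptions)."""
    features: tuple   # feature values
    label: str        # e.g. "malicious" or "benign"
    family: str       # e.g. a malware family name; "" for benign data
    acquired_at: int  # acquisition time as an integer timestamp


def remove_random(data, n, seed=0):
    """Drop n samples chosen uniformly at random; the rest keep their order."""
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(data)), n))
    return [s for i, s in enumerate(data) if i not in dropped]


def remove_latest(data, n):
    """Drop the n most recently acquired samples; the rest keep their order."""
    newest_first = sorted(range(len(data)),
                          key=lambda i: data[i].acquired_at, reverse=True)
    dropped = set(newest_first[:n])
    return [s for i, s in enumerate(data) if i not in dropped]


def remove_family(data, family):
    """Drop every sample that belongs to the given malware family."""
    return [s for s in data if s.family != family]
```

Each function returns the second teacher data as a new list and leaves the first teacher data untouched, so both models can later be built from the same source. Removing only part of a family would need an additional count parameter.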
[Model construction unit]
The model construction unit 13 constructs models using teacher data. For example, the model construction unit 13 constructs the first model using the first teacher data. The model construction unit 13 also constructs a model (the second model) using the teacher data obtained when the data removal unit 12 removes a part of the first teacher data (the second teacher data).
The type of learning algorithm that the model construction unit 13 uses to construct the first model and the second model (for example, SVM (Support Vector Machine), random forest, or neural network) is arbitrary.
The description here takes data classification as the machine learning task, but the task may also be regression. Furthermore, how the training parameters of the two models are determined (for example, using predetermined values or optimizing them by cross-validation) is arbitrary.
[Model comparison unit]
The model comparison unit 14 analyzes the difference between the first model and the second model constructed by the model construction unit 13.
For example, the model comparison unit 14 analyzes, when the data used to train the two models is classified by each model, which feature amounts in the data affected the model's classification and to what extent.
Examples of analysis methods include a method based on game theory (SHAP), a method based on the gradient of the prediction result (sensitivity map), and a method based on approximation by a simple model (LIME). However, the method adopted should be one whose analysis results coincide for any models that produce identical outputs for the various inputs.
For example, the model comparison unit 14 identifies, among the feature amounts included in the first teacher data and the second teacher data, each feature amount for which the difference between its degree of influence on the data classification result in the first model and its degree of influence on the data classification result in the second model is equal to or greater than a predetermined threshold.
For example, letting x_a be the degree of influence of a certain feature amount in the first model and x_p be the degree of influence of that feature amount in the second model, the model comparison unit 14 calculates the difference between x_a and x_p by calculation method 1 shown in equation (1) or by calculation method 2 shown in equation (2).
Figure JPOXMLDOC01-appb-M000001
Figure JPOXMLDOC01-appb-M000002
The model comparison unit 14 may also calculate the difference in the degree of influence of the same feature amount between the two models by calculation method 3 shown in equation (3). With calculation method 3, the model comparison unit 14 can capture the change that occurs when a feature amount whose degree of influence was large in the first model has a small degree of influence in the second model.
Figure JPOXMLDOC01-appb-M000003
FIG. 2 shows an example of the analysis of the difference between the first model and the second model by the model comparison unit 14. Here, for feature amounts 1, 2, and 3 included in each of data 1 and 2, the model comparison unit 14 calculates the difference in the degree of influence on the classification results between the two models, using calculation methods 1, 2, and 3 above. In calculation method 3, c = 0.010.
As shown in FIG. 2, with calculation method 1 the difference in the degree of influence becomes large when the degree of influence is large in both the first model and the second model (see reference numeral 201). With calculation method 2, the difference becomes large when the degree of influence is small in both the first model and the second model (see reference numeral 202).
With calculation method 3, by contrast, the difference does not become large when the degree of influence is large in both models, nor when it is small in both models. The difference becomes large only when the degree of influence is large in the first model and small in the second model (see reference numeral 203).
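Equations (1) to (3) appear in this text only as figure references, so their exact form cannot be restated here. The sketch below uses three stand-in formulas, chosen purely as assumptions because they reproduce the qualitative behavior described for calculation methods 1, 2, and 3: an absolute difference, a plain ratio, and a ratio smoothed by the constant c. The function names are hypothetical, and the formulas should not be read as the equations of this description.

```python
def diff_absolute(x_a, x_p):
    """Assumed stand-in for calculation method 1: absolute difference."""
    return abs(x_a - x_p)


def diff_ratio(x_a, x_p):
    """Assumed stand-in for calculation method 2: plain ratio (x_p must be nonzero)."""
    return abs(x_a) / abs(x_p)


def diff_smoothed_ratio(x_a, x_p, c=0.010):
    """Assumed stand-in for calculation method 3: ratio damped by the constant c."""
    return abs(x_a) / (abs(x_p) + c)


# (x_a, x_p) pairs: influence in the first model, influence in the second model.
cases = {
    "large in both models": (10.0, 5.0),
    "small in both models": (0.002, 0.0002),
    "large, then small": (1.0, 0.01),
}

for name, (x_a, x_p) in cases.items():
    print(name,
          diff_absolute(x_a, x_p),
          diff_ratio(x_a, x_p),
          diff_smoothed_ratio(x_a, x_p))
```

Under these assumed formulas, the absolute difference is dominated by the case where both influences are large, the plain ratio inflates the case where both are small, and only the smoothed ratio singles out the feature whose influence drops from large to small.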
The model comparison unit 14 identifies each feature amount for which the difference between the degree of influence in the first model and the degree of influence in the second model is equal to or greater than a predetermined threshold. The model comparison unit 14 then outputs, for example, the number of data in the first teacher data or the second teacher data that have the identified feature amounts, or the number (number of types) of the identified feature amounts, as the difference between the first model and the second model.
FIG. 3 shows an example of the difference between the first model and the second model output by the model comparison unit 14. It shows the case where the model comparison unit 14 calculates the difference between the degree of influence of each feature amount in the first model and its degree of influence in the second model by calculation method 3. The threshold here is "5".
In this case, the model comparison unit 14 outputs that the number of data including a feature amount whose difference in degree of influence between the first model and the second model is 5 or more is "2". Alternatively, because "feature amount 1" is the only feature amount whose difference in degree of influence between the two models is 5 or more, the model comparison unit 14 outputs that the number of types of feature amounts whose difference in degree of influence is equal to or greater than the threshold is "1".
As the difference between the first model and the second model, the model comparison unit 14 may instead identify the types of feature amounts whose difference in degree of influence between the two models is equal to or greater than the threshold, and output the number of data that have k (a predetermined value) or more such types of feature amounts. In this case, the model comparison unit 14 may output the corresponding number of data for each data label or for each data type.
Furthermore, as the difference between the first model and the second model, the model comparison unit 14 may output the number of types of feature amounts for which the difference in degree of influence is equal to or greater than the threshold in k or more data. In this case, the model comparison unit 14 may output the corresponding number of types of feature amounts for each data label or for each data type.
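The two outputs of the model comparison unit 14 (a count of data and a count of feature types, each with the optional minimum k) can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the difference values in `diffs` are made up to match the outcome stated for FIG. 3 (threshold 5, two data, one feature type), since the actual values in the figure are not reproduced in this text.

```python
def count_data(diffs, threshold, k=1):
    """Number of data with at least k features whose difference is >= threshold."""
    return sum(
        1 for row in diffs.values()
        if sum(1 for v in row.values() if v >= threshold) >= k
    )


def count_feature_types(diffs, threshold, k=1):
    """Number of feature types whose difference is >= threshold in at least k data."""
    hits = {}
    for row in diffs.values():
        for feature, value in row.items():
            if value >= threshold:
                hits[feature] = hits.get(feature, 0) + 1
    return sum(1 for n in hits.values() if n >= k)


# Hypothetical per-datum, per-feature influence differences between the two models.
diffs = {
    "data 1": {"feature 1": 7.2, "feature 2": 0.3, "feature 3": 1.1},
    "data 2": {"feature 1": 5.4, "feature 2": 0.8, "feature 3": 0.2},
}

print(count_data(diffs, threshold=5))           # data containing a flagged feature
print(count_feature_types(diffs, threshold=5))  # distinct flagged feature types
```

Per-label or per-type output, as mentioned above, would amount to calling the same functions on the subset of `diffs` belonging to each label or data type.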
[Determination unit]
Returning to the description of FIG. 1, the determination unit 15 determines whether the first teacher data is sufficient for model construction, based on the difference between the first model and the second model output by the model comparison unit 14.
For example, when the model comparison unit 14 outputs the number of data in the first teacher data or the second teacher data that have a feature amount whose difference in degree of influence between the two models is equal to or greater than a predetermined threshold, the determination unit 15 determines whether the first teacher data is sufficient for model construction as follows.
The determination unit 15 calculates the ratio of the output number of data to the number of data in the first teacher data or the second teacher data. If this ratio is equal to or greater than a predetermined threshold, the determination unit 15 determines that the first teacher data is not sufficient for model construction. If the ratio is less than the predetermined threshold, the determination unit 15 determines that the first teacher data is sufficient for model construction.
Likewise, when the model comparison unit 14 outputs the number of types of feature amounts in the first teacher data or the second teacher data whose difference in degree of influence between the two models is equal to or greater than a predetermined threshold, the determination unit 15 determines whether the first teacher data is sufficient for model construction as follows.
The determination unit 15 calculates the ratio of the output number of types of feature amounts to the number of types of feature amounts included in the first teacher data or the second teacher data. If this ratio is equal to or greater than a predetermined threshold, the determination unit 15 determines that the first teacher data is not sufficient for model construction. If the ratio is less than the predetermined threshold, the determination unit 15 determines that the first teacher data is sufficient for model construction.
The determination unit 15 determines in this way whether the first teacher data is sufficient for model construction, and outputs the result of the determination.
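The ratio test applied by the determination unit 15 is the same whether the count is a number of data or a number of feature types. A minimal sketch follows, assuming a hypothetical function name `is_sufficient` and a hypothetical threshold value of 0.1; the description does not specify a concrete threshold.

```python
def is_sufficient(flagged_count, total_count, ratio_threshold):
    """Return True if the teacher data is judged sufficient.

    flagged_count: number of data (or feature types) output by the comparison.
    total_count:   number of data (or feature types) in the teacher data.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    ratio = flagged_count / total_count
    # A ratio at or above the threshold means the two models differ too much.
    return ratio < ratio_threshold


print(is_sufficient(2, 100, ratio_threshold=0.1))
print(is_sufficient(10, 100, ratio_threshold=0.1))
```

The comparison is strict on the "sufficient" side, matching the text: a ratio equal to the threshold counts as insufficient.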
[Example of processing procedure]
Next, an example of the processing procedure of the determination device 10 will be described with reference to FIG. 4. First, the model construction unit 13 of the determination device 10 constructs the first model using the first teacher data (S1). Next, the data removal unit 12 generates data obtained by removing a part of the data from the first teacher data (the second teacher data) (S2). The model construction unit 13 then constructs the second model using the second teacher data (S3).
After S3, the model comparison unit 14 compares the first model with the second model and outputs the difference between them (S4). The determination unit 15 then determines, based on the difference between the first model and the second model output by the model comparison unit 14, whether the first teacher data is sufficient for model construction (S5). Finally, the determination unit 15 outputs the determination result of S5 (S6).
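Steps S1 to S6 can be traced end to end with a toy example. Everything below is a sketch under loudly stated assumptions: the learner is a stand-in (one weight per feature, the mean over malicious data minus the mean over benign data) rather than an SVM, random forest, or neural network; the attribution is the weight times the feature value rather than SHAP or LIME; the influence difference is a smoothed ratio with constant c, assumed as a stand-in for calculation method 3; and the threshold values are hypothetical.

```python
def build_model(data):
    """Stand-in learner: weight per feature = mean(malicious) - mean(benign)."""
    n_features = len(data[0][0])

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows) if rows else 0.0

    malicious = [x for x, y in data if y == 1]
    benign = [x for x, y in data if y == 0]
    return [mean(malicious, j) - mean(benign, j) for j in range(n_features)]


def influence(model, x):
    """Stand-in attribution: each feature's contribution to the linear score."""
    return [w * v for w, v in zip(model, x)]


def judge(first_data, removed, diff_threshold=5.0, ratio_threshold=0.2, c=0.010):
    """Run S1 to S6 on (features, label) pairs; `removed` is a set of indices."""
    first_model = build_model(first_data)                                    # S1
    second_data = [s for i, s in enumerate(first_data) if i not in removed]  # S2
    second_model = build_model(second_data)                                  # S3
    flagged = 0                                                              # S4
    for x, _ in first_data:
        pairs = zip(influence(first_model, x), influence(second_model, x))
        if any(abs(x_a) / (abs(x_p) + c) >= diff_threshold for x_a, x_p in pairs):
            flagged += 1
    ratio = flagged / len(first_data)                                        # S5
    return "insufficient" if ratio >= ratio_threshold else "sufficient"      # S6


# Label 1 = malicious, 0 = benign. Only one sample carries feature 2.
sparse = [((1, 1), 1), ((1, 0), 1), ((0, 0), 0), ((0, 0), 0)]
# Two samples carry feature 2, so removing one changes nothing.
redundant = [((1, 1), 1), ((1, 1), 1), ((0, 0), 0), ((0, 0), 0)]

print(judge(sparse, removed={0}))
print(judge(redundant, removed={0}))
```

In the `sparse` set, removing the single sample that carries the second feature drives that feature's weight to zero in the second model, so its influence drops from large to small and the data is flagged. In the `redundant` set the two models are identical and nothing is flagged.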
[Specific example of processing procedure]
Next, a specific example of the processing procedure of the determination device 10 will be described with reference to FIG. 5. The example is a case where, for a model that detects malware, the determination device 10 determines whether the data of a specific malware family in the teacher data used to construct the model is sufficient for model construction. Here, the malware detection model is the first model, and the teacher data used to construct it is the first teacher data.
First, the model construction unit 13 of the determination device 10 constructs the first model using the first teacher data (S11). Next, the data removal unit 12 generates, as the second teacher data, data obtained by removing the data of the target malware family from the first teacher data (S12). The model construction unit 13 then constructs the second model using the second teacher data (S13).
After S13, the model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model, using, for example, equation (3) above, and identifies the feature amounts for which the difference is equal to or greater than a predetermined value (S14). The model comparison unit 14 then outputs the number of data in the first teacher data or the second teacher data that include the feature amounts identified in S14 (S15: output the number of data including the identified feature amounts). Next, the determination unit 15 calculates the following ratio using the number of data output in S15, and determines whether the calculated ratio is equal to or greater than a predetermined threshold (S16).
Number of data output in S15 / (number of data in the first teacher data or the second teacher data)
If the determination unit 15 determines that the calculated ratio is equal to or greater than the predetermined threshold (Yes in S16), it determines that, within the first teacher data, the data of the target malware family is insufficient for model construction (S17). The determination unit 15 then outputs this determination result (S18).
If the determination unit 15 determines that the calculated ratio is less than the predetermined threshold (No in S16), it determines that, within the first teacher data, the data of the target malware family is sufficient for model construction (S19). The determination unit 15 then outputs this determination result (S20).
In this way, for a model that detects malware, the determination device 10 can determine whether the data of a specific malware family in the teacher data used to construct the model is sufficient for model construction.
[Second Embodiment]
The determination device 10 may further include the comparison result analysis unit 16 shown by the broken line in FIG. 1. This case is described as the second embodiment of the present invention. The comparison result analysis unit 16 performs a detailed analysis of the result of the comparison between the first model and the second model by the model comparison unit 14.
For example, based on the result of the comparison between the first model and the second model output by the model comparison unit 14, the comparison result analysis unit 16 identifies, among the data removed by the data removal unit 12, the data that have the same feature amounts as the feature amounts whose difference in degree of influence between the two models is equal to or greater than the predetermined threshold. The comparison result analysis unit 16 then outputs, as the analysis result, information such as the labels of the identified data, the types of the data, and the feature amounts whose difference in degree of influence is equal to or greater than the predetermined threshold.
For example, consider the case where the data used to construct both models are data 1 and 2 shown in FIG. 6, and the removed data are data 3 and 4. Suppose also that, among the feature amounts included in data 1 and 2, the feature amount with a large difference in degree of influence between the two models (a difference equal to or greater than the predetermined threshold) is "feature amount 1", and that the value of this feature amount is "1".
In this case, the comparison result analysis unit 16 identifies data 3 (in which the value of feature amount 1 is "1") as the data, among the removed data 3 and 4, that has the same feature amount as the feature amount with a large difference in degree of influence between the two models. The comparison result analysis unit 16 then outputs information on data 3, such as the malware family to which data 3 belongs ("ransomware").
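The matching step of the comparison result analysis unit 16 can be sketched as a lookup over the removed data. The sketch assumes a hypothetical record layout (dictionaries with `name`, `family`, and `features` keys) and a hypothetical function name. The entry for data 3 follows the FIG. 6 example in the text, while the feature value and family given for data 4 are invented for illustration.

```python
def analyze_removed(removed, flagged_features):
    """Find removed data that share a flagged feature value.

    removed:          list of dicts with "name", "family", and "features" keys.
    flagged_features: maps each high-difference feature to the value it takes
                      in the data used to construct both models.
    """
    results = []
    for datum in removed:
        shared = [
            feature for feature, value in flagged_features.items()
            if datum["features"].get(feature) == value
        ]
        if shared:
            results.append({
                "name": datum["name"],
                "family": datum["family"],
                "features": shared,
            })
    return results


removed = [
    {"name": "data 3", "family": "ransomware", "features": {"feature 1": 1}},
    # The values for data 4 are assumptions; the text does not state them.
    {"name": "data 4", "family": "worm", "features": {"feature 1": 0}},
]
flagged_features = {"feature 1": 1}

print(analyze_removed(removed, flagged_features))
```

Returning the shared feature names alongside the family keeps the output close to what the text lists as analysis results: the data's label or type and the feature amounts with a large influence difference.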
[Example of processing procedure]
Next, an example of the processing procedure of the determination device 10 will be described with reference to FIG. 7. The processing of S21 to S25 in FIG. 7 is the same as that of S1 to S5 in FIG. 4, so its description is omitted and the description starts from S26 in FIG. 7.
If the determination unit 15 determines in S25 of FIG. 7 that the first teacher data is sufficient for model construction (No in S26), it outputs this determination result (S27). If the determination unit 15 determines that the first teacher data is not sufficient for model construction (Yes in S26), the comparison result analysis unit 16 acquires from the model comparison unit 14 the feature amounts whose difference in degree of influence between the two models is equal to or greater than the predetermined threshold. The comparison result analysis unit 16 then analyzes which of the data removed in S22 include those feature amounts (S28), and outputs the analysis result (S29).
[Specific example of processing procedure]
Next, a specific example of the processing procedure of the determination device 10 of the second embodiment will be described with reference to FIG. 8. The description takes as an example a case in which, for a model that detects malware, the determination device 10 determines whether the data of each of a plurality of malware families in the teacher data used to construct the model is sufficient. Here too, the malware detection model is referred to as the first model, and the teacher data used to construct it is referred to as the first teacher data.
First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S31). Next, the data removal unit 12 generates, as the second teacher data, data obtained by removing the data of each target malware family from the first teacher data (S32). The model building unit 13 then builds the second model using the second teacher data (S33).
After S33, the model comparison unit 14 uses the above equation (3) to calculate the difference in the degree of influence of each feature amount between the first model and the second model, and identifies the feature amounts for which the difference is equal to or greater than a predetermined value (S34). The determination unit 15 then uses the feature amounts identified in S34 to determine whether or not the first teacher data is sufficient data for model construction (S35).
When the determination unit 15 determines in S35 that the first teacher data is sufficient data for model construction (No in S36), it outputs that determination result (S37).
On the other hand, when the determination unit 15 determines that the first teacher data is not sufficient data for model construction (Yes in S36), the comparison result analysis unit 16 acquires, from the model comparison unit 14, the feature amounts whose difference in degree of influence between the two models is equal to or greater than a predetermined threshold. The comparison result analysis unit 16 then analyzes which of the malware families whose data was removed in S32 contain those feature amounts in their data (S38).
Based on the analysis result, the comparison result analysis unit 16 then counts, for each malware family, the number of types of the above feature amounts included in the data of that malware family, and identifies the malware families for which this number is equal to or greater than a predetermined threshold (S39). Next, the comparison result analysis unit 16 outputs an analysis result indicating that, within the first teacher data, the data of the malware families identified in S39 is insufficient for model construction (S40).
In this way, for a model that detects malware, the determination device 10 can analyze which malware family's data is insufficient in the teacher data used to construct the model.
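The per-family aggregation of S38 to S40 can be sketched as below. This is a minimal sketch under assumptions, not the implementation of the application: each removed data item is represented as a hypothetical `(family, set of feature names)` pair, the set of flagged feature amounts is taken as already computed in S34, and the family names and threshold are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def insufficient_families(
    removed: Iterable[Tuple[str, Set[str]]],
    flagged: Set[str],
    min_feature_types: int,
) -> List[str]:
    """Return the malware families whose removed data contains at least
    `min_feature_types` distinct flagged feature amounts.

    removed: (family name, feature amounts present in one data item) pairs.
    flagged: feature amounts whose influence difference met the threshold.
    """
    per_family: Dict[str, Set[str]] = defaultdict(set)
    for family, features in removed:
        # Collect the distinct flagged feature types seen in this family (S38).
        per_family[family] |= features & flagged
    # Keep the families whose count of types reaches the threshold (S39).
    return sorted(
        family
        for family, types in per_family.items()
        if len(types) >= min_feature_types
    )


if __name__ == "__main__":
    removed_items = [
        ("ransomware", {"f1", "f2"}),
        ("ransomware", {"f2", "f3"}),
        ("downloader", {"f3"}),
        ("downloader", {"f4"}),
    ]
    # Families reported here would be output as "insufficient" in S40.
    print(insufficient_families(removed_items, {"f1", "f2", "f3"}, 2))
```

Counting distinct feature types per family, rather than data items, follows the wording of S39; a count of data items would be a different criterion.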
[Modified example of the second embodiment]
The determination device 10 of the second embodiment may also analyze what kind of change occurred in a model when newly collected (acquired) data was added to the data used to construct the model. As in the case above, the description takes as an example a case in which the model analyzed by the determination device 10 is a model that detects malware. The malware detection model is referred to as the first model, and the teacher data used to construct it, which includes the newly collected data, is referred to as the first teacher data.
An example of the processing procedure of the determination device 10 in this case will be described with reference to FIG. 9. First, the model building unit 13 of the determination device 10 builds the first model using the first teacher data (S51). Next, the data removal unit 12 generates, as the second teacher data, data obtained by removing the newly collected data (for example, data whose timestamp is on or after a predetermined date and time) from the first teacher data (S52).
After S52, the model building unit 13 builds the second model using the second teacher data (S53). After S53, the model comparison unit 14 calculates the difference in the degree of influence of each feature amount between the first model and the second model and identifies the feature amounts for which the difference is equal to or greater than a predetermined value (S54). The model comparison unit 14 then identifies the data including the feature amounts identified in S54 (S55), for example, from the first teacher data or the second teacher data. The difference in S54 is calculated using, for example, the following equation (4).
Figure JPOXMLDOC01-appb-M000004
After S55, the determination unit 15 determines whether or not the difference in the degree of influence of the feature amounts between the first model and the second model is one that does not seriously affect false detection or missed detection of the data identified in S55 (S56). The determination unit 15 then outputs the determination result (S57).
For example, when the determination unit 15 determines that the difference in the degree of influence of the feature amounts between the first model and the second model does not seriously affect false detection or missed detection of the data identified in S55, it determines that the change from the first model to the second model is a reasonable change and outputs that determination result. On the other hand, when the determination unit 15 determines that the difference does seriously affect false detection or missed detection of the data identified in S55, it outputs that determination result.
In addition, the comparison result analysis unit 16 analyzes which of the data removed in S52 contain the feature amounts, identified by the model comparison unit 14 in S54, whose difference in degree of influence between the first model and the second model is equal to or greater than a predetermined threshold (S58). If there is no discrepancy between the result of the analysis in S58 and the characteristics of the data, the comparison result analysis unit 16 determines that the change from the first model to the second model is a reasonable change and outputs that as the analysis result (S59). If, on the other hand, there is a discrepancy between the result of the analysis in S58 and the characteristics of the data, the comparison result analysis unit 16 determines that the change from the first model to the second model is not a reasonable change and outputs that as the analysis result (S59).
In this way, the determination device 10 can analyze what kind of change occurred in the model when newly collected (acquired) data was added to the teacher data used to construct the model, and whether or not that change is a reasonable one.
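The data handling in S52 and S55 can be sketched as follows. This is a rough sketch under assumptions rather than the implementation of the application: the `Sample` type and both function names are hypothetical, and because equation (4) is not reproduced in the text, the set of feature amounts identified in S54 is simply passed in as an input instead of being computed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Sample:
    """One item of teacher data (hypothetical layout)."""
    sample_id: str
    collected_at: datetime
    features: FrozenSet[str]


def split_by_cutoff(
    first_teacher_data: List[Sample], cutoff: datetime
) -> Tuple[List[Sample], List[Sample]]:
    """S52: treat data stamped on or after `cutoff` as newly collected.

    Returns (second teacher data, removed newly collected data).
    """
    second = [s for s in first_teacher_data if s.collected_at < cutoff]
    removed = [s for s in first_teacher_data if s.collected_at >= cutoff]
    return second, removed


def samples_containing(samples: List[Sample], flagged: Set[str]) -> List[str]:
    """S55: IDs of the data items that include any flagged feature amount."""
    return [s.sample_id for s in samples if s.features & flagged]


if __name__ == "__main__":
    data = [
        Sample("a", datetime(2020, 1, 1), frozenset({"f1"})),
        Sample("b", datetime(2020, 6, 1), frozenset({"f2"})),
        Sample("c", datetime(2020, 7, 1), frozenset({"f1", "f3"})),
    ]
    second, removed = split_by_cutoff(data, datetime(2020, 6, 1))
    print([s.sample_id for s in second], [s.sample_id for s in removed])
    # `flagged` stands in for the feature amounts identified in S54.
    print(samples_containing(data, {"f3"}))
```

Applying `samples_containing` to the removed list instead of the full data corresponds to the S58 question of which removed data contain the flagged feature amounts.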
[System configuration, etc.]
Each component of each illustrated unit is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the one illustrated, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program executed by the CPU, or may be realized as hardware by wired logic.
Among the processes described in the above embodiments, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.
[Program]
The determination device 10 described above can be implemented by installing a program on a desired computer as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the determination device 10 of each embodiment. The information processing device referred to here includes desktop and notebook personal computers. It also includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and terminals such as PDAs (Personal Digital Assistants).
The determination device 10 can also be implemented as a server device that treats a terminal device used by a user as a client and provides the client with a service related to the above processing. In this case, the server device may be implemented as a Web server, or as a cloud that provides the service related to the above processing by outsourcing.
FIG. 10 is a diagram showing an example of a computer that executes the determination program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process executed by the determination device 10 is implemented as the program module 1093, in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration of the determination device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.
Each piece of data used in the processing of the above embodiments is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.
The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like). The program module 1093 and the program data 1094 may then be read from the other computer by the CPU 1020 via the network interface 1070.
10 Determination device
11 Teacher data storage unit
12 Data removal unit
13 Model construction unit
14 Model comparison unit
15 Determination unit
16 Comparison result analysis unit

Claims (6)

  1.  A determination device comprising:
     a model construction unit that constructs a second model using second teacher data, which is data obtained by removing some data from first teacher data;
     a model comparison unit that identifies, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and outputs the number of data items including the identified feature amount or the number of the identified feature amounts as the difference between the first model and the second model; and
     a determination unit that determines, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
  2.  The determination device according to claim 1, wherein
     the model comparison unit outputs, as the difference between the first model and the second model, the number of data items having the identified feature amount, or the number of the identified feature amounts, present in the first teacher data or the second teacher data.
  3.  The determination device according to claim 2, wherein,
     when the model comparison unit outputs, as the difference between the first model and the second model, the number of data items having the identified feature amount present in the first teacher data or the second teacher data,
     the determination unit determines that the first teacher data is not sufficient data for model construction if the ratio of the number of data items having the identified feature amount to the number of data items in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold.
  4.  The determination device according to claim 2, wherein,
     when the model comparison unit outputs, as the difference between the first model and the second model, the number of the identified feature amounts present in the first teacher data or the second teacher data,
     the determination unit determines that the first teacher data is not sufficient data for model construction if the ratio of the number of the identified feature amounts to the number of feature amounts present in the first teacher data or the second teacher data is equal to or greater than a predetermined threshold.
  5.  A determination method executed by a determination device, the method comprising:
     a model construction step of constructing a second model using second teacher data, which is data obtained by removing some data from first teacher data;
     a model comparison step of identifying, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and outputting the number of data items including the identified feature amount or the number of the identified feature amounts as the difference between the first model and the second model; and
     a determination step of determining, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
  6.  A determination program that causes a computer to execute:
     a model construction step of constructing a second model using second teacher data, which is data obtained by removing some data from first teacher data;
     a model comparison step of identifying, among the feature amounts included in the first teacher data and the second teacher data, a feature amount for which the difference in the degree of influence on the data classification result between a first model constructed using the first teacher data and the second model is equal to or greater than a predetermined threshold, and outputting the number of data items including the identified feature amount or the number of the identified feature amounts as the difference between the first model and the second model; and
     a determination step of determining, based on the difference between the first model and the second model, whether or not the first teacher data is sufficient data for model construction.
PCT/JP2020/043087 2020-11-18 2020-11-18 Determination device, determination method, and determination program WO2022107262A1 (en)

Publications (1)

Publication Number Publication Date
WO2022107262A1 true WO2022107262A1 (en) 2022-05-27

Family

ID=81708616

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20962426; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20962426; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)