EP4182859A1 - Generation of noisy training data copies in an anomaly detection method - Google Patents
Generation of noisy training data copies in an anomaly detection method
- Publication number
- EP4182859A1 (application EP21755801.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- noisy
- data
- training data
- learning module
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Definitions
- the invention relates to a method for detecting anomalies.
- the invention relates to an automatic learning method for the detection of abnormal data in a data set.
- Anomaly detection is a well-known subject, especially in the field of data mining, with many industrial applications such as sorting objects on a recycling chain, monitoring measurement sensors, or any other plant supervision application.
- the training data is generally composed of a large majority of "normal" data, while anomaly data is generally weakly represented. This raises a significant concern about the robustness of the training of machine learning modules, in particular when they are supervised.
- the SMOTE method tends to densify sparsely represented data areas so as to reduce the imbalance of the data set.
- the training datasets include few or no examples of anomalous data, and there is a strong constraint on the accuracy of the detections obtained.
- a method for detecting anomalies, implemented by computer in a data set, comprising the provision of a labeled training data set comprising at least one training datum representative of a normal state of the analyzed data.
- the process includes:
- a step of forming a set of noisy training data comprising said set of training data and said at least one associated noisy copy
- the at least one noise generation parameter, which includes a maximum noise amplitude to be added to all or part of the data of the training data set, constitutes a simple and particularly robust parameter for obtaining distinct noisy copies in a controlled manner.
- the step of generating a noisy copy may in particular comprise the generation of noisy copies comprising the noising of only one part of the training data set, for example by random selection of the data to be noised, so as to ensure a completely random distribution of the noisy data.
- the noise parameter can have a different value but can also take identical values for all or part of the noisy copies. Indeed, the noise being added randomly, the same noise parameter will generate different noisy copies so that their performance may vary.
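The noise-amplitude parameter described above can be sketched in Python. This is a minimal illustration only: the function name, the uniform noise model and the random selection of samples are assumptions, not the patent's exact procedure.

```python
import random

def make_noisy_copy(data, noise_amplitude, fraction=1.0, rng=None):
    """Return a noisy copy of `data` (a list of feature vectors).

    Uniform noise bounded by `noise_amplitude` (the maximum noise
    amplitude to add) is applied to a randomly selected `fraction`
    of the samples; the remaining samples are copied unchanged.
    """
    rng = rng or random.Random()
    noisy = []
    for vector in data:
        if rng.random() <= fraction:
            # add bounded random noise to every component of this sample
            noisy.append([x + rng.uniform(-noise_amplitude, noise_amplitude)
                          for x in vector])
        else:
            noisy.append(list(vector))
    return noisy
```

Because the noise is drawn at random, two calls with the same `noise_amplitude` produce distinct copies, which is consistent with the remark above that identical parameter values can still yield copies with varying performance.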
- the method implements in parallel the steps of generating a plurality of noisy copies, and for each noisy copy generated, the method implements in parallel, the steps of constitution, training, and calculation, the determining step determining the set of noisy training data having a maximum detection performance among the performances calculated by the parallel calculation steps.
- said generation, constitution, training and calculation steps are implemented iteratively, for example incrementally or dichotomously, so that at each iteration the detection performance of said automatic learning module calculated is compared with the detection performance of said automatic learning module calculated at the previous iteration.
- an iterative implementation allows a relatively lightweight approach, especially in terms of memory footprint, to determine the optimal noise parameter value.
- said generation, constitution, training and calculation steps are implemented incrementally, so that at each iteration the noise generation parameter value is incremented by a predetermined step, and if the performance calculated at the previous iteration is lower than that calculated at the current iteration, a new iteration of the generation, constitution, training and calculation steps is carried out.
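The incremental variant just described can be summarized by a short loop. This is a sketch under assumptions: `evaluate` stands in for the generation, constitution, training and calculation steps for one threshold value.

```python
def incremental_noise_search(evaluate, start, step, max_value):
    """Increase the noise threshold step by step while the detection
    performance keeps improving, then return (best_threshold, best_score).

    `evaluate(threshold)` is assumed to generate the noisy copy, build
    the noisy training set, train the module and return its detection
    performance for that threshold.
    """
    best_threshold = start
    best_score = evaluate(start)
    threshold = start + step
    while threshold <= max_value:
        score = evaluate(threshold)
        if score <= best_score:
            # no improvement over the previous iteration: stop here,
            # the previous threshold is retained as the optimum
            break
        best_threshold, best_score = threshold, score
        threshold += step
    return best_threshold, best_score
```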
- the set of labeled training data also includes at least one example of anomalous data.
- this makes the machine learning module even more robust, with even better control over false positives, without diminishing the anomaly detection performance.
- the step of calculating the performance of detection of said automatic learning module is also a function of the rate of false negatives obtained by the implementation of the automatic learning module on at least one example of abnormal data.
- said step of calculating the detection performance comprises calculating an average between the rate of false positives and the rate of false negatives. This is a simple and relatively reliable method for calculating overall performance.
- said average is weighted according to a predetermined coefficient.
- the weighting makes it possible to give more weight either to the false-positives or to the false-negatives in the overall calculation of the performance, in order to take into account the most important criterion of the two, in particular with regard to the technical field for which the process is implemented.
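The weighted average of the two error rates can be read as follows; turning the weighted error into a score in [0, 1] is an illustrative convention, not the patent's prescribed formula.

```python
def detection_performance(fp_rate, fn_rate, weight=0.5):
    """Combine false-positive and false-negative rates into one score.

    `weight` is a predetermined coefficient giving the relative
    importance of false positives (weight=0.5 is the plain average).
    The weighted error is mapped to a score in [0, 1], higher is better.
    """
    error = weight * fp_rate + (1.0 - weight) * fn_rate
    return 1.0 - error
```

For instance, a sorting line where false negatives (missed anomalies) are costlier than false positives would use a `weight` below 0.5.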
- the determination step further comprises a comparison of the determined maximum detection performance with a target performance value, so that if the maximum detection performance is lower than the target performance value, a new implementation of the method is carried out with new noise generation parameter values.
- the method according to the invention can be relaunched with new noisy copies, for example by changing the noise parameters, so as to continue the search for a set of noisy data showing satisfactory performance.
- said at least one noise generation parameter comprises a statistical noise distribution, such as additive white Gaussian noise or colored noise.
- the machine learning module comprises one module among: a one-class support vector machine, a decision tree, a forest of decision trees, a k-nearest-neighbors method or an auto-encoder.
- the invention also relates to a method for classifying a material, to be classified in a list of predetermined materials, such as a list of plastic resins, each material being associated with at least one measurement parameter, such as an absorption spectrum; the classification method comprising, for each material of the predetermined list of materials, the implementation of the anomaly detection method as described previously.
- the invention also relates to a computer program comprising program code instructions for the execution of the steps of the anomaly detection method as described previously and/or for the execution of the preceding classification method, when said program is run on a computer.
- FIG. 1 is a flowchart of a first embodiment of the method according to the invention.
- FIG. 2 is a flowchart of a second embodiment of the method according to the invention.
- FIG. 3 is a schematic representation of the noisy and non-noisy training datasets implemented by the invention.
- method 1 is implemented to detect non-recyclable plastic resins in a polyethylene recycling chain.
- the invention can be implemented in any technical field, whether in the field of biology, for example to carry out the detection of bacteria, in astrophysics or in any other technical field where the detection fault should be performed.
- Anomaly detection method 1 implements an unsupervised machine learning module.
- Any type of unsupervised machine learning module can be implemented. Indeed, as will emerge from the description, the invention is based in particular on the automatic optimization of the training data. The method makes it possible in this respect to optimize the implementation of any type of automatic learning module.
- one-class support vector machines called one-class SVM
- decision trees, known as Decision Trees
- decision tree forests, in particular Isolation Forests, auto-encoders, or even the k-nearest-neighbors method. All these methods are well known to those skilled in the art, and their particular implementation is not discussed here.
- the objective is not to classify the data among a set of classes, as is done in a classification process, but to determine whether each input datum belongs to the unique class sought or not, in other words whether it is an anomaly with respect to the other data.
- the one-class SVM is particularly useful for detecting observations or measurements whose characteristics differ significantly from the expected data. Such data are commonly referred to in the technical field as outliers.
- the module looks for a hyperplane, like the classic SVM, with the difference that this hyperplane aims to separate all the data from the origin in the space into which the data is projected, while maximizing the distance from the hyperplane to the origin, in other words the margin between the data and the origin.
- the method separates the space by a hypersphere which aims to encompass all the training data and to find the smallest hypersphere allowing to encompass all this data.
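As an illustration of the hypersphere idea, here is a deliberately simplified stand-in in Python: it fits a centroid-centred sphere around the training data rather than solving the actual one-class SVM / smallest-enclosing-hypersphere optimization, so it only mimics the behaviour described above.

```python
class HypersphereDetector:
    """Toy one-class detector: fit a centroid-centred hypersphere that
    encompasses all the training data, then flag as anomalous any point
    falling outside it."""

    def fit(self, data):
        dim = len(data[0])
        n = len(data)
        # centroid of the training data (a simplification: the true
        # smallest enclosing sphere is not necessarily centred here)
        self.center = [sum(v[i] for v in data) / n for i in range(dim)]
        # smallest radius around the centroid covering all samples
        self.radius = max(self._dist(v) for v in data)
        return self

    def _dist(self, v):
        return sum((a - b) ** 2 for a, b in zip(v, self.center)) ** 0.5

    def is_anomaly(self, v):
        return self._dist(v) > self.radius
```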
- the risk of false positives, in other words the machine learning module erroneously detects data as "abnormal", and the risk of false negatives, in other words the module detects data as "normal" when it is actually "abnormal" data.
- over-sorting steps can eliminate non-recyclable materials, which will lead to an increase in waste treatment costs, ranging for example from €80 to €150 per tonne.
- the impact relates to the quality of the recycled material produced: if an incompatible material is mixed in, the expected properties are modified; as a consequence, the recycled materials cannot be marketed on the expected markets and are sent to markets with lower added value, with a loss of between €100 and €300 per tonne, for example.
- the invention aims to reduce the risk of all false detections in the case of the implementation of unsupervised machine learning modules.
- Method 1 uses multiple datasets.
- Figure 3 is a schematic representation of a training data set 20.
- each datum in Figure 3 is a point of the plane, in other words a two-dimensional vector, so that the training data set 20 can be represented in a plane.
- the reference 21 represents the noisy copy 21 of the training data. But the training data 20 being superimposed on the noisy data 21, not all the data of the noisy copy 21 can be seen in FIG. 3. In practice, the data generally has dimensions greater than 2 or 3.
- the training data set 20 may also include examples of anomalies. This is particularly advantageous because these examples of anomalies make it possible to obtain a more robust machine learning module by further improving control over false positive detection.
- a set of labeled test data representative of the normal state of the analyzed data is also provided.
- This set of test data, which is a classic element of machine learning methods, makes it possible to calculate the performance of a trained machine learning module, by identifying the rate of correct detections made on a set of data whose labels are already known.
- Method 1 trains the unsupervised automatic learning module from a set of noisy training data.
- the objective is to optimally calibrate the added noise to improve the robustness of the learning module.
- the noise added to the training data 20 is determined according to at least one noise parameter, in particular according to a random value limited in amplitude.
- an input datum of a machine learning module, whether training data, test data or data to be analyzed, comprises a plurality of numerical values, usually organized as a vector or a matrix.
- a labeled datum associates an absorption spectrum with a material, here polyethylene.
- the absorption spectrum is then an N-dimensional vector of absorbance values, N being the number of sample values in the spectrum.
- the analysis spectrum being obtained by a so-called near infrared method
- the spectrum is for example obtained between 780 nm and 2500 nm.
- a number N of sampling points of the spectrum obtained during the analysis of the object is then defined, forming input data associated with the object.
- "noise threshold" is understood to mean the maximum amplitude of the generated noise.
- the noise generation method for the different data can be a method of generating an additive white Gaussian noise or a colored noise, in other words a noise whose power spectral density is not constant over the spectrum. This choice can be made by a person skilled in the art.
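The two noise families mentioned, additive white Gaussian noise and colored noise, can be sketched as follows. Clipping to the noise threshold and the AR(1) filter with coefficient `alpha` are illustrative assumptions, not choices prescribed by the text.

```python
import random

def white_noise(n, amplitude, rng):
    """Additive white Gaussian noise, clipped to the noise threshold
    (the sigma = amplitude / 3 choice is illustrative)."""
    return [max(-amplitude, min(amplitude, rng.gauss(0.0, amplitude / 3.0)))
            for _ in range(n)]

def colored_noise(n, amplitude, rng, alpha=0.9):
    """Colored noise obtained by AR(1) filtering of white noise, so its
    power spectral density is no longer constant over the spectrum.
    `alpha` controls the correlation between successive samples."""
    out, prev = [], 0.0
    for w in white_noise(n, amplitude, rng):
        prev = alpha * prev + (1.0 - alpha) * w
        out.append(prev)
    return out
```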
- the generation step 10 of a noisy copy 21 does not necessarily include the noising of all the data of the training data set, but may include the noising of only part of this set, for example by random selection of the data to be noised, so as to ensure a completely random distribution of the noisy data.
- the amount of noisy input data may also be increased a posteriori if the detection performance, as calculated in step 13 described later, is not sufficient.
- the search for the optimal noise value is carried out by a grid search method in which a plurality of noise parameters are used in parallel.
- the number of noisy copies 21 computed is determined by the person skilled in the art according to the available computation time, the desired search granularity and the desired training speed.
- each generation step 10-10” is independent, and each noisy copy 21 obtained is produced according to a noise threshold value different from the other noisy copies, or according to a different noise mode (white noise, colored noise).
- noisy copies 21 are obtained in parallel.
- step 11-11” of constitution of a noisy training data set 22, comprising the training data set 20 and its noisy copy 21. Also, one obtains as many noisy training data sets 22 as there are distinct noisy copies 21.
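Putting the grid-search steps together, each noisy training data set (22 in the figures) can be formed as the union of the training set (20) and one noisy copy (21) per grid value. The names and the `make_copy` callback below are hypothetical.

```python
def build_noisy_training_sets(training_data, thresholds, make_copy):
    """One noisy training data set per grid value: each set is the
    union of the original training set and one noisy copy generated
    with a distinct noise threshold.

    `make_copy(data, threshold)` is a placeholder for the noisy-copy
    generation step.
    """
    return {t: training_data + make_copy(training_data, t) for t in thresholds}
```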
- the training data is divided into two samples: a first training sub-sample comprising 80% of the training data set and a second test sub-sample made of the remaining 20%, not to be confused with the test data set of computation step 13.
- the training sub-steps based on the training sub-sample are repeated several times, and the hyper-parameters of the machine learning module are adjusted according to the error obtained at each repetition; the calculated error can be a performance score of the model on the test sub-sample, such as the mean squared error.
- the training sub-steps are therefore repeated a plurality of times, for example between 20 and 50 times, in particular 30 times, so as to make the hyper-parameters of the model converge to an optimal value as a function of the noisy training data set 22.
- this order of magnitude is given solely by way of example and depends in particular on the calculation time available for the training.
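The repeated 80/20 sub-sampling described above might be sketched as follows, with `train_and_score` standing in for one training sub-step that returns an error or score such as the mean squared error; 30 repetitions mirrors the order of magnitude given in the text.

```python
import random

def tune_by_repeated_split(data, train_and_score, repetitions=30, rng=None):
    """Repeat the 80/20 train/validation split and average the scores.

    `train_and_score(train_part, test_part)` is a hypothetical callback
    that trains the module on the first sub-sample and scores it on the
    second.
    """
    rng = rng or random.Random()
    scores = []
    for _ in range(repetitions):
        shuffled = data[:]
        rng.shuffle(shuffled)           # fresh random split each repetition
        cut = int(0.8 * len(shuffled))  # 80% training / 20% test sub-samples
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)
```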
- the performance of the machine learning module is calculated by implementing the trained module with the labeled test data set.
- This calculation step 13 can be refined, according to an alternative implementation, by also taking into account the number of false negatives, in other words the number of abnormal data wrongly considered as normal. This is possible only when the test data set also includes abnormal data; it is therefore advantageous to have a test data set that also includes examples of anomalous data.
- the performance can then be estimated as an average of the false-negative and false-positive rates, possibly weighted according to the importance given to each type of error.
- method 1 can be restarted at steps 10-10” by generating new noisy copies, for example from noise threshold values with a refined granularity around the noise threshold that yielded the maximum performance during the determination step, in order to seek an even better maximum value.
- method 1' differs from the previous method 1 in that it is implemented incrementally, and not according to the grid search mode implemented previously.
- the incremental method described below is an example of iterative implementation of the invention.
- the invention is not limited to this single incremental method and can be adapted to other methods, such as a dichotomous approach or a heuristic method to perform an iterative search for optimum.
- the method 1′ first proceeds to a generation step 10 of a first noisy copy 21, from a first noise threshold value.
- the noise threshold value is, during the initialization of the process 1′, initialized to a minimum value, which corresponds to the smallest added noise. At each iteration of method 1′, the noise threshold is incremented by one step, determined according to the desired granularity.
- the noise threshold values, in other words the noise amplitude, can be freely adapted by those skilled in the art.
- the noise threshold value of the previous iteration was then optimal, and one can proceed to the implementation step 15 of the machine learning module trained from the noisy copy obtained for the noise threshold value of the previous iteration.
- when it is determined during the determination step 14' that a performance maximum has been found, it may be a local maximum and not the optimal performance maximum. Also, according to a particular implementation of this second embodiment, it is possible to carry out a plurality of new iterations, for example for a predetermined number of iterations, in order to check whether a new improvement in performance can be obtained in subsequent iterations. This prevents the process from stopping at a local maximum.
- the invention described here makes it possible to obtain an anomaly detection method implementing unsupervised automatic learning modules having an operation optimized by the addition of controlled noise in the training data.
- the invention can also be used for classification purposes.
- the invention may relate to a classification method, for example material classification, to be classified in a predetermined list of materials, such as a list of plastic resins.
- each material is associated with at least one measurement parameter, such as an absorption spectrum, as explained above.
- the classification process then implements, for each material from the predetermined list of materials, an anomaly detection process 1, 1' as described above. Through the implementation of several anomaly detection processes, this allows a relatively efficient and robust classification to be performed.
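The classification scheme built from several per-material anomaly detectors could look like this; the `is_anomaly` interface is a hypothetical convention, not an API defined by the patent.

```python
def classify(measurement, detectors):
    """Classify a measurement (e.g. an absorption spectrum) among a
    predetermined list of materials by running one anomaly detector
    per material: the measurement is assigned to every material whose
    detector considers it normal.

    `detectors` maps each material name to an object exposing
    `is_anomaly(measurement)`.
    """
    return [material for material, detector in detectors.items()
            if not detector.is_anomaly(measurement)]
```

An empty result means the measurement is anomalous for every material in the list, e.g. a non-recyclable resin on a polyethylene sorting chain.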
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
- Image Analysis (AREA)
- Testing And Monitoring For Control Systems (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR2007444A FR3112634B1 (fr) | 2020-07-16 | 2020-07-16 | Procédé de détection d’anomalies |
PCT/FR2021/051321 WO2022013503A1 (fr) | 2020-07-16 | 2021-07-15 | Génération de copies de données d'entraînement bruitées dans un procédé de détection d'anomalies |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4182859A1 true EP4182859A1 (fr) | 2023-05-24 |
Family
ID=72885727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP21755801.4A Pending EP4182859A1 (fr) | 2020-07-16 | 2021-07-15 | Génération de copies de données d'entraînement bruitées dans un procédé de détection d'anomalies |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP4182859A1 (fr) |
FR (1) | FR3112634B1 (fr) |
WO (1) | WO2022013503A1 (fr) |
ZA (1) | ZA202301444B (fr) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6782679B2 (ja) * | 2016-12-06 | 2020-11-11 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America | 情報処理装置、情報処理方法及びプログラム |
-
2020
- 2020-07-16 FR FR2007444A patent/FR3112634B1/fr active Active
-
2021
- 2021-07-15 EP EP21755801.4A patent/EP4182859A1/fr active Pending
- 2021-07-15 WO PCT/FR2021/051321 patent/WO2022013503A1/fr unknown
-
2023
- 2023-02-03 ZA ZA2023/01444A patent/ZA202301444B/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2022013503A1 (fr) | 2022-01-20 |
FR3112634B1 (fr) | 2023-04-28 |
ZA202301444B (en) | 2023-10-25 |
FR3112634A1 (fr) | 2022-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tao et al. | Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification | |
González et al. | Validation methods for plankton image classification systems | |
EP3238137B1 (fr) | Representation semantique du contenu d'une image | |
Sikder et al. | Outlier detection using AI: a survey | |
FR2746530A1 (fr) | Procede et systeme pour selectionner des vecteurs d'adaptation pour la reconnaissance de formes | |
CN113609569B (zh) | 一种判别式的广义零样本学习故障诊断方法 | |
EP3846087A1 (fr) | Procede et systeme de selection d'un modele d'apprentissage au sein d'une pluralite de modeles d'apprentissage | |
WO2016012972A1 (fr) | Procede pour detecter des anomalies dans un reseau de distribution, en particulier d'eau potable | |
Wei et al. | Minority-prediction-probability-based oversampling technique for imbalanced learning | |
Almeida et al. | An integrated approach based on the correction of imbalanced small datasets and the application of machine learning algorithms to predict total phosphorus concentration in rivers | |
Kang et al. | Product failure detection for production lines using a data-driven model | |
EP4182859A1 (fr) | Génération de copies de données d'entraînement bruitées dans un procédé de détection d'anomalies | |
Kennedy et al. | Synthesizing class labels for highly imbalanced credit card fraud detection data | |
Chen et al. | Gearbox fault diagnosis using convolutional neural networks and support vector machines | |
EP2766825B1 (fr) | Systeme et procede non supervise d'analyse et de structuration thematique multi resolution de flux audio | |
US20220398494A1 (en) | Machine Learning Systems and Methods For Dual Network Multi-Class Classification | |
FR2899359A1 (fr) | Procede utilisant la multi-resolution des images pour la reconnaissance optique d'envois postaux | |
Dongre et al. | Stream data classification and adapting to gradual concept drift | |
Wood | Real-time monitoring and optimization of drilling performance using artificial intelligence techniques: a review | |
Fop et al. | Unobserved classes and extra variables in high-dimensional discriminant analysis | |
Malhotra et al. | Simplify Your Neural Networks: An Empirical Study on Cross-Project Defect Prediction | |
Razoqi et al. | A Survey Study on Proposed Solutions for Imbalanced Big Data | |
Nikovski et al. | Regularized covariance matrix estimation with high dimensional data for supervised anomaly detection problems | |
WO2019211367A1 (fr) | Procede de generation automatique de reseaux de neurones artificiels et procede d'evaluation d'un risque associe | |
WO2022135972A1 (fr) | Procede et dispositif de diagnostic d'anomalies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20230209 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230705 |
|
DAX | Request for extension of the european patent (deleted) | ||
RAV | Requested validation state of the european patent: fee paid |
Extension state: MA Effective date: 20230209 |