WO2019174419A1 - Method and apparatus for predicting abnormal samples - Google Patents

Method and apparatus for predicting abnormal samples

Info

Publication number
WO2019174419A1
WO2019174419A1 · PCT/CN2019/073411 · CN2019073411W
Authority
WO
WIPO (PCT)
Prior art keywords
sample
processing
samples
dimensionality reduction
dimension
Prior art date
Application number
PCT/CN2019/073411
Other languages
English (en)
French (fr)
Inventor
张雅淋
李龙飞
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority to SG11202005823VA priority Critical patent/SG11202005823VA/en
Publication of WO2019174419A1 publication Critical patent/WO2019174419A1/zh
Priority to US16/888,575 priority patent/US11222046B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • One or more embodiments of the present specification relate to the field of sample classification using a computer, and more particularly to a method and apparatus for predicting abnormal samples.
  • machine learning requires large amounts of data to train a model, especially in supervised learning, where samples of known categories (labeled samples) are needed to train and tune a classifier before it can be used to classify unknown samples.
  • abnormal samples are often difficult to collect and calibrate.
  • the anomaly samples themselves are usually a small number.
  • the anomaly samples are often well hidden and difficult to discover. For example, abnormal access data is often hard to detect. As a result, the number of abnormal samples that can be acquired and labeled is small, which makes supervised learning difficult.
  • One or more embodiments of the present specification describe a method and apparatus that can effectively predict unknown samples even when only a set of normal historical samples is available and the sample dimensionality is high.
  • a method of predicting an abnormal sample comprising:
  • sample to be tested includes feature data having a first number of dimensions
  • dimensionality reduction processing is performed on the sample to be tested using a plurality of dimensionality reduction methods to obtain a plurality of processed samples, wherein the i-th dimensionality reduction method Pi of the plurality of dimensionality reduction methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
  • the foregoing multiple dimensionality reduction methods include at least one of a computational dimensionality reduction method and a feature sampling dimensionality reduction method.
  • the above computational dimensionality reduction methods include one or more of the following: the principal component analysis (PCA) method, the least absolute shrinkage and selection operator (LASSO) method, the linear discriminant analysis (LDA) method, and the wavelet analysis method.
  • the feature sampling dimensionality reduction methods include one or more of the following: a random sampling method, a hash sampling method, a filter feature selection method, and a wrapper feature selection method.
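As an illustrative sketch (not taken from this patent), the two families of reduction named above can be contrasted in a few lines; the library, dimensions, and sample counts here are assumptions chosen only for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # 500 samples with 50 feature dimensions

# Computational reduction: PCA builds new features as linear
# combinations of all original features.
X_pca = PCA(n_components=10).fit_transform(X)

# Feature-sampling reduction: keep a random subset of the
# original columns unchanged.
cols = rng.choice(X.shape[1], size=15, replace=False)
X_sub = X[:, cols]

print(X_pca.shape, X_sub.shape)  # (500, 10) (500, 15)
```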
  • the i-th processing model Mi is trained by: acquiring a historical sample set known to be normal, the sample dimension of which is the first number; processing the historical sample set with the i-th dimensionality reduction method Pi into a low-dimensional historical sample set Li of sample dimension Di; and using the support vector domain description (SVDD) model to determine a hypersphere Qi in the space of dimension Di, such that the relationship between the number of samples of the low-dimensional historical sample set Li enclosed by the hypersphere Qi and the radius of the hypersphere satisfies a predetermined condition.
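A minimal sketch of this training loop, under the assumption that scikit-learn's OneClassSVM (a closely related one-class boundary model) stands in for SVDD, which scikit-learn does not provide directly; the reducers, dimensions Di, and nu value are illustrative choices, not the patent's parameters:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
H = rng.normal(size=(1000, 40))  # historical sample set, all known-normal

models = []
for n_dims in (5, 10, 20):  # stand-ins for the reduction methods P1..PN
    Pi = PCA(n_components=n_dims).fit(H)   # reducer learned on H
    Li = Pi.transform(H)                   # low-dimensional sample set Li
    # OneClassSVM also fits a boundary around a single class of
    # training data, so it serves here as a stand-in for SVDD.
    Mi = OneClassSVM(nu=0.05, kernel="rbf").fit(Li)
    models.append((Pi, Mi))

print(len(models))  # 3
```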
  • scoring the corresponding processed sample Si comprises: determining the relative position of the processed sample Si to the hypersphere Qi in the corresponding dimensional space, and determining the score of the processed sample Si according to that relative position.
  • the relative position comprises one of the following: whether the processed sample Si lies inside or on the hypersphere Qi; the distance of the processed sample Si in the corresponding dimensional space from the center of the hypersphere Qi; or the distance of the processed sample Si in the corresponding dimensional space from the closest point on the surface of the hypersphere Qi.
  • determining the comprehensive score of the sample to be tested comprises weighting and summing the scores of the respective processed samples to obtain the comprehensive score.
  • an apparatus for predicting an abnormal sample comprising:
  • An acquiring unit configured to acquire a sample to be tested, where the sample to be tested includes feature data having a first number of dimensions
  • a plurality of dimensionality reduction units configured to respectively perform dimensionality reduction processing on the sample to be tested to obtain a plurality of processed samples, wherein the i-th dimensionality reduction unit uses the dimensionality reduction method Pi to process the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
  • a plurality of scoring units configured to score the plurality of processed samples using a plurality of processing models, wherein the i-th processing model Mi of the plurality of processing models scores the corresponding processed sample Si based on a hypersphere Qi determined in advance, using the support vector domain description (SVDD) model, in the space of the corresponding dimension Di;
  • a synthesizing unit configured to determine a comprehensive score of the sample to be tested according to the score of each processed sample;
  • the determining unit is configured to determine, according to the comprehensive score, whether the sample to be tested is an abnormal sample.
  • a computer readable storage medium having stored thereon a computer program for causing a computer to perform the method of the first aspect when the computer program is executed in a computer.
  • a computing device comprising a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, implementing the method of the first aspect .
  • multiple dimensionality reduction methods are used to reduce the dimension of the sample to be tested; then, each of the resulting processed samples is scored against a hypersphere established using the SVDD model; finally, based on the combined result of the multiple scores, it is determined whether the sample to be tested is abnormal. Because a variety of different dimensionality reduction methods are used, the features obtained by each method complement one another, minimizing the information loss caused by dimensionality reduction. At the same time, thanks to the dimensionality reduction, applying the SVDD model becomes practical and feasible, avoiding the computational obstacles caused by the "dimension explosion". On this basis, by considering the results of all the SVDD models together, a comprehensive evaluation and accurate prediction can be performed on the sample to be tested.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification
  • FIG. 2A shows a schematic diagram of an SVM model
  • FIG. 2B shows a schematic diagram of an SVDD model
  • Figure 3 illustrates a schematic diagram of establishing a predictive model, in accordance with one embodiment
  • FIG. 4 illustrates a flowchart of a method of predicting anomalous samples, in accordance with one embodiment
  • FIG. 5 illustrates a process diagram of predicting anomalous samples, according to one embodiment
  • Figure 6 shows a schematic block diagram of an apparatus for predicting anomalous samples, according to one embodiment.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification.
  • a computing platform 100, such as an Alipay server, trains a predictive model using the support vector domain description (SVDD) model based on a set of normal historical samples (e.g., a set of normal historical transaction samples).
  • specifically, the computing platform 100 uses multiple dimensionality reduction methods to obtain multiple dimensionality-reduced sample sets, and then learns each reduced sample set with the SVDD model, obtaining multiple processing models.
  • these processing models can be considered sub-models of the prediction model, and their combination constitutes the prediction model.
  • in the prediction phase, the sample to be tested is reduced with the same dimensionality reduction methods, one for each sub-model of the prediction model; each sub-model scores the sample to be tested, and the combined result of the scores is used to predict whether the sample to be tested is an abnormal sample.
  • embodiments of the present specification employ the support vector domain description (SVDD) method to build a model based only on normal historical samples. The processing of the SVDD method is described below.
  • support vector domain description (SVDD) is a model developed from the support vector machine (SVM) concept.
  • the support vector machine SVM model is still a supervised learning model, and the samples need to be calibrated to different classes in advance.
  • the sample vector is first mapped into a high-dimensional space, and a maximum interval hyperplane is established in this space to separate samples of different classes.
  • Figure 2A shows a schematic diagram of an SVM model.
  • in FIG. 2A, two-class classification in the two-dimensional case is illustrated as an example. As shown in FIG. 2A, two mutually parallel hyperplanes are formed on either side of the hyperplane separating the sample data (the two classes of samples are shown as circles and crosses, respectively), and the distance from these two planes to the separating hyperplane is maximized.
  • the sample points lying on these two parallel hyperplanes are called support vectors.
  • FIG. 2B shows a schematic of the SVDD model.
  • the main idea of the SVDD model is, given a number of samples of a single class (for example, normal historical samples), to map the samples into a high-dimensional space and then attempt to construct an "ellipsoid" in that space that is as small as possible while enclosing as many samples as possible. It can be understood that the notion of an "ellipsoid" is used only for convenience of description.
  • the "ellipsoid" actually corresponds to a hypersphere in the high-dimensional space.
  • the goal of the SVDD model is to find, for the known normal samples x_i, a minimal hypersphere with center a and radius R in the corresponding space, such that the number of sample points falling inside the sphere (distance to the center less than R) is as large as possible; that is, to construct the objective function F:

    F(R, a) = R² + C·Σ_i ξ_i                          (1)

    subject to ‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0        (2)

    where the ξ_i are slack variables allowing some samples to fall outside the sphere.
  • this requires that the hypersphere radius R be as small as possible (the first condition) while the hypersphere encloses as many sample points as possible (the second condition), which are two mathematically contradictory conditions. The parameter C in the formula weighs the first condition against the second: with a larger C, the model is more inclined to find a hypersphere covering more sample points; with a smaller C, it is more inclined to find a smaller hypersphere.
  • the determined hypersphere can then be used to predict an unknown sample: if the unknown sample falls inside the hypersphere, it is with greater probability a normal sample; if it falls outside the hypersphere, it is a potential abnormal sample.
  • although the SVDD model can be trained on a single class of sample data, for example only normal samples, when the dimensionality of the sample data is large a "dimension explosion" is prone to occur and computational efficiency cannot meet requirements. On the other hand, if the sample data is simply reduced in dimension, some useful information is lost, making the training results inaccurate. Therefore, in the embodiments of this specification, multiple dimensionality reduction methods are applied in parallel so that they complement one another; the SVDD model is then applied to each set of reduced sample data to obtain multiple sub-models, and the multiple sub-models are combined into a prediction model.
  • FIG. 3 illustrates a schematic diagram of establishing a predictive model, in accordance with one embodiment.
  • a historical sample set H known to be normal is first acquired, including a plurality of historical samples that are known to be normal.
  • these historical samples have a higher dimension.
  • the feature data involved in a transaction sample may include buyer information, seller information, transaction object information, transaction time, transaction location, transaction records related to the transaction, and so on, each of which may be further refined; thus, in some cases, a transaction sample includes feature data of hundreds or even thousands of dimensions.
  • the feature data of an access sample may include the access initiator's network address, personal information, initiation time, initiation location, the network address of the access target, related access records, and so on; thus, the dimensionality of a normal access sample is also usually in the hundreds to thousands.
  • a plurality of (N) dimensionality reduction methods (P1, P2, ..., Pi, ..., PN) are used to respectively reduce the dimensionality of the sample set H, obtaining a plurality of (N) low-dimensional historical sample sets (L1, L2, ..., Li, ..., LN).
  • the i-th of the N dimensionality reduction methods, Pi, performs dimensionality reduction on the high-dimensional historical sample set H to obtain a low-dimensional historical sample set Li of dimension Di.
  • the dimensions of the N low-dimensional historical sample sets obtained by the N dimensionality reduction methods may be different, but are smaller than the dimensions of the original samples.
  • the above dimensionality reduction can employ various dimensionality reduction algorithms, both currently known ones and ones that may be adopted in the future.
  • the plurality of dimensionality reduction methods described above include a computational dimensionality reduction method.
  • the computational dimension reduction method is a linear or non-linear operation on the feature data in the original high-dimensional sample to obtain a processed sample with reduced dimensions.
  • a feature value in the processed sample does not directly correspond to a single feature in the original sample, but is the result of an operation jointly involving multiple features of the original sample.
  • the computational dimensionality reduction method includes a principal component analysis PCA (Principal Component Analysis) method.
  • the principal component analysis (PCA) method transforms the original n-dimensional data, by a linear orthogonal transformation, into a set of representations whose dimensions are linearly uncorrelated.
  • the first principal component has the largest variance;
  • each subsequent component has the largest variance subject to being orthogonal to the preceding components.
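The variance-ordering property just described can be checked empirically; this sketch assumes scikit-learn's PCA and synthetic data with known per-axis variances:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic data whose axes have clearly decreasing variance.
X = rng.normal(size=(1000, 4)) * np.array([10.0, 5.0, 2.0, 1.0])

pca = PCA(n_components=4).fit(X)
var = pca.explained_variance_

# Components come out ordered by variance, largest first, and each
# component is orthogonal to the ones before it.
assert all(var[i] >= var[i + 1] for i in range(3))
print(var.shape)  # (4,)
```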
  • the computational dimensionality reduction method includes a Least absolute shrinkage and selection operator (LASSO) method.
  • the method is a compressed estimation whose basic idea is to minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant.
  • some transformation operations in the mathematical wavelet analysis process can eliminate some interference data and also reduce the dimensionality, so it can also be used as a computational dimension reduction method.
  • other examples of computational dimensionality reduction include the linear discriminant analysis (LDA) method, Laplacian eigenmaps, matrix singular value decomposition (SVD), locally linear embedding (LLE), and the like.
  • the plurality of dimensionality reduction methods described above may further include a feature sampling method, or feature selection.
  • the feature sampling method selects some of the feature data of the original high-dimensional samples for sampling, that is, forms a feature subset, and the feature subset constitutes a processed sample with reduced dimensions.
  • the feature values in the processed sample may correspond directly to a feature in the original sample. It can be understood that a variety of sampling methods can be used for feature sampling.
  • the original data is subjected to feature sampling using a random sampling method, that is, a part of the features are randomly selected from the original high-dimensional samples to constitute a processed sample.
  • in another example, feature sampling is performed using a hash sampling method: a hash operation is applied to the original high-dimensional sample, and the result of the hash operation determines which feature data are selected.
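One way such hash-based selection can work (a sketch, not the patent's algorithm) is to hash each feature's name and keep the feature when the hash falls below a threshold, which makes the chosen subset deterministic across training and prediction:

```python
import hashlib

def hash_select(feature_names, keep_ratio=0.3):
    """Keep a feature when the hash of its name falls below keep_ratio.

    Deterministic: the same feature set always yields the same subset,
    so train-time and predict-time reductions stay consistent.
    """
    selected = []
    for name in feature_names:
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        if (h % 1000) / 1000.0 < keep_ratio:
            selected.append(name)
    return selected

names = [f"f{i}" for i in range(1000)]
subset = hash_select(names, keep_ratio=0.3)
# About keep_ratio of the features survive, chosen deterministically.
print(len(subset))
```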
  • when a feature evaluation function is introduced, feature sampling can further be divided into filter feature selection methods and wrapper feature selection methods.
  • the filter feature selection method generally does not rely on a specific learning algorithm to evaluate feature subsets; instead, it evaluates the predictive ability of each feature according to the intrinsic characteristics of the data set, thereby finding several top-ranked feature subsets with strong predictive ability.
  • such methods consider that the optimal feature subset is composed of several features with strong predictive power.
  • the wrapper feature selection method, also known as the encapsulated feature selection method, embeds the subsequent learning algorithm in the feature selection process and judges the quality of a feature subset by testing its prediction performance on that algorithm, paying little attention to the predictive performance of any single feature within the subset.
  • as a specific example, suppose each historical sample has 1000 dimensions. The first dimensionality reduction method may use PCA to obtain a 100-dimensional first low-dimensional sample set; the second may use random sampling to obtain a 300-dimensional second low-dimensional sample set; the third may use hash sampling to obtain a 200-dimensional third low-dimensional sample set, and so on.
  • in other examples, the number of dimensionality reduction methods (N) may be larger or smaller, the specific methods may differ from the above, and the reduction ratios may also be different.
  • the N low-dimensional historical sample sets obtained by dimensionality reduction can then each be learned using the SVDD model.
  • for each low-dimensional historical sample set Li, the support vector domain description (SVDD) model is used to determine a hypersphere Qi in the space of dimension Di, such that the relationship between the number of low-dimensional historical samples enclosed by the hypersphere Qi and the radius of the hypersphere satisfies a predetermined condition.
  • specifically, the hypersphere Qi is determined by the previous formula (1) and formula (2): it is desired that the hypersphere Qi enclose as many low-dimensional historical samples as possible while having as small a radius as possible.
  • the relationship between the number of low-dimensional history samples enclosed and the radius of the hypersphere is set by the parameter C therein.
  • a corresponding hypersphere Qi is established for each low-dimensional historical sample set Li.
  • the hypersphere Qi thus determined encompasses most, but not necessarily all, of the low-dimensional historical samples in the corresponding dimensional space, so some normal samples may still fall outside the hypersphere.
  • in one embodiment, the distance distribution of all samples in the normal historical sample set relative to the hypersphere Qi is also computed in each low-dimensional space; for example, the average distance and the maximum distance of all normal samples from the center of the hypersphere Qi are computed in each low-dimensional space. These statistical distances can be used to determine the decision thresholds required in the prediction phase.
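A sketch of this statistics step, assuming distances are measured to a known hypersphere center in one reduced space; the data, dimension, and the 99th-percentile threshold choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
center = np.zeros(10)
L = rng.normal(size=(2000, 10))  # normal samples in one reduced space

# Distance of every normal training sample to the hypersphere center.
d = np.linalg.norm(L - center, axis=1)
mean_d, max_d = d.mean(), d.max()

# One way to turn these statistics into a decision threshold:
# flag a new sample whose distance exceeds the 99th percentile.
threshold = np.percentile(d, 99)
print(mean_d < threshold < max_d)  # True
```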
  • in the manner described above, the SVDD model is used to learn from the normal historical sample set: multiple dimensionality reduction methods reduce the dimensionality of the historical sample set, and for each reduced sample set a corresponding hypersphere is established with the SVDD model. Because different dimensionality reduction methods are used, the features they retain complement one another, minimizing the information loss from dimensionality reduction, while the reduction itself makes applying SVDD practical and feasible, avoiding the computational obstacles of the "dimension explosion".
  • the model established by the above method can be used to predict unknown samples.
  • the method for predicting an abnormal sample in this embodiment includes the following steps. Step 41: acquire a sample to be tested, which includes feature data of a first number of dimensions. Step 42: perform dimensionality reduction on the sample to be tested with multiple dimensionality reduction methods, respectively, to obtain multiple processed samples, wherein the i-th dimensionality reduction method Pi processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number. Step 43: input the processed samples respectively into multiple processing models to obtain a score for each processed sample, wherein the i-th processing model Mi scores the corresponding processed sample Si based on a hypersphere Qi determined in advance, using the support vector domain description (SVDD) model, in the space of the corresponding dimension. Step 44: determine a comprehensive score of the sample to be tested according to the scores of the processed samples. Step 45: determine, according to the comprehensive score, whether the sample to be tested is an abnormal sample.
  • the sample T to be tested is acquired.
  • the sample to be tested is a sample of unknown classification and, like the historical samples used for model training, is high-dimensional; more specifically, it has the same dimensionality as the historical samples in the aforementioned historical sample set. That number of dimensions is referred to herein as the first number.
  • in step 42, a plurality of dimensionality reduction methods are applied to the sample T to be tested, respectively, to obtain a plurality of processed samples.
  • the multiple dimensionality reduction methods herein are consistent with the multiple dimensionality reduction methods in the training phase.
  • the i-th dimensionality reduction method Pi of the plurality of dimensionality reduction methods processes the sample T to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number.
  • the dimensions of the plurality of processing samples respectively obtained by the plurality of dimensionality reduction methods may be different, but are smaller than the original dimensions (first number) of the samples to be tested.
  • the dimensionality reduction method can employ various known and later possible dimensionality reduction algorithms, including, for example, computational dimensionality reduction methods and feature sampling methods.
  • the computational dimensionality reduction methods further include one or more of the following: the principal component analysis (PCA) method, the least absolute shrinkage and selection operator (LASSO) method, the wavelet analysis method, the linear discriminant analysis (LDA) method, Laplacian eigenmaps, matrix singular value decomposition (SVD), locally linear embedding (LLE), and the like.
  • the feature sampling methods further include one or more of the following: a random sampling method, a hash sampling method, a filter feature selection method, a wrapper feature selection method, and the like.
  • in step 43, the plurality of processed samples are respectively input into the plurality of processing models to obtain a score for each processed sample, wherein the i-th processing model Mi scores the corresponding processed sample Si based on the hypersphere Qi previously determined in the corresponding dimensional space using the support vector domain description (SVDD) model.
  • as described above, in the training phase the SVDD model has been used to determine a corresponding hypersphere Qi in each reduced space; the hypersphere Qi has as small a radius as possible while enclosing as many reduced historical samples as possible.
  • the hypersphere Qi can therefore be used to judge or predict the likelihood that a currently input processed sample of the same dimension is an abnormal sample. In particular, in one embodiment, this likelihood is measured by a score.
  • the process by which the processing model Mi scores the corresponding processed sample Si may include: determining the positional relationship of the processed sample Si to the hypersphere Qi in the corresponding dimensional space, and determining the score of the processed sample Si according to that positional relationship.
  • in one embodiment, the positional relationship is whether the processed sample Si lies inside, on, or outside the hypersphere Qi.
  • the processing model Mi can be set such that if the processed sample Si lies within the hypersphere Qi in the corresponding (Di-dimensional) space, it is scored 1; if the processed sample Si lies exactly on the surface of the hypersphere Qi (analogous to a support vector), it is scored 0.5; and if the processed sample Si lies outside the hypersphere, it is scored 0.
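That three-way rule can be written directly; this sketch assumes the distance d from the sample to the hypersphere center and the radius R are already available:

```python
def position_score(d, R, tol=1e-9):
    """Score a reduced sample from its distance d to the hypersphere
    center and the sphere radius R: 1 inside, 0.5 on the surface
    (within tolerance), 0 outside."""
    if abs(d - R) <= tol:
        return 0.5
    return 1.0 if d < R else 0.0

print(position_score(0.4, 1.0), position_score(1.0, 1.0), position_score(1.7, 1.0))
# 1.0 0.5 0.0
```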
  • a higher score means that the processed sample Si has a greater likelihood of corresponding to a normal sample.
  • the above positional relationship may be the distance from the hyperspherical surface Qi.
  • the processing model Mi scores the processed sample Si based on the distance d of the processed sample Si from the center of the hyperspherical Qi in the corresponding dimensional space.
  • the processed sample Si and the center of the hypersphere each correspond to a point in the space of dimension Di, so the distance d can be computed as the distance between two points in the multi-dimensional space.
  • the score Gi of the processed sample Si can be calculated according to the following formula:
  • Gi takes a value between 0 and 1, and the smaller the distance between the processed sample Si and the center of the hypersphere Qi, that is, the closer to the center of the hypersphere, the smaller d is, the larger the Gi value is; The greater the distance from the center of the hyperspherical Qi, that is, the farther away from the center of the hypersphere, the larger d is, the smaller the Gi value is.
  • a larger Gi score means that the processed sample Si has a higher probability of corresponding to the normal sample.
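The distance-based rule of formula (3) can be sketched directly; the helper name is hypothetical and the center is assumed to come from a trained SVDD model.

```python
import numpy as np

def distance_score(x, center):
    """Score Gi = exp(-d), where d is the distance from the processed
    sample to the hypersphere center (formula (3)); a sample closer to
    the center gets a score closer to 1."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(center, float))
    return float(np.exp(-d))
```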
  • in another embodiment, the distance between the processed sample Si and the hypersphere is calculated in a different manner.
  • for example, under some SVDD models the obtained hypersphere Qi does not correspond to an isotropic sphere, but has different "radii" in different directions, similar to an ellipsoid in three-dimensional space. In such a case it may first be judged whether Si lies inside Qi (giving Gi = 1 if so); otherwise, the distance d from Si to the nearest surface of Qi in the corresponding dimensional space is computed, and Gi is calculated from d using formula (3).
  • the above are merely specific examples of scoring a processed sample based on a predetermined hypersphere.
  • having read these examples, those skilled in the art may modify, replace, combine, or extend them to adopt further scoring methods, all of which fall within the concept of this specification.
  • in step 43, the plurality of processing models respectively score the processed samples of their corresponding dimensions; the scoring algorithms used by the individual processing models may be the same or different.
  • after each processing model Mi has scored its processed sample Si in step 43, in step 44 the composite score of the sample to be tested is determined based on the scores of the individual processed samples.
  • in one embodiment, the scores of the individual processed samples are summed directly, and the sum is taken as the composite score.
  • in another embodiment, each processing model's score is assigned a weight in advance, and the scores of the processed samples are weighted and summed according to the assigned weights to obtain the composite score.
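The two aggregation variants (plain sum and pre-weighted sum) can be sketched in one small helper; the function name and the equal-weight default are assumptions for illustration.

```python
def combined_score(scores, weights=None):
    """Aggregate per-model scores G1..GN into a composite score:
    a plain sum when no weights are given, otherwise a weighted sum
    using the weights pre-assigned to each processing model."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * g for w, g in zip(weights, scores))
```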
  • in step 45, based on the composite score, it is determined whether the sample to be tested is an abnormal sample.
  • in one embodiment, the composite score determined in step 44 is compared with a predetermined threshold, and based on the comparison result it is determined whether the sample to be tested is an abnormal sample.
  • the determination of the threshold is related to the scoring algorithm of step 43. For example, when scoring according to formula (3), a higher composite score means that the sample to be tested is more likely to correspond to a normal sample.
  • the above threshold may be determined based on statistics from the sample learning phase.
  • in the sample learning phase, the distance distribution of all samples in the known-normal historical sample set relative to the hypersphere Qi in each low-dimensional space may also be computed.
  • based on these distance statistics, composite-score statistics are computed for the normal historical samples using a scoring algorithm consistent with step 43, and the decision threshold is determined from them. With a threshold so determined, step 45 can decide by simple threshold comparison whether the sample to be tested is an abnormal sample.
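One way to turn the normal-sample score statistics into a decision threshold is to take a low quantile of the composite scores observed on the known-normal history, so that samples scoring below it are flagged. The quantile choice is an assumption for illustration; the text only says the threshold is determined from composite-score statistics.

```python
import numpy as np

def decision_threshold(normal_scores, quantile=0.05):
    """Pick a threshold from composite scores computed on the
    known-normal historical samples (assumed quantile rule)."""
    return float(np.quantile(normal_scores, quantile))

def is_abnormal(score, threshold):
    """Under formula (3)-style scoring, higher means more likely
    normal, so scores below the threshold are predicted abnormal."""
    return score < threshold
```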
  • FIG. 5 illustrates a process of predicting anomalous samples according to one embodiment; the prediction process of FIG. 5 is implemented based on the prediction model established in FIG. 3.
  • as shown in FIG. 5, the high-dimensional sample T to be tested is first obtained.
  • the dimension of the sample T to be tested is the same as the sample dimension in the historical sample set H shown in FIG. 3.
  • next, the sample T to be tested is subjected to dimensionality reduction using multiple reduction methods to obtain a plurality of processed samples.
  • corresponding to FIG. 3, the multiple dimensionality reduction methods are specifically the first, second, third, and fourth dimensionality reduction methods; these methods respectively reduce the dimensionality of the original sample T to be tested, yielding the first, second, third, and fourth processed samples with dimensions D1, D2, D3, and D4, respectively.
  • the first, second, third, and fourth processed samples are then respectively input into the first, second, third, and fourth processing models to obtain a score for each processed sample.
  • specifically, the first processing model assigns score G1 to the first processed sample based on the hypersphere Q1 obtained in FIG. 3; the second processing model assigns score G2 to the second processed sample based on the hypersphere Q2; the third processing model assigns score G3 to the third processed sample based on the hypersphere Q3; and the fourth processing model assigns score G4 to the fourth processed sample based on the hypersphere Q4.
  • then, the composite score G of the sample to be tested is determined from the scores G1-G4 of the individual processed samples, and according to the composite score G it is determined whether the sample to be tested is an abnormal sample.
  • through the above process, the sample to be tested is reduced in dimension by multiple dimensionality reduction methods, and each of the resulting processed samples is scored based on the hypersphere established by an SVDD model.
  • whether the sample to be tested is abnormal is then decided from the combined result of the multiple scores. Because several different dimensionality reduction methods are used, the features obtained by each method complement one another, minimizing the information loss caused by dimensionality reduction. At the same time, the dimensionality reduction makes application of the SVDD model practical and feasible, avoiding the computational obstacles caused by a dimension "explosion". On this basis, by jointly considering the results of the individual SVDD models, the sample to be tested can be comprehensively evaluated and accurately predicted.
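The overall pipeline above can be sketched end to end. Everything here is an assumption-laden toy: N = 2 reduction methods (two fixed feature subsets standing in for the first/second reduction methods), a simplified "hypersphere" per reduced space whose center is just the mean of the known-normal history (a full SVDD fit would solve the quadratic program instead), exp(-d) scoring per model, and a plain sum as the composite score.

```python
import numpy as np

rng = np.random.default_rng(0)
history = rng.normal(0.0, 1.0, size=(200, 10))    # known-normal samples
reductions = [np.arange(0, 5), np.arange(5, 10)]  # two feature subsets

# One simplified "model" per reduced space: (feature indices, center).
centers = [(idx, history[:, idx].mean(axis=0)) for idx in reductions]

def composite_score(sample):
    """Sum of per-model scores Gi = exp(-d_i) for the sample,
    mirroring formula (3) plus the direct-sum aggregation."""
    sample = np.asarray(sample, float)
    return float(sum(np.exp(-np.linalg.norm(sample[idx] - c))
                     for idx, c in centers))
```

A point near both centers scores high; a point far outside both reduced spaces scores near zero, which is what the threshold comparison in step 45 exploits.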
  • FIG. 6 shows a schematic block diagram of an apparatus for predicting anomalous samples, according to one embodiment.
  • as shown in FIG. 6, the apparatus 600 includes: an acquisition unit 61 configured to acquire a sample to be tested, the sample to be tested including feature data whose dimension is a first number; a plurality of dimensionality reduction units 62 that respectively apply a plurality of dimensionality reduction methods to the sample to be tested to obtain a plurality of processed samples, wherein the i-th dimensionality reduction method Pi among the plurality of methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
  • a plurality of scoring units 63 configured to score the plurality of processed samples through a plurality of processing models, wherein the i-th processing model Mi among the plurality of processing models scores the corresponding processed sample Si based on the hypersphere Qi determined in advance in the space of dimension Di using the support vector domain description (SVDD) method; a synthesis unit 64 configured to determine the composite score of the sample to be tested from the scores of the individual processed samples; and a determination unit 65 configured to determine, according to the composite score, whether the sample to be tested is an abnormal sample.
  • in FIG. 6, the plurality of dimensionality reduction units 62 and the plurality of scoring units 63 are each schematically shown as three, but it can be understood that their numbers can be set as needed and are not limited to the illustration of FIG. 6.
  • the plurality of dimensionality reduction methods employed by the plurality of dimensionality reduction units 62 include at least one of a computational dimensionality reduction method and a feature-sampling dimensionality reduction method.
  • the computational dimensionality reduction methods include one or more of the following: the principal component analysis (PCA) method, the least absolute shrinkage and selection operator (LASSO) method, the linear discriminant analysis (LDA) method, and wavelet analysis methods.
  • the feature-sampling dimensionality reduction methods include one or more of the following: random sampling, hash sampling, filter-style feature selection, and wrapper-style feature selection.
  • the i-th processing model Mi is trained by an i-th training device (not shown), which includes: a sample-set acquisition module configured to acquire a historical sample set known to be normal, the sample dimension of the historical sample set being the first number; an i-th dimensionality reduction module configured to apply the i-th dimensionality reduction method Pi to process the historical sample set into a low-dimensional historical sample set Li with sample dimension Di; and a hypersphere determination module configured to determine, using the support vector domain description (SVDD) method, the hypersphere Qi in the space of dimension Di such that the relationship between the number of samples of the low-dimensional historical sample set Li enclosed by the hypersphere Qi and the radius of the hypersphere satisfies a predetermined condition.
  • the plurality of scoring units 63 are configured to: determine the relative position of the processed sample Si with respect to the hypersphere Qi in the space of the corresponding dimension Di; and determine the score of the processed sample Si according to that relative position.
  • the relative position includes one of the following:
  • whether the processed sample Si lies outside, inside, or on the hypersphere Qi; the distance of the processed sample Si from the center of the hypersphere Qi in the corresponding dimensional space; or the distance of the processed sample Si from the nearest surface of the hypersphere Qi in the corresponding dimensional space.
  • the synthesis unit 64 is configured to perform a weighted summation of the scores of the individual processed samples to obtain the composite score.
  • with the above apparatus, multiple dimensionality reduction methods are used to reduce the dimension of the sample to be tested, the methods complementing one another so as to avoid the information loss caused by dimensionality reduction.
  • the processed samples are then scored based on the hyperspheres established by the SVDD models, and finally whether the sample to be tested is abnormal is decided from the combined result of the multiple scores. By jointly considering the results of the individual SVDD models, the sample to be tested is comprehensively evaluated and accurately predicted.
  • according to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program that, when executed in a computer, causes the computer to perform the method described in connection with FIGS. 3 to 5.
  • according to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method described in connection with FIGS. 3 to 5.
  • those skilled in the art should appreciate that the functions described herein can be implemented in hardware, software, firmware, or any combination thereof.
  • when implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Algebra (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Development Economics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Technology Law (AREA)

Abstract

A method and apparatus for predicting abnormal samples. The method includes: first acquiring a sample to be tested (41); then applying multiple dimensionality reduction methods to the sample to be tested respectively to obtain multiple processed samples (42); next inputting the multiple processed samples into multiple processing models respectively to obtain a score for each processed sample, where the i-th processing model Mi scores the corresponding processed sample based on a hypersphere Qi determined by the SVDD method (43); then determining a composite score of the sample to be tested from the scores of the individual processed samples (44); and finally determining, according to the composite score, whether the sample to be tested is an abnormal sample (45). The method can more effectively predict whether an unknown sample is abnormal.

Description

Method and apparatus for predicting abnormal samples
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application claims priority to Chinese Patent Application No. 201810215700.0, filed on March 15, 2018 and entitled "Method and apparatus for predicting abnormal samples", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
One or more embodiments of this specification relate to the field of computer-based sample classification, and in particular to methods and apparatuses for predicting abnormal samples.
BACKGROUND
With the development of computer and Internet technology, large amounts of data and samples are generated. In many scenarios these data and samples need to be classified, for example to distinguish normal samples from abnormal ones. In payment and transaction services, for instance, normal transaction samples must often be distinguished from abnormal ones (e.g., cash-out or financial-fraud transactions) so as to better guard against payment risk. In the field of secure access, normal access data must often be distinguished from abnormal access data, where the abnormal data frequently originates from users attempting illegal access in order to intrude or obtain illegitimate data. Such abnormal access data is often highly harmful; identifying and predicting it, so that abnormal access can be blocked, is vital to data security.
With the rise of artificial intelligence and machine learning, more and more business scenarios introduce machine learning for data analysis, including sample classification and prediction. In general, machine learning requires large amounts of data to train a model; supervised learning in particular requires samples of known classes, i.e., labeled samples, to train and tune a classifier before it can classify unknown samples.
In many cases, however, abnormal samples are hard to collect and label. On the one hand, abnormal samples are usually fewer in number; on the other hand, they are often well hidden and hard to discover (abnormal access data, for example, is usually difficult to detect). The number of abnormal samples that can be obtained and identified is therefore small, which makes supervised learning difficult.
It is therefore desirable to have an improved solution that can predict abnormal samples more effectively.
SUMMARY
One or more embodiments of this specification describe a method and apparatus that can effectively predict unknown samples when only a set of normal historical samples is available and the sample dimensionality is high.
According to a first aspect, a method for predicting abnormal samples is provided, including:
acquiring a sample to be tested, the sample to be tested including feature data whose dimension is a first number;
applying a plurality of dimensionality reduction methods to the sample to be tested respectively to obtain a plurality of processed samples, wherein the i-th dimensionality reduction method Pi among the plurality of methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
inputting the plurality of processed samples into a plurality of processing models respectively to obtain a score for each processed sample, wherein the i-th processing model Mi among the plurality of processing models scores the corresponding processed sample Si based on a hypersphere Qi determined in advance in the space of dimension Di using the support vector domain description (SVDD) method;
determining a composite score of the sample to be tested from the scores of the individual processed samples; and
determining, according to the composite score, whether the sample to be tested is an abnormal sample.
In a possible solution, the plurality of dimensionality reduction methods include at least one of a computational dimensionality reduction method and a feature-sampling dimensionality reduction method.
In one embodiment, the computational dimensionality reduction methods include one or more of the following: the principal component analysis (PCA) method, the least absolute shrinkage and selection operator (LASSO) method, the linear discriminant analysis (LDA) method, and wavelet analysis methods.
In one embodiment, the feature-sampling dimensionality reduction methods include one or more of the following: random sampling, hash sampling, filter-style feature selection, and wrapper-style feature selection.
According to one embodiment, the i-th processing model Mi is trained by the following steps: acquiring a historical sample set known to be normal, the sample dimension of the historical sample set being the first number; applying the i-th dimensionality reduction method Pi to process the historical sample set into a low-dimensional historical sample set Li with sample dimension Di; and determining, using the support vector domain description (SVDD) method, the hypersphere Qi in the space of dimension Di such that the relationship between the number of samples of the low-dimensional historical sample set Li enclosed by the hypersphere Qi and the radius of the hypersphere satisfies a predetermined condition.
According to one implementation, scoring the corresponding processed sample Si includes: determining the relative position of the processed sample Si with respect to the hypersphere Qi in the corresponding dimensional space; and determining the score of the processed sample Si according to the relative position.
According to possible implementations, the relative position includes one of the following: whether the processed sample Si lies outside, inside, or on the hypersphere Qi; the distance of the processed sample Si from the center of the hypersphere Qi in the corresponding dimensional space; or the distance of the processed sample Si from the nearest surface of the hypersphere Qi in the corresponding dimensional space.
In one embodiment, determining the composite score of the sample to be tested includes: performing a weighted summation of the scores of the individual processed samples to obtain the composite score.
According to a second aspect, an apparatus for predicting abnormal samples is provided, including:
an acquisition unit configured to acquire a sample to be tested, the sample to be tested including feature data whose dimension is a first number;
a plurality of dimensionality reduction units that respectively apply a plurality of dimensionality reduction methods to the sample to be tested to obtain a plurality of processed samples, wherein the i-th dimensionality reduction method Pi among the plurality of methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
a plurality of scoring units configured to score the plurality of processed samples through a plurality of processing models, wherein the i-th processing model Mi among the plurality of processing models scores the corresponding processed sample Si based on a hypersphere Qi determined in advance in the space of dimension Di using the support vector domain description (SVDD) method;
a synthesis unit configured to determine a composite score of the sample to be tested from the scores of the individual processed samples; and
a determination unit configured to determine, according to the composite score, whether the sample to be tested is an abnormal sample.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided by the embodiments of this specification, multiple dimensionality reduction methods are applied to the sample to be tested, the resulting processed samples are each scored based on the hypersphere established by an SVDD model, and finally whether the sample is abnormal is decided from the combined result of the multiple scores. Because several different dimensionality reduction methods are used, the features obtained by each method complement one another, minimizing the information loss caused by dimensionality reduction. At the same time, the dimensionality reduction makes application of the SVDD model practical and feasible, avoiding the computational obstacles caused by a dimension "explosion". On this basis, by jointly considering the results of the individual SVDD models, the sample to be tested can be comprehensively evaluated and accurately predicted.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
FIG. 2A is a schematic diagram of an SVM model;
FIG. 2B is a schematic diagram of an SVDD model;
FIG. 3 is a schematic diagram of building a prediction model according to one embodiment;
FIG. 4 is a flowchart of a method for predicting abnormal samples according to one embodiment;
FIG. 5 is a schematic diagram of a process of predicting abnormal samples according to one embodiment;
FIG. 6 is a schematic block diagram of an apparatus for predicting abnormal samples according to one embodiment.
DETAILED DESCRIPTION
The solutions provided in this specification are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. As shown in FIG. 1, a computing platform 100, for example an Alipay server, trains a prediction model using the support vector domain description (SVDD) method based on a set of normal historical samples (e.g., normal historical transaction samples). During training, to avoid the computational difficulty caused by excessively high sample dimensionality, the computing platform 100 applies multiple dimensionality reduction methods to each historical sample to obtain multiple reduced sample sets, and then learns from each reduced sample set with the SVDD method to obtain multiple processing models. These processing models can be regarded as sub-models of the prediction model, and their combination constitutes the prediction model. Thus, when an unknown sample to be tested is acquired, it is reduced in dimension in the same ways, input correspondingly to the sub-models of the prediction model, scored by each sub-model, and finally predicted to be abnormal or not according to the combined result of the scores. The specific implementation of these two phases is described below.
First, the building and training of the prediction model is described. As noted above, a major difficulty in training a prediction model by supervised learning over samples is that abnormal samples are hard to obtain and too few in number for supervised learning. For this reason, embodiments of this specification adopt the support vector domain description (SVDD) method to build a model based only on normal historical samples. The SVDD approach is described below.
Support vector domain description (SVDD) is a model developed from the idea of the support vector machine (SVM). The SVM model is a supervised learning model that requires samples labeled into different classes in advance. During training, sample vectors are first mapped into a high-dimensional space, in which a maximum-margin hyperplane is constructed to separate samples of different classes. FIG. 2A is a schematic diagram of an SVM model, illustrating binary classification in a two-dimensional case. As shown in FIG. 2A, two parallel hyperplanes (dashed lines) are built on either side of the hyperplane separating the sample data (the two classes are shown as circles and crosses), and the separating hyperplane maximizes the distance between these two parallel hyperplanes. The sample points lying on the two parallel hyperplanes are called support vectors.
Support vector domain description (SVDD), developed from the SVM model, can instead be trained on samples of a single given class. FIG. 2B is a schematic diagram of an SVDD model. As shown in FIG. 2B, the main idea of the SVDD model is: given a number of samples of one class, for example normal historical samples, map them into a high-dimensional space and then try to build an ellipsoid in that space that is as small as possible while enclosing as many samples as possible. It will be appreciated that the notion of an "ellipsoid" is merely for ease of description; in the high-dimensional space it actually corresponds to a hypersphere. In other words, the goal of the SVDD model is: for the known normal samples xᵢ, find in their space a minimal hypersphere with center a and radius R such that as many sample points as possible fall inside it (at distance less than R from the center), that is, construct the objective function F:
F(R, a, ξᵢ) = R² + C·Σᵢ ξᵢ    (1)
such that:
‖xᵢ − a‖² ≤ R² + ξᵢ, ξᵢ ≥ 0, for all i    (2)
It will be appreciated that wanting the hypersphere radius R to be as small as possible (the first condition) and wanting the hypersphere to enclose as many sample points as possible (the second condition) are mathematically contradictory. The parameter C in the above formulas is therefore set to weigh the two conditions against each other: with a larger C, the optimization favors a hypersphere that encloses more sample points; with a smaller C, it favors a smaller hypersphere. The determined hypersphere can be used to predict unknown samples: if an unknown sample falls inside the hypersphere, it is most likely a normal sample; if it lies outside the hypersphere, it is a potential abnormal sample.
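Solving the SVDD objective exactly requires a quadratic program. As an illustrative stand-in only (not the patent's method), the same small-radius/most-points trade-off can be approximated by placing the center at the sample mean and choosing the radius as a quantile of the distances, with the `keep` fraction playing roughly the role of the trade-off parameter C:

```python
import numpy as np

def fit_soft_hypersphere(X, keep=0.95):
    """Simplified stand-in for the SVDD optimization: center = sample
    mean, radius = `keep`-quantile of distances, so most normal samples
    fall inside while the radius stays small."""
    X = np.asarray(X, float)
    center = X.mean(axis=0)
    d = np.linalg.norm(X - center, axis=1)
    radius = float(np.quantile(d, keep))
    return center, radius
```

A proper SVDD fit would additionally yield support vectors on the sphere surface; this sketch only reproduces the geometric picture used in the rest of the description.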
Although the SVDD model can be built from a single type of sample data, for example normal samples only, when the dimensionality of the sample data is large a "dimension explosion" easily occurs and computational efficiency becomes unacceptable; yet reducing the dimensionality of the sample data loses some useful information and makes the training result less accurate. Therefore, the embodiments of this specification innovatively apply multiple dimensionality reduction methods in parallel, complementing one another, and then apply the SVDD method separately to the reduced sample data to obtain multiple sub-models, which together constitute the prediction model.
FIG. 3 is a schematic diagram of building a prediction model according to one embodiment. As shown in FIG. 3, a historical sample set H known to be normal is first acquired, including multiple historical samples known to be normal. In general, these historical samples have high dimensionality. For example, the feature data involved in a transaction sample may include buyer information, seller information, information on the object of the transaction, transaction time, transaction location, transaction records related to the transaction, and so on, each of which can be further refined; in some cases a transaction sample therefore includes feature data of a thousand or even several thousand dimensions. As another example, the feature data of an access sample may include the network address of the access initiator, personal information, initiation time, initiation location, the network address of the access target, related access records, and so on; the dimensionality of a normal access sample is therefore also usually between several hundred and over a thousand.
For the sample set H formed by these high-dimensional historical samples, multiple (N) dimensionality reduction methods (P1, P2, ... Pi, ... PN) are applied to H respectively to obtain multiple (N) low-dimensional historical sample sets (L1, L2, ... Li, ... LN). Specifically, any dimensionality reduction method Pi among the N methods processes the high-dimensional historical sample set H into a low-dimensional historical sample set Li of dimension Di. The dimensions of the N low-dimensional historical sample sets may differ from one another, but all are smaller than the dimension of the original samples.
FIG. 3 schematically shows the case N = 4. More specifically, in FIG. 3 the multiple dimensionality reduction methods are the first, second, third, and fourth dimensionality reduction methods, which respectively reduce the original high-dimensional historical sample set to obtain the first, second, third, and fourth low-dimensional sample sets with dimensions D1, D2, D3, and D4.
The above dimensionality reduction may employ various known dimensionality reduction algorithms, as well as algorithms that may be adopted in the future.
In one embodiment, the multiple dimensionality reduction methods include computational dimensionality reduction methods. A computational method performs linear or nonlinear operations on the feature data of the original high-dimensional sample to obtain a processed sample of reduced dimension. In general, a feature value in the processed sample does not directly correspond to a single feature of the original sample, but is the result of a joint operation over multiple original features.
For example, the computational methods include the principal component analysis (PCA) method. PCA transforms the original n-dimensional data, via a linear orthogonal transformation, into a representation whose dimensions are linearly uncorrelated; in the transformed result the first principal component has the largest variance, and each subsequent component has the largest variance subject to being orthogonal to the preceding components.
More specifically, suppose there are m items of n-dimensional data. According to the PCA method, the original data is first arranged by columns into an n-row, m-column matrix X; each row of X is zero-meaned (the mean of that row is subtracted); then its covariance matrix C is computed, along with the eigenvalues of C and the corresponding eigenvectors. The eigenvectors are then arranged as rows of a matrix, ordered from top to bottom by decreasing eigenvalue; the first k rows form the matrix P, and Y = PX is taken as the final data reduced to k dimensions.
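The PCA steps just listed can be sketched directly with numpy (the function name is illustrative; input is assumed as one sample per row, transposed internally to match the n-row, m-column convention above):

```python
import numpy as np

def pca_reduce(data, k):
    """PCA following the steps above: X has one sample per column,
    rows are zero-meaned, C = (1/m) X X^T, and the top-k eigenvectors
    of C (by eigenvalue) form P, giving the reduced data Y = P X."""
    X = np.asarray(data, float).T          # n x m: one sample per column
    X = X - X.mean(axis=1, keepdims=True)  # zero-mean each row (feature)
    m = X.shape[1]
    C = (X @ X.T) / m                      # covariance matrix
    vals, vecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    P = vecs[:, ::-1][:, :k].T             # top-k eigenvectors as rows
    return P @ X                           # k x m reduced representation
```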
In a specific example, the computational methods include the least absolute shrinkage and selection operator (LASSO) method. LASSO is a shrinkage estimator whose basic idea is to minimize the residual sum of squares under the constraint that the sum of the absolute values of the regression coefficients is less than a constant.
In a specific example, certain transform operations in mathematical wavelet analysis can remove interfering data and also reduce dimensionality, so wavelet analysis can likewise be used as a computational dimensionality reduction method.
Other examples of computational dimensionality reduction include the linear discriminant analysis (LDA) method, Laplacian eigenmaps, singular value decomposition (SVD) of matrices, locally linear embedding (LLE), and so on.
In one embodiment, the multiple dimensionality reduction methods may also include feature-sampling methods, also called feature selection. A feature-sampling method selects and samples a subset of the feature data of the original high-dimensional sample, forming a feature subset that constitutes the reduced processed sample. In this case a feature value in the processed sample can correspond directly to a feature of the original sample. It will be appreciated that various sampling methods can be used.
In a specific example, random sampling is used: a subset of features is randomly selected from the original high-dimensional sample to form the processed sample. In another example, hash sampling is used: a hash operation is applied to the original high-dimensional sample, and which feature data to select is determined according to the result of the hash operation.
In the field of feature selection, a feature evaluation function is introduced to evaluate how representative and predictive the selected feature subset is of the original high-dimensional sample. According to the evaluation function used, feature sampling can be divided into filter-style and wrapper-style feature selection. Filter methods generally do not rely on a specific learning algorithm to evaluate the feature subset; instead they evaluate the predictive power of each feature from the intrinsic properties of the data set and pick several top-ranked features to form the subset, on the view that the optimal subset consists of features that are individually highly predictive. Wrapper methods embed a subsequent learning algorithm into the feature selection process and judge a feature subset by testing its predictive performance on that algorithm, paying little attention to the predictive performance of individual features within the subset.
Examples of both computational and feature-sampling dimensionality reduction methods have been listed above. A person skilled in the art can choose appropriate methods according to the characteristics of the data to be processed, for example the dimensionality, data distribution, and data structure of the historical samples in the historical sample set of FIG. 3.
For example, in the example of FIG. 3, suppose each historical sample in the acquired normal historical sample set has 1000 dimensions. Then, in one example, the first dimensionality reduction method may use PCA to obtain a 100-dimensional first low-dimensional sample set; the second may use random sampling to obtain a 300-dimensional second set; the third may use hash sampling to obtain a 200-dimensional third set; and so on.
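The two feature-sampling reductions mentioned in this example can be sketched as follows. Both helpers are hypothetical: the text does not fix how the random subset is frozen or how the hash result maps to features, so the fixed seed and the bucket-sum rule below are assumptions for illustration.

```python
import numpy as np

def random_sample_features(sample, k, seed=0):
    """Random-sampling reduction: keep a fixed random subset of k
    original features (the subset is frozen by `seed` so training and
    prediction apply the same projection)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(sample), size=k, replace=False)
    return np.asarray(sample, float)[np.sort(idx)]

def hash_sample_features(sample, k):
    """Hash-sampling sketch: each feature index is hashed into one of
    k buckets and the bucket values are summed (an assumed rule; the
    text only says selection follows a hash operation)."""
    out = np.zeros(k)
    for i, v in enumerate(np.asarray(sample, float)):
        out[hash(i) % k] += v
    return out
```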
It will be appreciated that the above is merely an example. In different embodiments the number N of dimensionality reduction methods may be larger or smaller, the specific methods used may differ from the above, and the reduction ratio of any specific method may also be set differently.
Having reduced the original historical sample set H in these ways, learning can be performed with the SVDD method on the N resulting low-dimensional historical sample sets. Specifically, for the i-th low-dimensional historical sample set Li (of dimension Di) obtained by reduction method Pi, the support vector domain description (SVDD) method determines a hypersphere Qi in the space of dimension Di such that the relationship between the number of low-dimensional historical samples enclosed by Qi and the radius of the hypersphere satisfies a predetermined condition. As described above, following the central idea of the SVDD method, the hypersphere Qi is determined through formulas (1) and (2), so that Qi encloses as many low-dimensional historical samples as possible while having as small a radius as possible; the relationship between the number of enclosed samples and the radius is set through the parameter C.
In this way, a corresponding hypersphere Qi is established for each low-dimensional historical sample set Li. As noted above, the hypersphere Qi so determined encloses most, but not absolutely all, of the low-dimensional historical samples in the corresponding space, so some normal samples may still fall outside it. Accordingly, in one embodiment, the training phase also computes statistics of the distance distribution of all samples of the normal historical sample set relative to the hypersphere Qi in each low-dimensional space, for example the average and maximum distances of all normal samples from the center of Qi in each space. These distance statistics can be used to determine the decision threshold needed in the prediction phase.
Through the above process, the normal historical sample set is learned with SVDD models. More specifically, multiple dimensionality reduction methods are applied to the historical sample set, and for each reduced sample set an SVDD model establishes a corresponding hypersphere. Because several different reduction methods are used, the features obtained by each method complement one another, minimizing the information loss caused by dimensionality reduction; at the same time, the reduction makes the application of SVDD practical and feasible, avoiding the computational obstacles of a dimension "explosion".
On this basis, the model built in the above manner can be used to predict unknown samples.
FIG. 4 is a flowchart of a method for predicting abnormal samples according to one embodiment; the method may be executed by the computing platform shown in FIG. 1. As shown in FIG. 4, the method of this embodiment includes the following steps. Step 41: acquire a sample to be tested, including feature data whose dimension is a first number. Step 42: apply multiple dimensionality reduction methods to the sample to be tested respectively to obtain multiple processed samples, wherein the i-th method Pi processes the sample into a processed sample Si of dimension Di, the dimension Di being smaller than the first number. Step 43: input the processed samples into the corresponding processing models to obtain a score for each, wherein the i-th processing model Mi scores the corresponding processed sample Si based on the hypersphere Qi determined in advance in the corresponding dimensional space by the SVDD method. Step 44: determine the composite score of the sample to be tested from the scores of the individual processed samples. Step 45: determine, according to the composite score, whether the sample to be tested is abnormal. The specific execution of these steps is described below.
First, in step 41, the sample T to be tested is acquired. It will be understood that the sample to be tested is a sample of unknown class and, like the historical samples used for model training, is high-dimensional; more specifically, it has the same dimension as the historical samples in the aforementioned historical sample set, referred to here as the first number.
Next, in step 42, multiple dimensionality reduction methods are applied to the sample T respectively to obtain multiple processed samples. It will be understood that these methods correspond one-to-one with those of the training phase. Specifically, the i-th method Pi processes the sample T into a processed sample Si of dimension Di, smaller than the first number; the dimensions of the processed samples obtained by the different methods may differ from one another but are all smaller than the original dimension (the first number) of the sample to be tested.
As mentioned above, the dimensionality reduction may employ various known and future algorithms, including computational methods (one or more of PCA, LASSO, wavelet analysis, LDA, Laplacian eigenmaps, SVD, LLE, and so on) and feature-sampling methods (one or more of random sampling, hash sampling, filter-style feature selection, wrapper-style feature selection, and so on); their specific descriptions are given above and are not repeated here. It should be understood, however, that whichever methods are used, the methods employed in step 42 must be consistent with those of the historical-sample learning phase.
After the sample to be tested has been reduced to multiple processed samples, in step 43 the processed samples are input into the corresponding processing models to obtain a score for each, wherein the i-th processing model Mi scores the corresponding processed sample Si based on the hypersphere Qi determined in advance in the corresponding dimensional space by the SVDD method.
It will be understood that in the historical-sample learning phase, the SVDD method has already determined a corresponding hypersphere Qi in each reduced space, with as small a radius as possible while enclosing as many of the reduced historical samples as possible. The hypersphere Qi can therefore be used to judge or predict how likely a currently input processed sample of the same dimension is to be an abnormal sample. Specifically, in one embodiment, this likelihood is measured by scoring.
In one embodiment, the process by which model Mi scores sample Si may include: determining the positional relationship between Si and the hypersphere Qi in the corresponding dimensional space; and determining the score of Si according to that relationship.
More specifically, the positional relationship may be whether Si lies outside, inside, or on the hypersphere Qi. For example, in one illustration, the model Mi may be set so that if Si lies inside Qi in the corresponding dimensional (Di-dimensional) space its score is 1; if it lies exactly on the surface of Qi (analogous to a support vector) its score is 0.5; and if it lies outside Qi its score is 0. A higher score thus means Si is more likely to correspond to a normal sample.
In one embodiment, the positional relationship may be the distance to the hypersphere Qi. More specifically, in one example, the model Mi scores the sample Si according to its distance d from the center of Qi in the corresponding dimensional space. The sample Si and the center of the hypersphere each correspond to a point in the space of dimension Di, so d can be computed as the distance between two points in multi-dimensional space. With d computed, the score Gi of Si can, for example, be calculated by the following formula:
Gi = exp(-d)   (3)
According to formula (3), Gi lies between 0 and 1; the smaller the distance of Si from the center of Qi (the closer to the center), the smaller d and the larger Gi, while the greater the distance from the center, the larger d and the smaller Gi. A larger Gi thus means Si has a higher probability of corresponding to a normal sample.
In another embodiment, the distance between Si and the hypersphere is computed differently. For example, under some SVDD models the obtained hypersphere Qi does not correspond to an isotropic sphere but has different "radii" in different directions, similar to an ellipsoid in three-dimensional space. In such a case, in one embodiment, it is first judged whether Si lies inside Qi; if so, Gi = 1; if Si lies outside Qi, the distance d from Si to the nearest surface of Qi in the corresponding dimensional space is computed and Gi is calculated from d using formula (3).
The above are merely specific examples of scoring a processed sample based on a predetermined hypersphere. Having read these examples, those skilled in the art may modify, replace, combine, or extend them to adopt further scoring methods, all of which fall within the concept of this specification. Moreover, in step 43 the processing models score the processed samples of their respective dimensions, and the scoring algorithms they use may be the same or different.
With each model Mi having scored its sample Si in step 43, in step 44 the composite score of the sample to be tested is determined from the scores of the individual processed samples.
According to one embodiment, in this step the scores are summed directly and the sum is taken as the composite score. In another embodiment, a weight is assigned in advance to the score of each processing model, and the scores of the processed samples are weighted and summed according to those weights to obtain the composite score.
Then, in step 45, it is determined from the composite score whether the sample to be tested is abnormal. According to one embodiment, the composite score determined in step 44 is compared with a predetermined threshold, and the comparison result decides whether the sample is abnormal. It will be understood that the threshold depends on the scoring algorithm of step 43; for example, under formula (3) a higher composite score means the sample is more likely normal. In one embodiment, the threshold may be determined from statistics of the sample learning phase: as described above, the learning phase may compute the distance distribution of all normal historical samples relative to Qi in each low-dimensional space; from these distance statistics, composite scores are computed for the normal historical samples with a scoring algorithm consistent with step 43, and the decision threshold is determined from the resulting statistics. With such a threshold, step 45 can decide by simple comparison whether the sample to be tested is abnormal.
FIG. 5 is a schematic diagram of a process of predicting abnormal samples according to one embodiment, and the prediction process of FIG. 5 is implemented on the basis of the prediction model built in FIG. 3. As shown in FIG. 5, the high-dimensional sample T to be tested is first acquired; its dimension equals that of the samples in the historical sample set H of FIG. 3. Multiple dimensionality reduction methods are then applied to T to obtain multiple processed samples. Specifically, corresponding to FIG. 3, these are the first, second, third, and fourth dimensionality reduction methods, which respectively reduce the original sample T to obtain the first, second, third, and fourth processed samples of dimensions D1, D2, D3, and D4.
The first, second, third, and fourth processed samples are then input into the first, second, third, and fourth processing models respectively to obtain a score for each. Specifically, the first processing model assigns score G1 to the first processed sample based on the hypersphere Q1 obtained in FIG. 3; the second model assigns G2 based on Q2; the third model assigns G3 based on Q3; and the fourth model assigns G4 based on Q4. The composite score G of the sample to be tested is then determined from the scores G1-G4, and whether the sample is abnormal is determined from G.
Through the above process, the sample to be tested is reduced in dimension by multiple methods, each resulting processed sample is scored based on the hypersphere established by an SVDD model, and finally whether the sample is abnormal is decided from the combined result of the scores. Because several different dimensionality reduction methods are used, their features complement one another, minimizing the information loss of dimensionality reduction; at the same time, the reduction makes application of the SVDD model practical and feasible, avoiding the computational obstacles of a dimension "explosion". On this basis, by jointly considering the results of the individual SVDD models, the sample to be tested can be comprehensively evaluated and accurately predicted.
According to an embodiment of another aspect, an apparatus for predicting abnormal samples is also provided. FIG. 6 is a schematic block diagram of such an apparatus according to one embodiment. As shown in FIG. 6, the apparatus 600 includes: an acquisition unit 61 configured to acquire a sample to be tested including feature data whose dimension is a first number; a plurality of dimensionality reduction units 62 that respectively apply a plurality of dimensionality reduction methods to the sample to be tested to obtain a plurality of processed samples, wherein the i-th method Pi processes the sample into a processed sample Si of dimension Di, Di being smaller than the first number; a plurality of scoring units 63 configured to score the processed samples through a plurality of processing models, wherein the i-th model Mi scores the corresponding sample Si based on the hypersphere Qi determined in advance in the space of dimension Di by the SVDD method; a synthesis unit 64 configured to determine the composite score of the sample to be tested from the scores of the processed samples; and a determination unit 65 configured to determine from the composite score whether the sample is abnormal. In FIG. 6 the dimensionality reduction units 62 and scoring units 63 are each schematically shown as three, but it will be understood that their numbers can be set as needed and are not limited to the illustration of FIG. 6.
In one embodiment, the dimensionality reduction methods employed by the units 62 include at least one of a computational dimensionality reduction method and a feature-sampling dimensionality reduction method.
According to one embodiment, the computational methods include one or more of: the principal component analysis (PCA) method, the least absolute shrinkage and selection operator (LASSO) method, the linear discriminant analysis (LDA) method, and wavelet analysis methods.
According to one embodiment, the feature-sampling methods include one or more of: random sampling, hash sampling, filter-style feature selection, and wrapper-style feature selection.
In one embodiment, the i-th processing model Mi is trained by an i-th training device (not shown), which includes: a sample-set acquisition module configured to acquire a historical sample set known to be normal, whose sample dimension is the first number; an i-th dimensionality reduction module configured to apply the i-th method Pi to process the historical sample set into a low-dimensional historical sample set Li of sample dimension Di; and a hypersphere determination module configured to determine, by the SVDD method, the hypersphere Qi in the space of dimension Di such that the relationship between the number of samples of Li enclosed by Qi and the radius of the hypersphere satisfies a predetermined condition.
According to one embodiment, the scoring units 63 are configured to: determine the relative position of the processed sample Si with respect to the hypersphere Qi in the space of the corresponding dimension Di; and determine the score of Si from that relative position.
In one embodiment, the relative position includes one of the following:
whether the processed sample Si lies outside, inside, or on the hypersphere Qi;
the distance of the processed sample Si from the center of the hypersphere Qi in the corresponding dimensional space;
the distance of the processed sample Si from the nearest surface of the hypersphere Qi in the corresponding dimensional space.
According to one embodiment, the synthesis unit 64 is configured to perform a weighted summation of the scores of the individual processed samples to obtain the composite score.
With the above apparatus, multiple dimensionality reduction methods are applied to the sample to be tested, complementing one another to avoid the information loss of dimensionality reduction; the resulting processed samples are each scored based on the hypersphere established by an SVDD model, and whether the sample is abnormal is finally decided from the combined result of the scores. The results of the individual SVDD models are thus jointly considered to evaluate the sample comprehensively and predict accurately.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when executed in a computer, the program causes the computer to perform the method described with reference to FIGS. 3 to 5.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor; the memory stores executable code, and the processor, when executing the code, implements the method described with reference to FIGS. 3 to 5.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof; when implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

  1. A method for predicting abnormal samples, comprising:
    acquiring a sample to be tested, the sample to be tested comprising feature data whose dimension is a first number;
    applying a plurality of dimensionality reduction methods to the sample to be tested respectively to obtain a plurality of processed samples, wherein an i-th dimensionality reduction method Pi among the plurality of dimensionality reduction methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
    inputting the plurality of processed samples correspondingly into a plurality of processing models to obtain a score of each processed sample, wherein an i-th processing model Mi among the plurality of processing models scores the corresponding processed sample Si based on a hypersphere Qi determined in advance in a space of dimension Di using a support vector domain description (SVDD) method;
    determining a composite score of the sample to be tested according to the scores of the individual processed samples; and
    determining, according to the composite score, whether the sample to be tested is an abnormal sample.
  2. The method according to claim 1, wherein the plurality of dimensionality reduction methods comprise at least one of a computational dimensionality reduction method and a feature-sampling dimensionality reduction method.
  3. The method according to claim 2, wherein the computational dimensionality reduction method comprises one or more of the following: a principal component analysis (PCA) method, a least absolute shrinkage and selection operator (LASSO) method, a linear discriminant analysis (LDA) method, and a wavelet analysis method.
  4. The method according to claim 2, wherein the feature-sampling dimensionality reduction method comprises one or more of the following: a random sampling method, a hash sampling method, a filter-style feature selection method, and a wrapper-style feature selection method.
  5. The method according to claim 1, wherein the i-th processing model Mi is trained by the following steps:
    acquiring a historical sample set known to be normal, a sample dimension of the historical sample set being the first number;
    applying the i-th dimensionality reduction method Pi to process the historical sample set into a low-dimensional historical sample set Li with sample dimension Di; and
    determining, using the support vector domain description (SVDD) method, the hypersphere Qi in the space of dimension Di such that a relationship between the number of samples of the low-dimensional historical sample set Li enclosed by the hypersphere Qi and a radius of the hypersphere satisfies a predetermined condition.
  6. The method according to claim 1, wherein scoring the corresponding processed sample Si comprises:
    determining a relative position of the processed sample Si with respect to the hypersphere Qi in the corresponding dimensional space; and
    determining the score of the processed sample Si according to the relative position.
  7. The method according to claim 6, wherein the relative position comprises one of the following:
    the processed sample Si lying outside, inside, or on the hypersphere Qi;
    a distance of the processed sample Si from the center of the hypersphere Qi in the corresponding dimensional space;
    a distance of the processed sample Si from the nearest surface of the hypersphere Qi in the corresponding dimensional space.
  8. The method according to claim 1, wherein determining the composite score of the sample to be tested according to the scores of the individual processed samples comprises: performing a weighted summation of the scores of the individual processed samples to obtain the composite score.
  9. An apparatus for predicting abnormal samples, comprising:
    an acquisition unit configured to acquire a sample to be tested, the sample to be tested comprising feature data whose dimension is a first number;
    a plurality of dimensionality reduction units that respectively apply a plurality of dimensionality reduction methods to the sample to be tested to obtain a plurality of processed samples, wherein an i-th dimensionality reduction method Pi among the plurality of dimensionality reduction methods processes the sample to be tested into a processed sample Si of dimension Di, the dimension Di being smaller than the first number;
    a plurality of scoring units configured to score the plurality of processed samples through a plurality of processing models, wherein an i-th processing model Mi among the plurality of processing models scores the corresponding processed sample Si based on a hypersphere Qi determined in advance in a space of dimension Di using a support vector domain description (SVDD) method;
    a synthesis unit configured to determine a composite score of the sample to be tested according to the scores of the individual processed samples; and
    a determination unit configured to determine, according to the composite score, whether the sample to be tested is an abnormal sample.
  10. The apparatus according to claim 9, wherein the plurality of dimensionality reduction methods comprise at least one of a computational dimensionality reduction method and a feature-sampling dimensionality reduction method.
  11. The apparatus according to claim 10, wherein the computational dimensionality reduction method comprises one or more of the following: a principal component analysis (PCA) method, a least absolute shrinkage and selection operator (LASSO) method, a linear discriminant analysis (LDA) method, and a wavelet analysis method.
  12. The apparatus according to claim 10, wherein the feature-sampling dimensionality reduction method comprises one or more of the following: a random sampling method, a hash sampling method, a filter-style feature selection method, and a wrapper-style feature selection method.
  13. The apparatus according to claim 9, wherein the i-th processing model Mi is trained by an i-th training device, the i-th training device comprising:
    a sample-set acquisition module configured to acquire a historical sample set known to be normal, a sample dimension of the historical sample set being the first number;
    an i-th dimensionality reduction module configured to apply the i-th dimensionality reduction method Pi to process the historical sample set into a low-dimensional historical sample set Li with sample dimension Di; and
    a hypersphere determination module configured to determine, using the support vector domain description (SVDD) method, the hypersphere Qi in the space of dimension Di such that a relationship between the number of samples of the low-dimensional historical sample set Li enclosed by the hypersphere Qi and a radius of the hypersphere satisfies a predetermined condition.
  14. The apparatus according to claim 9, wherein the plurality of scoring units are configured to:
    determine a relative position of the processed sample Si with respect to the hypersphere Qi in the space of the corresponding dimension Di; and
    determine the score of the processed sample Si according to the relative position.
  15. The apparatus according to claim 14, wherein the relative position comprises one of the following:
    the processed sample Si lying outside, inside, or on the hypersphere Qi;
    a distance of the processed sample Si from the center of the hypersphere Qi in the corresponding dimensional space;
    a distance of the processed sample Si from the nearest surface of the hypersphere Qi in the corresponding dimensional space.
  16. The apparatus according to claim 9, wherein the synthesis unit is configured to perform a weighted summation of the scores of the individual processed samples to obtain the composite score.
  17. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-8.
  18. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-8.
PCT/CN2019/073411 2018-03-15 2019-01-28 预测异常样本的方法和装置 WO2019174419A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
SG11202005823VA SG11202005823VA (en) 2018-03-15 2019-01-28 Abnormal sample prediction method and apparatus
US16/888,575 US11222046B2 (en) 2018-03-15 2020-05-29 Abnormal sample prediction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810215700.0A CN108595495B (zh) 2018-03-15 2018-03-15 预测异常样本的方法和装置
CN201810215700.0 2018-03-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/888,575 Continuation US11222046B2 (en) 2018-03-15 2020-05-29 Abnormal sample prediction

Publications (1)

Publication Number Publication Date
WO2019174419A1 true WO2019174419A1 (zh) 2019-09-19

Family

ID=63626416

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073411 WO2019174419A1 (zh) 2018-03-15 2019-01-28 预测异常样本的方法和装置

Country Status (5)

Country Link
US (1) US11222046B2 (zh)
CN (1) CN108595495B (zh)
SG (1) SG11202005823VA (zh)
TW (1) TW201939311A (zh)
WO (1) WO2019174419A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879971A (zh) * 2019-10-23 2020-03-13 上海宝信软件股份有限公司 工业生产设备运行异常情况预测方法及系统
CN112052890A (zh) * 2020-08-28 2020-12-08 华北电力科学研究院有限责任公司 给水泵振动预测方法及装置
WO2021185330A1 (zh) * 2020-03-20 2021-09-23 京东方科技集团股份有限公司 数据增强方法和数据增强装置

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12099571B2 (en) * 2018-01-18 2024-09-24 Ge Infrastructure Technology Llc Feature extractions to model large-scale complex control systems
CN108595495B (zh) 2018-03-15 2020-06-23 阿里巴巴集团控股有限公司 预测异常样本的方法和装置
EP3750115B1 (en) * 2018-04-25 2024-06-19 Samsung Electronics Co., Ltd. Machine learning on a blockchain
CN109583468B (zh) * 2018-10-12 2020-09-22 阿里巴巴集团控股有限公司 训练样本获取方法,样本预测方法及对应装置
CN109461001B (zh) * 2018-10-22 2021-07-09 创新先进技术有限公司 基于第二模型获取第一模型的训练样本的方法和装置
CN111259700B (zh) * 2018-12-03 2024-04-09 北京京东尚科信息技术有限公司 用于生成步态识别模型的方法和装置
CN110188793B (zh) * 2019-04-18 2024-02-09 创新先进技术有限公司 数据异常分析方法及装置
CN110288079B (zh) * 2019-05-20 2023-06-09 创新先进技术有限公司 特征数据获取方法、装置和设备
CN110751643A (zh) * 2019-10-21 2020-02-04 睿视智觉(厦门)科技有限公司 一种水质异常检测方法、装置及设备
CN110807399A (zh) * 2019-10-29 2020-02-18 北京师范大学 一种基于单一类别支持向量机的崩滑隐患点检测方法
CN111062003A (zh) * 2019-12-13 2020-04-24 武汉轻工大学 样本总体协方差判定方法、装置、设备及存储介质
CN111145911B (zh) * 2019-12-20 2024-06-28 深圳平安医疗健康科技服务有限公司 异常数据识别处理方法、装置、计算机设备和存储介质
US20220405634A1 (en) * 2021-06-16 2022-12-22 Moxa Inc. Device of Handling Domain-Agnostic Meta-Learning
CN114609480B (zh) * 2022-05-16 2022-08-16 国网四川省电力公司电力科学研究院 一种电网损耗异常数据检测方法、系统、终端及介质
CN115438035B (zh) * 2022-10-27 2023-04-07 江西师范大学 一种基于kpca和混合相似度的数据异常处理方法
CN117150244B (zh) * 2023-10-30 2024-01-26 山东凯莱电气设备有限公司 基于电参数分析的智能配电柜状态监测方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462184A (zh) * 2014-10-13 2015-03-25 北京系统工程研究所 一种基于双向抽样组合的大规模数据异常识别方法
CN105718876A (zh) * 2016-01-18 2016-06-29 上海交通大学 一种滚珠丝杠健康状态的评估方法
CN107563008A (zh) * 2017-08-08 2018-01-09 三峡大学 基于svd变换和支持向量空间的刀具运行可靠性评估方法
CN107578056A (zh) * 2017-07-04 2018-01-12 华东理工大学 一种整合经典模型用于样本降维的流形学习系统
CN108595495A (zh) * 2018-03-15 2018-09-28 阿里巴巴集团控股有限公司 预测异常样本的方法和装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458522A (zh) * 2009-01-08 2009-06-17 浙江大学 基于主元分析和支持向量数据描述的多工况过程监控方法
CN103810101B (zh) * 2014-02-19 2019-02-19 北京理工大学 一种软件缺陷预测方法和软件缺陷预测系统
CN104077571B (zh) * 2014-07-01 2017-11-14 中山大学 一种采用单类序列化模型的人群异常行为检测方法
US9830558B1 (en) 2016-05-03 2017-11-28 Sas Institute Inc. Fast training of support vector data description using sampling
TWI617997B (zh) 2016-08-01 2018-03-11 Chunghwa Telecom Co Ltd Intelligent object detection assistance system and method
US20190042977A1 (en) * 2017-08-07 2019-02-07 Sas Institute Inc. Bandwidth selection in support vector data description for outlier identification
US11341138B2 (en) * 2017-12-06 2022-05-24 International Business Machines Corporation Method and system for query performance prediction

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879971A (zh) * 2019-10-23 2020-03-13 上海宝信软件股份有限公司 Method and system for predicting abnormal operating conditions of industrial production equipment
CN110879971B (zh) * 2019-10-23 2023-06-13 上海宝信软件股份有限公司 Method and system for predicting abnormal operating conditions of industrial production equipment
WO2021185330A1 (zh) * 2020-03-20 2021-09-23 京东方科技集团股份有限公司 Data augmentation method and data augmentation apparatus
CN112052890A (zh) * 2020-08-28 2020-12-08 华北电力科学研究院有限责任公司 Feedwater pump vibration prediction method and apparatus
CN112052890B (zh) * 2020-08-28 2024-04-02 华北电力科学研究院有限责任公司 Feedwater pump vibration prediction method and apparatus

Also Published As

Publication number Publication date
SG11202005823VA (en) 2020-07-29
US11222046B2 (en) 2022-01-11
TW201939311A (zh) 2019-10-01
CN108595495A (zh) 2018-09-28
US20200293554A1 (en) 2020-09-17
CN108595495B (zh) 2020-06-23

Similar Documents

Publication Publication Date Title
WO2019174419A1 (zh) Method and apparatus for predicting abnormal samples
US11900294B2 (en) Automated path-based recommendation for risk mitigation
Kamalov et al. Outlier detection in high dimensional data
US20200372383A1 (en) Local-adapted minority oversampling strategy for highly imbalanced highly noisy dataset
CN110929840A (zh) Continual learning neural network system using a rolling window
US9249287B2 (en) Document evaluation apparatus, document evaluation method, and computer-readable recording medium using missing patterns
JP4376145B2 (ja) Image classification learning processing system and image identification processing system
Wang et al. An unequal deep learning approach for 3-D point cloud segmentation
CN110717687A (zh) Method and system for obtaining an evaluation index
JP6855604B2 (ja) Method, apparatus, computer device, program and storage medium for predicting short-term profit
Harkat et al. Machine learning-based reduced kernel PCA model for nonlinear chemical process monitoring
Malan et al. Characterising the searchability of continuous optimisation problems for PSO
Tayal et al. Rankrc: Large-scale nonlinear rare class ranking
An et al. A new intrusion detection method based on SVM with minimum within‐class scatter
CN110324178B (zh) Network intrusion detection method based on multiple empirical kernel learning
Koirunnisa et al. Optimized Machine Learning Performance with Feature Selection for Breast Cancer Disease Classification
CN111639688B (zh) Local interpretation method for IoT intelligent models based on linear-kernel SVM
CN117251813A (zh) Network traffic anomaly detection method and system
Farag et al. Inductive Conformal Prediction for Harvest-Readiness Classification of Cauliflower Plants: A Comparative Study of Uncertainty Quantification Methods
Lu et al. Multi-class malware classification using deep residual network with non-softmax classifier
Ikeda et al. New feature engineering framework for deep learning in financial fraud detection
Liu et al. Dimension estimation using weighted correlation dimension method
Khoirunnisa et al. Improving malaria prediction with ensemble learning and robust scaler: An integrated approach for enhanced accuracy
Zhao An evolutionary intelligent data analysis in promoting smart community
Luo et al. FUGNN: Harmonizing Fairness and Utility in Graph Neural Networks

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 19767952
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 19767952
    Country of ref document: EP
    Kind code of ref document: A1