CN114462537A - Screening method and system of data set samples - Google Patents

Screening method and system of data set samples

Info

Publication number
CN114462537A
Authority
CN
China
Prior art keywords
data set
sample
screening
samples
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210122374.5A
Other languages
Chinese (zh)
Inventor
王波
罗杨
候小娥
杨文华
郭飞
万鹏
肖清明
史磊
魏文婷
张治民
王晓康
刘凤星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Ningxia Electric Power Co Wuzhong Power Supply Co
State Grid Ningxia Electric Power Co Ltd
Original Assignee
State Grid Ningxia Electric Power Co Wuzhong Power Supply Co
State Grid Ningxia Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Ningxia Electric Power Co Wuzhong Power Supply Co and State Grid Ningxia Electric Power Co Ltd
Priority to CN202210122374.5A
Publication of CN114462537A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

A method of screening samples of a data set, the method comprising the following steps: step 1, performing multi-dimensional extraction on the data set, and performing outlier screening on all samples in the data set for each dimension; step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result; and step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard. The method is simple and, while reducing the amount of computation, exposes the association among abnormal values, so that the screened samples are more accurate.

Description

Screening method and system of data set samples
Technical Field
The invention relates to the field of data processing, in particular to a method and a system for screening a data set sample.
Background
In the field of data processing, screening data set samples is an initial step of data modeling and subsequent processing. It is a necessary step for data mining, for simplifying data processing algorithms, and for reducing the amount of computation, and it allows key data with potential value to be extracted from massive data. Screening therefore plays a vital role in data processing.
In the prior art, methods for screening data set samples are relatively simple and generally comprise three processes: data extraction, data cleaning, and data loading. The processing actually applied to the originally collected data takes place in data cleaning, which commonly comprises four parts: missing data processing, duplicate data processing, abnormal data processing, and sorting out inconsistent data. Specifically, missing data processing and inconsistent data sorting can only delete original data that is obviously abnormal; they cannot effectively process or screen the more hidden abnormal data that arises during data acquisition.
Furthermore, the abnormal data processing adopted in the prior art is also simple: usually the abnormal data itself is analyzed and the abnormal values are eliminated. In power system data processing, however, a data sample contains not only an abnormal value but also other collected values closely related to that abnormal value. As a result, the simple prior-art methods for removing or screening abnormal values cannot meet the requirements of complex data processing in the power system.
In addition, the prior art cannot screen individual data samples by how frequently abnormal values occur in them. In other words, although the prior art can screen out multiple single abnormal values that are unrelated to each other, it cannot obtain the multiple mutually related abnormal values of one data sample, and so cannot accurately screen samples according to how severe their abnormality is.
To address these problems, the invention provides a method and a system for screening data set samples.
Disclosure of Invention
In order to solve the defects in the prior art, the invention aims to provide a method and a system for screening a data set sample, which are used for carrying out dimension extraction and abnormal value screening on a data set and realizing sample screening aiming at the abnormal frequency of the sample.
The invention adopts the following technical scheme.
In a first aspect, the invention relates to a method of screening data set samples, wherein the method comprises the following steps: step 1, performing multi-dimensional extraction on a data set, and screening all samples in the data set for abnormal values in each dimension; step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result; and step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.
Preferably, in step 1, after multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3; the N dimensions are extracted based on the column names shared by all samples of the data set.
Preferably, the outlier screening method is a boxplot outlier screening method or a two-eight law (80/20 rule) method.
Preferably, in step 2, the ranking method is a principal component analysis method and an entropy weight method.
Preferably, the preferred data set is obtained by: selecting the top 20% of samples from the sorting result; or, selecting the last 20% of samples from the sorting result; alternatively, the sample is selected based on expert opinion.
Preferably, in step 3, the method for calculating the abnormal frequency of each sample in the preferred data set comprises: step 3.1, based on the outliers $[Z_1, Z_2, \dots, Z_n, \dots, Z_N]$ of all samples in each dimension acquired in step 1 and the preferred data set $P$ acquired in step 2, calculating the outliers $[E_1, E_2, \dots, E_n, \dots, E_N]$ of the preferred data set under each dimension $n$, where $n = 1, 2, \dots, N$; step 3.2, taking the union of the outliers $E_n$ over all the different dimensions to obtain an ordinary abnormal sample set; step 3.3, taking the intersection of the outliers $E_n$ over all the different dimensions to obtain a key abnormal sample set.
Preferably, in step 3.3, based on the frequency M of the abnormal value specified in the preset standard, a dimension number equal to the number of the specified frequency M of the abnormal value is selected, and an intersection is taken for a plurality of abnormal values under the dimension number, so as to obtain a key abnormal sample set.
Preferably, the method for intersecting the plurality of abnormal values under the dimension number is to select any $M$ dimensions from the $N$ dimensions and, for each of the $\binom{N}{M}$ possible selections, take the intersection of the outliers $E_n$ of the selected dimensions; the samples included in these intersections are the samples in the key abnormal sample set.
Preferably, the predetermined criteria generates a specified frequency of outliers based on the screening requirements of the data set samples.
In a second aspect, the present invention relates to an apparatus for screening samples of a data set, wherein the apparatus comprises a processor for implementing a method for screening samples of a data set as described in the first aspect of the present invention.
Compared with the prior art, the method and system for screening data set samples have the advantage that dimension extraction and abnormal value screening can be performed on the data set, and the final sample screening can be carried out according to the abnormal frequency of the samples. The method is simple and, while reducing the amount of computation, exposes the association among abnormal values, so that the screened samples are more accurate.
Drawings
FIG. 1 is a schematic diagram of the steps of a method for screening a data set sample according to the present invention.
Detailed Description
The present application is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.
FIG. 1 is a schematic diagram of the steps of a method for screening a data set sample according to the present invention. As shown in fig. 1, a first aspect of the present invention relates to a method for screening a data set sample, wherein the method comprises steps 1 to 3.
Step 1, carrying out multi-dimensional extraction on a data set, and carrying out outlier screening on all samples in the data set aiming at each dimension.
It will be appreciated that, before the data set is generated, the data may first be pre-processed to screen out unnecessary content. For example, if the relevant equipment collects data at a higher frequency than the method of the invention requires, the surplus data can be screened out in advance. This pre-screening can use any method commonly used in the art.
In addition, the data set of the present invention may come from multiple different sources, which may include multiple different acquisition devices or multiple different data set processing devices. Multiple partial data tables can therefore be obtained in advance and merged, in a manner commonly used in the prior art, to generate the data set of the invention.
It should be noted that, because the different acquisition devices or data set processing devices (for example, processors) differ in type, specification, configuration, and so on, the contents of the partial data tables may not be identical: some tables may include more column names and others fewer. When the tables are merged, the column names of all partial data tables can be combined. However, for a column name that a given partial data table does not include, the values filled in for the corresponding data samples should be annotated as filled-in content, so that subsequent processing does not identify them as abnormal items.
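The merging rule described above can be sketched as follows. This is a minimal illustration only, not part of the original disclosure: the function name `merge_tables`, the example column names, and the use of `None` as the fill value are all assumptions. Filled-in cells are recorded in a separate set so that later outlier screening can skip them.

```python
# Hypothetical fill value for cells that a partial table did not provide.
FILLED = None

def merge_tables(tables):
    """Merge partial tables (lists of row dicts) over the union of their columns.

    Returns (columns, rows, filled), where `filled` holds the
    (row_index, column_name) pairs that were filled in during the merge.
    """
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)

    rows, filled = [], set()
    for table in tables:
        for row in table:
            index = len(rows)
            merged = {}
            for name in columns:
                if name in row:
                    merged[name] = row[name]
                else:
                    merged[name] = FILLED
                    filled.add((index, name))  # annotate the filled-in cell
            rows.append(merged)
    return columns, rows, filled

# Two partial tables with different column sets (made-up values).
table_a = [{"id": 1, "avg_load_rate": 0.42, "outage_count": 3}]
table_b = [{"id": 2, "avg_load_rate": 0.57}]
columns, rows, filled = merge_tables([table_a, table_b])
print(columns)  # ['id', 'avg_load_rate', 'outage_count']
print(filled)   # {(1, 'outage_count')}
```

Keeping the annotation outside the rows themselves is one possible design; a per-cell flag column would serve the same purpose.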
In addition, the method of the present invention can be applied to many kinds of data; in other words, it does not limit the content of the data at all, nor does it place many constraints on the association between different data samples. However, every data sample should include at least several data contents (hereinafter also referred to as data dimensions), and these dimensions should be the same across samples.
In one embodiment of the present invention, the data samples in the data set are basic data collected from different station areas in the power system. Specifically, the data content includes the unit to which the station area belongs, the station area name, the commissioning date, the per-household distribution transformer capacity, the heavy-load abnormal count, the overload abnormal count, the low-voltage abnormal count, critical defects, the power outage count, the average load rate, the acquisition success rate, and other related contents. Of course, the partial data tables generated by different acquisition or processing devices may include further contents, which may or may not be associated with the data set of the invention. In the present invention, however, all samples in the data set should include the dimensions listed above: the unit, the station area name, the commissioning date, the per-household distribution transformer capacity, the heavy-load abnormal count, the overload abnormal count, the low-voltage abnormal count, critical defects, the power outage count, the average load rate, and the acquisition success rate.
Preferably, in step 1, after multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3; the N dimensions are extracted based on the column names shared by all samples of the data set.
In other words, in the present invention, a single sample in the data set may include more than N data contents, and may appear in the data table with more than N column names; however, every sample in the data set should include data for the N column names represented by the extracted dimensions, and that data should all be valid content that was actually collected.
It can be seen that the dimensions should be extracted from all samples of the data set according to this principle. To achieve the sample screening purpose of the invention effectively, and to support the subsequent calculations of step 2 and step 3, at least 3 dimensions should be obtained in this step.
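As a rough sketch of this extraction principle, the snippet below keeps only the column names for which every sample holds a collected value and enforces the minimum of 3 dimensions. The function name `extract_dimensions`, the example columns, and the convention that `None` marks a missing or filled-in value are assumptions made for illustration.

```python
def extract_dimensions(rows, min_dims=3):
    """Return the column names for which every sample has a collected value."""
    if not rows:
        raise ValueError("the data set contains no samples")
    dims = [
        name
        for name in rows[0]
        # Assumption: None marks a value that was not actually collected.
        if all(row.get(name) is not None for row in rows)
    ]
    if len(dims) < min_dims:
        raise ValueError(f"only {len(dims)} shared dimensions, need {min_dims}")
    return dims

# Made-up samples: one column has a missing value, one exists in a single row.
samples = [
    {"load_rate": 0.42, "outage_count": 3, "success_rate": 98.5, "low_voltage_count": 0},
    {"load_rate": 0.57, "outage_count": 1, "success_rate": 97.0, "low_voltage_count": None},
    {"load_rate": 0.61, "outage_count": 0, "success_rate": 100.0, "extra": 7},
]
dims = extract_dimensions(samples)
print(dims)  # ['load_rate', 'outage_count', 'success_rate']
```

The sketch does not distinguish numeric dimensions from identifier columns such as names; a real pipeline would need that selection as well.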
Preferably, the outlier screening method is a boxplot outlier screening method or a two-eight law (80/20 rule) method.
It should be noted that, after the dimensions have been extracted, abnormal values are screened out from all samples for each dimension. The screening method may be a boxplot outlier screening method or a two-eight law method, or any other abnormal value screening method commonly used in the prior art.
In the present invention, for example, the boxplot outlier screening method is adopted and applied N times, once per dimension, to determine in turn whether the value of that dimension in each data sample is abnormal.
It should be noted that the boxplot outlier method and the two-eight law generally analyze the overall trend of the values of all samples in the current dimension and screen out the small portion of values that deviate from it. Of course, in the present invention the abnormal range for a given dimension may also be set according to the nature of the extracted data itself. For example, if the acquisition success rate is taken as the current dimension, a higher value is better, so an acquisition success rate of 100 should not be treated as abnormal data. In practice, however, suppose the data set contains 50 samples, of which 48 have an acquisition success rate of 98.5, one has 100, and one has 97. If only the boxplot outlier method or the two-eight law is considered, the outliers would be the two samples with acquisition success rates of 100 and 97. In such a case, parameters such as the upper limit, lower limit, and median of the boxplot outlier method should be preset according to the attributes of the selected dimension, or the parameters of the two-eight law algorithm should likewise be preset based on the attributes of the dimension.
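The acquisition success rate example can be reproduced with a short sketch. It assumes the conventional boxplot fences at Q1 - 1.5·IQR and Q3 + 1.5·IQR, which the text does not specify, and it adds a hypothetical `side` parameter to express the idea of presetting the limits from the nature of the dimension.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5, side="both"):
    """Return the indices of values outside the boxplot fences.

    side="both" flags values below Q1 - k*IQR or above Q3 + k*IQR;
    side="low" or side="high" flags only one direction.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = set()
    for index, value in enumerate(values):
        if side in ("both", "low") and value < low:
            flagged.add(index)
        if side in ("both", "high") and value > high:
            flagged.add(index)
    return flagged

# 48 samples at 98.5, one at 100 and one at 97, as in the example above.
rates = [98.5] * 48 + [100.0, 97.0]
both = boxplot_outliers(rates)
low_only = boxplot_outliers(rates, side="low")
print(both)      # {48, 49}: both 100 and 97 are flagged
print(low_only)  # {49}: only the low success rate of 97 is flagged
```

With the two-sided fences the perfect score of 100 is flagged alongside 97; restricting the check to the low side is one way to encode that a higher success rate is better.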
Through step 1, the invention first obtains, for each dimension of the data set, the sample outliers $[Z_1, Z_2, \dots, Z_n, \dots, Z_N]$. Here $N$ is the total number of selected dimensions, and $Z_n$ contains information identifying all samples that have an abnormal value in the current dimension $n$, such as the serial numbers of those samples. For the convenience of subsequent calculations, $Z_n$ need not include the abnormal values themselves.
Step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result.
Preferably, in step 2, the ranking method is a principal component analysis method and an entropy weight method.
It should be noted that the ranking method in the present invention can be implemented with reference to the principal component analysis method or the entropy weight method of the prior art. Specifically, each dimension selected in the present invention, or a dimension not selected, may be used as an evaluation index, thereby generating a comprehensive evaluation matrix. The samples in the data set are then treated as principal components, and the prior-art principal component analysis method is used to calculate a principal component characteristic value for each sample. The samples are ranked by the size of this characteristic value.
In the present invention, the principal component characteristic value can be calculated using the entropy weight method. For example, the proportion $P_{ik}$ of the score of the $i$-th sample under the $k$-th evaluation index may be solved first, and then the entropy $E_k$ and the weight $A_k$ of the $k$-th evaluation index may be calculated, thereby generating the characteristic value $\sum E_k A_k$ of the $i$-th sample.
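A minimal sketch of the entropy weight calculation follows. The text names $P_{ik}$, $E_k$ and $A_k$ but does not give their formulas, so the sketch uses the commonly seen formulation: $P_{ik} = x_{ik} / \sum_i x_{ik}$, $E_k = -\frac{1}{\ln m} \sum_i P_{ik} \ln P_{ik}$ for $m$ samples, and $A_k = (1 - E_k) / \sum_j (1 - E_j)$. The per-sample score used here, $\sum_k A_k P_{ik}$, is likewise an assumption of this sketch and is not the expression stated in the text. Input values are assumed to be non-negative, with larger values counting as better.

```python
import math

def entropy_weights(matrix):
    """Compute proportions P[i][k], entropies E[k] and weights A[k].

    `matrix` is a list of samples, each a list of non-negative index values.
    """
    m, n_idx = len(matrix), len(matrix[0])
    col_sums = [sum(row[k] for row in matrix) for k in range(n_idx)]
    P = [[row[k] / col_sums[k] for k in range(n_idx)] for row in matrix]

    E = []
    for k in range(n_idx):
        # Terms with p == 0 contribute nothing (0 * ln 0 is taken as 0).
        s = sum(P[i][k] * math.log(P[i][k]) for i in range(m) if P[i][k] > 0)
        E.append(-s / math.log(m))

    spread = [1.0 - e for e in E]
    total = sum(spread)
    if total == 0:
        raise ValueError("all indices are uniform; weights are undefined")
    A = [d / total for d in spread]
    return P, E, A

# Made-up evaluation matrix: 2 samples, 2 indices; index 0 is uniform.
matrix = [[1.0, 1.0], [1.0, 3.0]]
P, E, A = entropy_weights(matrix)
scores = [sum(A[k] * P[i][k] for k in range(len(A))) for i in range(len(matrix))]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0]
```

In this toy matrix the uniform index receives essentially zero weight, so the ranking is decided entirely by the second index.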
It should be noted that the evaluation indexes of the principal component analysis method may coincide with the dimensions extracted in advance. In that case, when abnormal values are screened in step 1, the ratio of each sample's data value in the current dimension to the average of all samples' data values in that dimension may be calculated at the same time.
Compared with the prior art, because step 1 already calculates the proportion of each data value within a dimension while applying the boxplot outlier method or the two-eight law method, the proportion $P_{ik}$ of the score of the $i$-th sample under the $k$-th evaluation index can be obtained quickly whenever a dimension extracted in step 1 coincides with the evaluation index $k$ of this step. This simplifies the calculation and improves the calculation speed.
Preferably, the preferred data set is acquired in the following manner: selecting the top 20% of samples from the sorting result; or, selecting the last 20% of samples from the sorting result; alternatively, the sample is selected based on expert opinion.
In step 2 of the invention, after the samples have been ranked by the above method, they can be sorted and extracted based on the size of the principal component characteristic value. The extracted samples may be the 20% with the highest characteristic values or the 20% with the lowest; of course, samples corresponding to other numbers and sizes of characteristic values may also be selected by prior-art methods according to expert experience. The selection method is not described in detail here.
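Selecting the preferred data set from the ranking can be sketched as below. The function name `select_preferred`, its parameters, and the rounding rule for the sample count are assumptions; the text only states that the top or bottom 20% may be taken.

```python
def select_preferred(scores, fraction=0.2, end="bottom"):
    """Return the indices of the top or bottom `fraction` of samples by score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: round to the nearest whole sample, keeping at least one.
    count = max(1, round(len(scores) * fraction))
    chosen = ranked[:count] if end == "top" else ranked[-count:]
    return set(chosen)

# 90 made-up scores, so that 20% corresponds to 18 samples.
scores = [float(i) for i in range(90)]
bottom = select_preferred(scores, end="bottom")
top = select_preferred(scores, end="top")
print(len(bottom), len(top))  # 18 18
```

For 90 samples this yields 18 samples, matching the size of the preferred data set in the worked example of the description.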
Step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.
After step 1 and step 2 are completed, step 3 combines the abnormal samples of each dimension obtained in step 1 with the preferred data set obtained in step 2.
Preferably, in step 3, the method for calculating the abnormal frequency of each sample in the preferred data set comprises:
step 3.1, based on the outliers $[Z_1, Z_2, \dots, Z_n, \dots, Z_N]$ of all samples in each dimension acquired in step 1 and the preferred data set $P$ acquired in step 2, calculating the outliers $[E_1, E_2, \dots, E_n, \dots, E_N]$ of the preferred data set under each dimension $n$, where $n = 1, 2, \dots, N$; step 3.2, taking the union of the outliers $E_n$ over all the different dimensions to obtain an ordinary abnormal sample set; step 3.3, taking the intersection of the outliers $E_n$ over all the different dimensions to obtain a key abnormal sample set.
Specifically, the present invention may first intersect the preferred data set $P$ with the outliers of each dimension to obtain $[E_1, E_2, \dots, E_n, \dots, E_N]$. Then the union of all the outliers $E_n$ is taken, giving every sample in which an abnormal value occurs at least once. For these samples the abnormal value frequency may be exactly 1, or may be greater than 1 and meet the requirement of the preset standard. The sample set obtained here may be called the ordinary abnormal sample set, and in practice its abnormalities can be handled with lower priority.
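Steps 3.1 and 3.2 map directly onto set operations. The sketch below represents each $Z_n$ and $P$ as a Python set of sample serial numbers; the function names and the example serial numbers are made up for illustration.

```python
def anomalies_in_preferred(Z, P):
    """Step 3.1: E_n is the intersection of Z_n with P, for every dimension n."""
    return [z & P for z in Z]

def ordinary_anomaly_set(E):
    """Step 3.2: union of E_n over all dimensions."""
    return set().union(*E)

# Made-up serial numbers: three dimensions and a four-sample preferred set.
Z = [{1, 2, 7}, {30}, {2, 5}]
P = {1, 2, 3, 5}
E = anomalies_in_preferred(Z, P)
ordinary = ordinary_anomaly_set(E)
print(E)         # [{1, 2}, set(), {2, 5}]
print(ordinary)  # {1, 2, 5}
```

Sample 3 is in the preferred data set but is abnormal in no dimension, so it does not enter the ordinary abnormal sample set.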
In addition, in the invention, multiple dimensions can be arbitrarily selected from N dimensions for intersection operation, so that a key abnormal sample set is obtained, and the processing priority is relatively high.
Preferably, in step 3.3, based on the frequency M of the abnormal value specified in the preset standard, a dimension number equal to the number of the specified frequency M of the abnormal value is selected, and an intersection is taken for a plurality of abnormal values under the dimension number, so as to obtain a key abnormal sample set.
Specifically, different abnormal value frequencies can be set in the preset standard; that is, when a sample has abnormal data items in several different dimensions at once, the sample is considered to have a major abnormality.
Preferably, the method for intersecting the plurality of abnormal values under the dimension number is to select any $M$ dimensions from the $N$ dimensions and, for each of the $\binom{N}{M}$ possible selections, take the intersection of the outliers $E_n$ of the selected dimensions; the samples included in these intersections are the samples in the key abnormal sample set.
With the method of the present invention, dimensions must be selected for $\binom{N}{M}$ different combinations, so the value of $M$ should not be too large. In one embodiment of the invention, $M$ is 2.
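The combination-based selection can be sketched with `itertools.combinations`. The first function follows the description literally, intersecting the outlier sets of every choice of M dimensions and collecting the results. The second function is an equivalent shortcut added here for comparison, not taken from the text: a sample lies in some M-wise intersection exactly when it is abnormal in at least M dimensions, so counting occurrences avoids enumerating the combinations. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb

def key_anomaly_set(E, M):
    """Collect samples found in the intersection of any M of the sets in E."""
    if M < 1:
        raise ValueError("M must be at least 1")
    key = set()
    for chosen in combinations(E, M):
        key |= set.intersection(*chosen)
    return key

def key_anomaly_set_by_count(E, M):
    """Same result via counting: samples abnormal in at least M dimensions."""
    counts = Counter(sample for e in E for sample in e)
    return {sample for sample, c in counts.items() if c >= M}

# Made-up outlier sets for N = 4 dimensions, with M = 2 as in the embodiment.
E = [{1, 2}, set(), {2, 5}, {1, 2, 5}]
M = 2
print(comb(len(E), M))                 # 6 combinations to intersect
print(key_anomaly_set(E, M))           # {1, 2, 5}
print(key_anomaly_set_by_count(E, M))  # {1, 2, 5}
```

The number of combinations grows quickly with M and N, which is consistent with the remark that M should not be too large; the counting variant sidesteps that growth.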
Preferably, the predetermined criteria generates a specified frequency of outliers based on the screening requirements of the data set samples.
In addition, the value of M also needs to be set according to the actual screening needs, and in the invention, the value of M can be manually set or generated according to an algorithm.
Hereinafter, the screening method is described taking station area collected data in an electric power system as an example, with the aim of screening out poorly performing station areas.
The original data set is shown in Table 1. The column names in the table header are: serial number, unit, station area name, commissioning date, per-household distribution transformer capacity, heavy-load abnormal count, overload abnormal count, low-voltage abnormal count, critical defects, power outage count, average load rate, and acquisition success rate. The data set comprises a total of 90 data samples.
Figure BDA0003498978030000073
TABLE 1 original data set
In step 1 of the invention, 9 dimensions are extracted from the original data set: the commissioning date, the per-household distribution transformer capacity, the heavy-load abnormal count, the overload abnormal count, the low-voltage abnormal count, critical defects, the power outage count, the average load rate, and the acquisition success rate. Abnormal values are then screened out in each dimension.
Table 2 shows the sample set corresponding to the outliers $Z_1$, obtained using the boxplot outlier algorithm with the commissioning date taken as the current dimension.
Figure BDA0003498978030000081
TABLE 2 Sample set corresponding to abnormal values $Z_1$
By a similar method, the sample outliers $Z_2$, $Z_3$, …, $Z_N$ under the other dimensions can be calculated.
In step 2, principal component analysis is performed on the 90 samples to obtain a comprehensive ranking, and the station areas ranked in the last 20% are then taken, giving the preferred data set in Table 3.
Figure BDA0003498978030000091
Figure BDA0003498978030000101
Figure BDA0003498978030000111
TABLE 3 preferred data set
By the above method, the 18 worst-ranked station area samples are selected from the 90 samples, giving the preferred data set $P$. Intersecting the preferred data set with the sample outliers $[Z_1, Z_2, \dots, Z_n, \dots, Z_N]$ of each dimension gives the outliers $[E_1, E_2, \dots, E_n, \dots, E_N]$ of the preferred data set under each dimension $n$.
Table 4 shows the sample set corresponding to the outliers $E_1$, which is obtained by intersecting $Z_1$ with $P$.
Figure BDA0003498978030000112
Figure BDA0003498978030000121
TABLE 4 Sample set corresponding to abnormal values $E_1$
In one embodiment of the invention, the intersection of $Z_2$ with $P$ is empty, and the intersection of $Z_4$ with $P$ is empty. The intersections of $Z_3$, $Z_5$, $Z_6$, $Z_7$, $Z_8$, and $Z_9$ with $P$ are shown in Tables 5 to 10, respectively.
Figure BDA0003498978030000122
Figure BDA0003498978030000131
TABLE 5 Intersection of $Z_3$ with $P$
Figure BDA0003498978030000132
TABLE 6 Intersection of $Z_5$ with $P$
Figure BDA0003498978030000133
Figure BDA0003498978030000141
Figure BDA0003498978030000151
TABLE 7 Intersection of $Z_6$ with $P$
Figure BDA0003498978030000152
Figure BDA0003498978030000161
TABLE 8 Intersection of $Z_7$ with $P$
Figure BDA0003498978030000162
TABLE 9 Intersection of $Z_8$ with $P$
Figure BDA0003498978030000163
TABLE 10 Intersection of $Z_9$ with $P$
If the method of step 3.2 is used and the union of the outliers $E_n$ over all the different dimensions is taken, the ordinary abnormal sample set in Table 11 is obtained.
Figure BDA0003498978030000171
Figure BDA0003498978030000181
Figure BDA0003498978030000191
Figure BDA0003498978030000201
TABLE 11 Ordinary abnormal sample set
In this embodiment, the samples of station areas whose abnormal frequency is 2 or more are selected as members of the key abnormal sample set, giving the key abnormal sample set in Table 12.
Figure BDA0003498978030000202
Figure BDA0003498978030000211
Figure BDA0003498978030000221
TABLE 12 Key abnormal sample set
When the preset frequency is 2, the key abnormal sample set selected by the method of the invention from the 18 samples of the preferred data set $P$ comprises samples 1 to 3, 6, 8 to 10, and 15 to 18, and the remaining 7 samples are screened out as ordinary abnormal samples.
According to a preset rule, the station areas corresponding to samples 1 to 3, 6, 8 to 10, and 15 to 18 can be listed as priority investment areas, and the station areas corresponding to the other 7 samples as ordinary investment areas.
In a second aspect, the present invention relates to an apparatus for screening data set samples, wherein the apparatus includes a processor for implementing a method for screening data set samples as described in the first aspect of the present invention.
Compared with the prior art, the method and system for screening data set samples have the advantage that dimension extraction and abnormal value screening can be performed on the data set, and the final sample screening can be carried out according to the abnormal frequency of the samples. The method is simple and, while reducing the amount of computation, exposes the association among abnormal values, so that the screened samples are more accurate.
The applicant has described and illustrated embodiments of the present invention in detail with reference to the accompanying drawings. Those skilled in the art should understand, however, that the above embodiments are only preferred embodiments of the present invention, and that the detailed description is intended only to help the reader better understand the spirit of the invention, not to limit its scope. On the contrary, any modification or variation based on the spirit of the present invention falls within the scope of the present invention.

Claims (10)

1. A method for screening data set samples, characterized in that the method comprises the following steps:
step 1, performing multi-dimensional extraction on the data set, and performing outlier screening on all samples in the data set for each dimension;
step 2, sorting all samples in the data set, and obtaining a preferred data set based on the sorting result;
step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.
2. The method for screening data set samples according to claim 1, characterized in that:
in step 1,
after the multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3;
the N dimensions are extracted based on the column names shared by all samples of the data set.
3. The method for screening data set samples according to claim 2, characterized in that:
the outlier screening method is a boxplot outlier screening method or a method based on the 80/20 rule (the "two-eight law").
4. The method for screening data set samples according to claim 3, characterized in that:
in step 2,
the sorting method comprises principal component analysis and the entropy weight method.
5. The method for screening data set samples according to claim 4, characterized in that:
the preferred data set is obtained in one of the following ways:
selecting the top 20% of samples from the sorting result;
or selecting the bottom 20% of samples from the sorting result;
or selecting samples based on expert opinion.
6. The method for screening data set samples according to claim 5, characterized in that:
in step 3,
the abnormal frequency of each sample in the preferred data set is calculated as follows:
step 3.1, based on the outliers $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ of all samples in each dimension acquired in step 1 and the preferred data set $P$ acquired in step 2, calculating the outliers $[E_1, E_2, \ldots, E_n, \ldots, E_N]$ of the preferred data set in each dimension $n$, where $n = 1, 2, \ldots, N$;
step 3.2, taking the union of the outliers $E_n$ over all dimensions to obtain a common abnormal sample set;
step 3.3, taking the intersection of the outliers $E_n$ over all dimensions to obtain a key abnormal sample set.
7. The method for screening data set samples according to claim 6, characterized in that:
in step 3.3,
based on the abnormal value frequency M specified in the preset standard, a number of dimensions equal to M is selected, and the intersection of the outliers in the selected dimensions is taken to obtain the key abnormal sample set.
8. The method for screening data set samples according to claim 7, characterized in that:
the intersection of the outliers in the selected dimensions is taken by selecting any M dimensions from the N dimensions and, for each of the
Figure FDA0003498978020000021
selections, taking the intersection of the outliers $E_n$ of the selected dimensions;
the samples contained in these intersections are the samples of the key abnormal sample set.
9. The method for screening data set samples according to claim 8, characterized in that:
the abnormal value frequency specified in the preset standard is determined by the screening requirements for the data set samples.
10. An apparatus for screening data set samples, characterized in that:
the apparatus comprises a processor,
and the processor is configured to implement the method for screening data set samples according to any one of claims 1 to 9.
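As an illustration only, steps 1 and 3 of the claimed method might be sketched in Python as follows. The sketch assumes the common 1.5 × IQR boxplot fence, assumes that E_n is obtained by restricting Z_n to the samples of P, and assumes that the key abnormal sample set is the union of the intersections taken over every choice of M dimensions; the claims do not fix these details. The sorting of step 2 is not implemented, so the preferred data set P is supplied directly. All function names, dimension names and data values are hypothetical.

```python
from itertools import combinations
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the sample IDs whose value lies outside the boxplot fences.

    `values` maps sample ID -> value for one dimension. The fence factor
    k = 1.5 is the usual boxplot convention and is an assumption here.
    """
    q1, _, q3 = quantiles(list(values.values()), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {sid for sid, v in values.items() if v < low or v > high}


def screen(data, preferred, m):
    """Return (common, key) abnormal sample sets for the preferred set P."""
    if not 1 <= m <= len(data):
        raise ValueError("m must be between 1 and the number of dimensions")
    # Step 1: outliers Z_n of all samples in each dimension n.
    z = {dim: boxplot_outliers(vals) for dim, vals in data.items()}
    # Step 3.1 (assumed): E_n is Z_n restricted to the preferred data set P.
    e = {dim: ids & preferred for dim, ids in z.items()}
    # Step 3.2: union of E_n over all dimensions.
    common = set().union(*e.values())
    # Step 3.3 with frequency M (assumed reading): union of the
    # intersections over every choice of M dimensions.
    key = set()
    for dims in combinations(sorted(e), m):
        key |= set.intersection(*(e[d] for d in dims))
    return common, key


# Hypothetical data set: 8 samples, N = 3 dimensions.
data = {
    "load":   {1: 10, 2: 11, 3: 12, 4: 13, 5: 14, 6: 15, 7: 16, 8: 100},
    "loss":   {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: 8, 8: 90},
    "faults": {1: 50, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4},
}
preferred = {1, 7, 8}  # stands in for the result of step 2

common, key = screen(data, preferred, m=2)
print("common abnormal samples:", sorted(common))
print("key abnormal samples:", sorted(key))
```

In this hypothetical data, sample 8 is extreme in two dimensions and sample 1 in one, so with M = 2 only sample 8 reaches the key abnormal sample set while both appear in the common abnormal sample set.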
CN202210122374.5A 2022-02-09 2022-02-09 Screening method and system of data set samples Pending CN114462537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210122374.5A CN114462537A (en) 2022-02-09 2022-02-09 Screening method and system of data set samples

Publications (1)

Publication Number Publication Date
CN114462537A true CN114462537A (en) 2022-05-10

Family

ID=81414014

Country Status (1)

Country Link
CN (1) CN114462537A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190294485A1 (en) * 2018-03-22 2019-09-26 Microsoft Technology Licensing, Llc Multi-variant anomaly detection from application telemetry
CN111902805A (en) * 2018-03-22 2020-11-06 微软技术许可有限责任公司 Multivariate anomaly detection based on application telemetry
CN111274543A (en) * 2020-01-17 2020-06-12 北京空间飞行器总体设计部 Spacecraft system anomaly detection method based on high-dimensional space mapping
CN112505549A (en) * 2020-11-26 2021-03-16 西安电子科技大学 New energy automobile battery abnormity detection method based on isolated forest algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘当武等: "配电网运行管理深化监测分析", 电力与能源, no. 04 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination