CN114462537A

CN114462537A - Screening method and system of data set samples

Info

Publication number: CN114462537A
Application number: CN202210122374.5A
Authority: CN
Inventors: 王波; 罗杨; 候小娥; 杨文华; 郭飞; 万鹏; 肖清明; 史磊; 魏文婷; 张治民; 王晓康; 刘凤星
Original assignee: State Grid Ningxia Electric Power Co Wuzhong Power Supply Co; State Grid Ningxia Electric Power Co Ltd
Current assignee: State Grid Ningxia Electric Power Co Wuzhong Power Supply Co; State Grid Ningxia Electric Power Co Ltd
Priority date: 2022-02-09
Filing date: 2022-02-09
Publication date: 2022-05-10

Abstract

A method for screening samples of a data set, the method comprising the following steps: step 1, performing multi-dimensional extraction on the data set, and performing outlier screening on all samples in the data set for each dimension; step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result; and step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard. The method is simple and, while reducing the amount of computation, captures the correlation among abnormal values, so that the screened samples are more accurate.

Description

Screening method and system of data set samples

Technical Field

The invention relates to the field of data processing, in particular to a method and a system for screening a data set sample.

Background

In the field of data processing, screening the samples of a data set is an initial step of data modeling and subsequent processing. It is a necessary part of data mining, simplifies data processing algorithms, reduces the amount of computation, and allows key data of potential value to be extracted from massive data. It therefore plays a vital role in data processing.

In the prior art, screening methods for data set samples are relatively simple and generally include three processes: data extraction, data cleaning and data loading. The actual processing of the originally collected data takes place in data cleaning, which commonly includes four parts: missing data processing, duplicate data processing, abnormal data processing and inconsistent data sorting. Specifically, missing data processing and inconsistent data sorting can only delete original data that is obviously abnormal, and cannot effectively process and screen the more concealed abnormal data produced during data acquisition.

Furthermore, the abnormal data processing adopted in the prior art is also simple: usually only the abnormal data itself is analyzed and the abnormal values are eliminated. In the data processing of a power system, however, a data sample contains not only an abnormal value but also other collected data values closely related to that abnormal value. As a result, the simple prior-art methods for removing or screening abnormal values cannot meet the requirements of complex data processing in the power system.

In addition, the prior art cannot screen a single data sample according to the frequency with which abnormal values occur in it. In other words, although the prior art can screen out a plurality of single abnormal values that are unrelated to each other, it cannot obtain the plurality of mutually related abnormal values of one data sample, nor accurately screen samples according to how severely each data sample is abnormal.

To address these problems, the invention provides a method and a system for screening data set samples.

Disclosure of Invention

In order to solve the defects in the prior art, the invention aims to provide a method and a system for screening a data set sample, which are used for carrying out dimension extraction and abnormal value screening on a data set and realizing sample screening aiming at the abnormal frequency of the sample.

The invention adopts the following technical scheme.

In a first aspect, the invention relates to a method for screening samples of a data set, wherein the method comprises the following steps: step 1, performing multi-dimensional extraction on the data set, and screening abnormal values of all samples in the data set for each dimension; step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result; and step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.

Preferably, in step 1, after multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3; the N dimensions are extracted based on column names shared by all samples of the data set.

Preferably, the outlier screening method is a box plot outlier screening method or a method based on the 80/20 rule (the "two-eight law", also known as the Pareto principle).

Preferably, in step 2, the ranking method is a principal component analysis method and an entropy weight method.

Preferably, the preferred data set is obtained by: selecting the top 20% of samples from the sorting result; or, selecting the last 20% of samples from the sorting result; alternatively, the sample is selected based on expert opinion.

Preferably, in step 3, the method for calculating the abnormal frequency of each sample in the preferred data set comprises: step 3.1, based on the abnormal values $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ of all samples in each dimension obtained in step 1 and the preferred data set P obtained in step 2, calculating the abnormal values $[E_1, E_2, \ldots, E_n, \ldots, E_N]$ of the preferred data set in each dimension n, where n = 1, 2, …, N; step 3.2, taking the union of the abnormal values $E_n$ in all the different dimensions to obtain a common abnormal sample set; step 3.3, taking the intersection of the abnormal values $E_n$ in different dimensions to obtain a key abnormal sample set.

Preferably, in step 3.3, based on the abnormal value frequency M specified in the preset standard, a number of dimensions equal to M is selected, and the intersection of the abnormal values under the selected dimensions is taken to obtain the key abnormal sample set.

Preferably, the intersection of the abnormal values under the selected dimensions is taken by selecting any M dimensions from the N dimensions and, over the $C_N^M$ possible selections, taking the intersection of the abnormal values $E_n$ of the selected dimensions; the samples included in the intersections are the samples in the key abnormal sample set.

Preferably, the preset standard generates the specified abnormal value frequency based on the screening requirements of the data set samples.

In a second aspect, the present invention relates to an apparatus for screening samples of a data set, wherein the apparatus comprises a processor for implementing a method for screening samples of a data set as described in the first aspect of the present invention.

Compared with the prior art, the method and the system for screening data set samples have the following advantages: dimension extraction and abnormal value screening can be performed on the data set, and the final sample screening is based on the abnormal frequency of each sample. The method is simple and, while reducing the amount of computation, captures the correlation among abnormal values, so that the screened samples are more accurate.

Drawings

FIG. 1 is a schematic diagram of the steps of a method for screening a data set sample according to the present invention.

Detailed Description

The present application is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.

FIG. 1 is a schematic diagram of the steps of a method for screening a data set sample according to the present invention. As shown in fig. 1, a first aspect of the present invention relates to a method for screening a data set sample, wherein the method comprises steps 1 to 3.

Step 1, carrying out multi-dimensional extraction on a data set, and carrying out outlier screening on all samples in the data set aiming at each dimension.

It will be appreciated that, before the data set is generated, the data may first be pre-processed to screen out unnecessary content. For example, when the relevant equipment collects data at a higher frequency than the method of the invention requires, the irrelevant data can be screened out in advance. This pre-screening can use a method generally used in the art.

In addition, the data sets of the present invention may come from a number of different sources. These multiple different sources may include multiple different acquisition devices, or multiple different data set processing devices. Therefore, a plurality of partial data tables can be obtained in advance, and the plurality of partial data tables are combined in advance in a manner commonly used in the prior art to generate the data set in the invention.

It should be noted that, because the different acquisition devices or data set processing devices (for example, processors) differ in type, specification, configuration and so on, the contents of the partial data tables may not be identical. For example, some data tables may include more column names and others fewer. When the data tables are merged, the column names of all the partial data tables can be combined. However, where a data sample comes from a partial data table that lacks a column name, the content filled in for that column should be marked, so that subsequent processing does not identify the filled-in content as an abnormal item.
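The merging rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `merge_tables`, the marker `FILLED` and the representation of each row as a dictionary are all hypothetical choices made here.

```python
# Hypothetical marker for cells that were filled in during merging, so that
# later outlier screening can skip them instead of flagging them as abnormal.
FILLED = "__FILLED__"


def merge_tables(tables):
    """Merge partial tables (lists of row dicts) over the union of column names."""
    # Collect every column name, keeping first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild each row over the full column list, marking the gaps.
    merged = [
        {name: row.get(name, FILLED) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, merged


table_a = [{"id": 1, "load": 0.4, "outages": 2}]
table_b = [{"id": 2, "load": 0.7}]
columns, merged = merge_tables([table_a, table_b])
```

In this sketch the second row receives `FILLED` in the `outages` column, so a later screening step can tell a filled-in cell apart from a collected value.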

In addition, the method of the present invention can be applied to the processing of various data. In other words, the method does not limit the content of the data at all, nor does it place many constraints on the association between different data samples. However, every data sample should include at least a plurality of data contents, hereinafter also referred to as data dimensions, and these dimensions should be the same for all samples.

In one embodiment of the present invention, the data samples of the data set are basic data collected from different station areas in the power system. Specifically, the data content includes the unit of the station area, the station area name, the commissioning date of the station area, the per-household distribution transformer capacity, the number of heavy-load anomalies, the number of overload anomalies, the number of low-voltage anomalies, critical defects, the number of power outages, the average load rate, the acquisition success rate and other related content. Of course, the partial data tables generated by the different acquisition devices or processing devices may further include other content of the sample data, which may or may not be associated with the data set of the present invention. In the present invention, however, all samples in the data set should include all of the dimensions listed above, from the unit of the station area through the acquisition success rate.

Preferably, in step 1, after multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3; the N dimensions are extracted based on column names shared by all samples of the data set.

In other words, in the present invention, a single sample in the data set may include more than N data contents, and in the data table the single sample may accordingly be displayed with more than N column names. However, all samples in the data set should include data for each of the N column names represented by the extracted dimensions, and all of that data should be valid content that was actually collected.

It can be seen that in the process of extracting multiple dimensions, extraction from all samples of the data set should also be performed according to this principle. In order to effectively achieve the purpose of sample screening in the present invention, and the subsequent calculation of step 2 and step 3, the number of dimensions obtained in this step should be at least 3.
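One way to picture this extraction rule is the following Python sketch. It is only an illustration: the function name `extract_dimensions`, the `MISSING` tuple and the dictionary-based rows are assumptions introduced here, and the sketch does not decide which shared columns are meaningful to screen.

```python
# Assumed representations of a value that was not actually collected.
MISSING = (None, "")


def extract_dimensions(rows, min_dims=3):
    """Return the column names that hold valid data in every sample."""
    if not rows:
        raise ValueError("the data set is empty")
    dims = [
        name
        for name in rows[0]
        if all(name in row and row[name] not in MISSING for row in rows)
    ]
    # The text requires at least 3 dimensions for steps 2 and 3.
    if len(dims) < min_dims:
        raise ValueError(f"only {len(dims)} shared dimensions, need {min_dims}")
    return dims


rows = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 1, "b": 2, "c": 3, "d": None},
    {"a": 0, "b": 2, "c": 3, "e": 5},
]
dims = extract_dimensions(rows)
```

Column `d` is dropped because one sample has no collected value for it, and column `e` is dropped because it does not appear in every sample.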

Preferably, the outlier screening method is a box plot outlier screening method or a method based on the 80/20 rule (the "two-eight law").

It should be noted that, after the multiple dimensions have been extracted, abnormal values are screened out from all samples for each dimension. The screening method may be the box plot outlier screening method, the 80/20 rule ("two-eight law") method, or another abnormal value screening method commonly used in the prior art.

In the present invention, for example, the box plot outlier screening method is adopted and applied N times, once per dimension, so that for each dimension in turn it is determined whether the data value of each data sample in that dimension is abnormal.

It should be noted that, in general, both the box plot outlier method and the 80/20 rule analyze the overall trend of the values of all samples in the current dimension and screen out the small portion of data that deviates from that trend. Of course, in the present invention the abnormal range for a given dimension may also be set according to the nature of the extracted data itself. For example, suppose the acquisition success rate is the current dimension. For this item, the higher the value, the better, so an acquisition success rate of 100 should not be treated as abnormal data. In practice, however, consider a data set of 50 samples in which 48 samples have an acquisition success rate of 98.5, one sample has a rate of 100 and another has a rate of 97. If only the box plot outlier method or the 80/20 rule is considered, the outliers are the two samples with rates of 100 and 97. In such cases, parameters of the box plot outlier method such as the upper limit, the lower limit and the median should therefore be preset according to the attributes of the selected dimension, and the parameters of the 80/20 rule algorithm should likewise be preset based on the attributes of the dimension.
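The acquisition success rate example can be reproduced with a short Python sketch. The function name `boxplot_outliers`, the conventional whisker factor `k = 1.5` and the optional `lower` and `upper` presets are assumptions made for illustration; the source does not specify how the box plot limits are computed.

```python
import statistics


def boxplot_outliers(values, k=1.5, lower=None, upper=None):
    """Return the indices of samples outside the box plot limits.

    `lower` and `upper` are optional preset limits that override the limits
    computed from the quartiles (an assumed way to encode dimension-specific
    knowledge such as "higher is always better").
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - k * iqr if lower is None else lower
    high = q3 + k * iqr if upper is None else upper
    return {i for i, v in enumerate(values) if v < low or v > high}


# 48 samples at 98.5, one at 100 and one at 97, as in the example above.
rates = [98.5] * 48 + [100.0, 97.0]

naive = boxplot_outliers(rates)                         # no presets
preset = boxplot_outliers(rates, upper=float("inf"))    # no upper limit
```

Without presets both quartiles equal 98.5, so the samples at 100 and 97 (indices 48 and 49) are both flagged. Removing the upper limit for this dimension leaves only the sample at 97 flagged, which matches the intent described in the text.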

By the method in step 1, the invention first obtains the sample abnormal values $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ for each dimension of the data set. Here N is the total number of selected dimensions, and $Z_n$ contains the related information, such as the sample serial numbers, of all samples that have an abnormal value in the current dimension n. For the convenience of subsequent calculations, $Z_n$ need not contain the abnormal values themselves.

Step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result.

It should be noted that the ranking in the present invention can be implemented with reference to the principal component analysis method or the entropy weight method of the prior art. Specifically, each of the dimensions selected in the present invention, as well as dimensions not selected, may be used as an evaluation index to generate a comprehensive evaluation matrix. Meanwhile, the different samples in the data set are used as principal components, and the prior-art principal component analysis method is used to calculate the principal component characteristic value of each sample. The samples are then sorted by the size of the principal component characteristic value to obtain their ranking.

In the present invention, the magnitude of the principal component characteristic value can be calculated using the entropy weight method. For example, the proportion $P_{ik}$ of the score of the i-th sample under the k-th evaluation index may be solved first, and then the entropy $E_k$ and the weight $A_k$ of the k-th evaluation index may be calculated, thereby generating the characteristic value $\sum E_k A_k$ of the i-th sample.
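For orientation, the quantities named above can be computed as in the following Python sketch of the entropy weight method in its commonly used form. Several points are assumptions rather than statements of the source: the function name `entropy_weight_scores`, the requirement that all index values are positive, and in particular the per-sample score, which is taken here as the weighted sum of proportions $\sum_k A_k P_{ik}$ because the source does not spell out how the per-sample value is assembled.

```python
import math


def entropy_weight_scores(matrix):
    """Entropy weight method on a samples-by-indices matrix of positive values.

    Returns (p, entropy, weights, scores), where p[i][k] is the proportion of
    sample i under index k, entropy[k] and weights[k] belong to index k, and
    scores[i] is the assumed composite score of sample i.
    """
    m = len(matrix)       # number of samples, assumed >= 2
    n = len(matrix[0])    # number of evaluation indices
    col_sums = [sum(row[k] for row in matrix) for k in range(n)]
    p = [[row[k] / col_sums[k] for k in range(n)] for row in matrix]

    entropy = []
    for k in range(n):
        s = sum(p[i][k] * math.log(p[i][k]) for i in range(m) if p[i][k] > 0)
        entropy.append(-s / math.log(m))

    # An index whose values barely differ has entropy near 1 and weight near 0.
    divergence = [1.0 - e for e in entropy]
    total = sum(divergence)
    weights = [d / total for d in divergence] if total > 0 else [1.0 / n] * n

    # Assumed composite score: weighted sum of the proportions.
    scores = [sum(weights[k] * p[i][k] for k in range(n)) for i in range(m)]
    return p, entropy, weights, scores


p, entropy, weights, scores = entropy_weight_scores([[1, 2], [1, 4], [1, 6]])
```

In this small example the first index is identical for all samples, so nearly all of the weight falls on the second index and the scores follow its ordering.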

It should be noted that the evaluation index of the principal component analysis method may coincide with a plurality of dimensions extracted in advance, and in step 1, when the abnormal value screening is performed, the proportion of the data value of each sample in the current dimension to the average value of the data values of all samples in the current dimension may be calculated at the same time.

Compared with the prior art, in step 1 the proportion of each data value in a given dimension is calculated in advance, during the box plot outlier method or the 80/20 rule method, for every dimension. Therefore, when a dimension extracted in step 1 coincides with an evaluation index k of this step, the proportion $P_{ik}$ of the score of the i-th sample under the k-th evaluation index can be obtained quickly, which simplifies the calculation and improves the calculation speed.

Preferably, the preferred data set is acquired in the following manner: selecting the top 20% of samples from the sorting result; or, selecting the last 20% of samples from the sorting result; alternatively, the sample is selected based on expert opinion.

In step 2 of the invention, after ranking by the above method, samples are extracted based on the size of the principal component characteristic value. The extracted data may be the 20% of samples with the highest characteristic values or the 20% with the lowest characteristic values. Of course, samples corresponding to a different number or different sizes of characteristic values can also be selected by prior-art methods according to expert experience. The selection method is not described in detail here.
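The top or last 20% selection can be sketched in Python as follows. The function name `select_preferred`, the `position` argument and the convention that the ranking runs from the highest to the lowest characteristic value are assumptions made here for illustration.

```python
def select_preferred(scores, fraction=0.2, position="last"):
    """Pick the first or last `fraction` of samples from a descending ranking.

    `scores` maps a sample identifier to its characteristic value.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest first
    count = max(1, round(len(ranked) * fraction))
    if position == "top":
        return set(ranked[:count])
    if position == "last":
        return set(ranked[-count:])
    raise ValueError("position must be 'top' or 'last'")


# 90 hypothetical samples whose score equals their identifier.
scores = {i: float(i) for i in range(90)}
last = select_preferred(scores, position="last")
top = select_preferred(scores, position="top")
```

With 90 samples and a fraction of 20%, either choice yields 18 samples.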

Step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.

After steps 1 and 2 are completed, step 3 combines the abnormal samples of each dimension obtained in step 1 with the preferred data set obtained in step 2.

Preferably, in step 3, the method for calculating the abnormal frequency of each sample in the preferred data set comprises:

step 3.1, based on the abnormal values $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ of all samples in each dimension obtained in step 1 and the preferred data set P obtained in step 2, calculating the abnormal values $[E_1, E_2, \ldots, E_n, \ldots, E_N]$ of the preferred data set in each dimension n, where n = 1, 2, …, N; step 3.2, taking the union of the abnormal values $E_n$ in all the different dimensions to obtain a common abnormal sample set; step 3.3, taking the intersection of the abnormal values $E_n$ in different dimensions to obtain a key abnormal sample set.

Specifically, the present invention may first take the intersection of the preferred data set P with the abnormal values of each of the multiple dimensions, to obtain $[E_1, E_2, \ldots, E_n, \ldots, E_N]$. Then the union of all the abnormal values $E_n$ is taken, thereby obtaining all samples in which an abnormal value occurs at least once. For these samples, the frequency of abnormal values may be only 1, or may be greater than 1 and meet the requirement of the preset standard. The sample set obtained here may be referred to as the common abnormal sample set, and the anomalies in this set can in practice be handled with a lower priority.
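Steps 3.1 and 3.2 are plain set operations and can be sketched in Python as follows. The function names and the toy sample numbers are hypothetical; each $Z_n$ is assumed to be stored as a set of sample serial numbers, as described for step 1.

```python
def restrict_to_preferred(z_sets, preferred):
    """Step 3.1: E_n is the intersection of Z_n with the preferred set P."""
    return [z & preferred for z in z_sets]


def common_abnormal(e_sets):
    """Step 3.2: the union of all E_n gives the common abnormal sample set."""
    return set().union(*e_sets)


# Toy data: three dimensions and a preferred set of four samples.
z_sets = [{1, 2, 3}, {2, 9}, {4}]
preferred = {2, 3, 4, 5}

e_sets = restrict_to_preferred(z_sets, preferred)
common = common_abnormal(e_sets)
```

Sample 5 is in the preferred set but abnormal in no dimension, so it does not appear in the common abnormal sample set.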

In addition, in the invention, multiple dimensions can be arbitrarily selected from N dimensions for intersection operation, so that a key abnormal sample set is obtained, and the processing priority is relatively high.

Specifically, different abnormal value frequencies can be set in the preset standard. That is, when the data items of a sample are abnormal in several different dimensions at once, the sample is considered to have a major abnormality.

With the method of the present invention, dimensions must be selected under $C_N^M$ different combinations, so the value of M should not be too large. In an embodiment of the invention, M is 2.

In addition, the value of M also needs to be set according to the actual screening needs, and in the invention, the value of M can be manually set or generated according to an algorithm.
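The selection of M dimensions out of N can be sketched in Python with the standard library. The function names `key_abnormal` and `abnormal_frequency` are hypothetical, and pooling the intersections of all combinations into one set is an assumption about how the individual intersections are combined. Under that assumption the result equals the set of samples that are abnormal in at least M dimensions, which the sketch checks by counting.

```python
from collections import Counter
from itertools import combinations
from math import comb


def key_abnormal(e_sets, m):
    """Pool the intersections of every choice of m dimensions."""
    if m < 1:
        raise ValueError("m must be at least 1")
    key = set()
    for chosen in combinations(e_sets, m):
        key |= set.intersection(*chosen)
    return key


def abnormal_frequency(e_sets):
    """Count, for each sample, in how many dimensions it is abnormal."""
    counts = Counter()
    for e in e_sets:
        counts.update(e)
    return counts


e_sets = [{1, 2}, {2, 3}, {2, 3, 4}]
key = key_abnormal(e_sets, 2)
counts = abnormal_frequency(e_sets)
at_least_two = {s for s, c in counts.items() if c >= 2}

# Number of combinations to evaluate for N = 9 dimensions and M = 2.
n_combinations = comb(9, 2)
```

For N = 9 and M = 2 there are 36 combinations. The count grows quickly with M up to N/2, which illustrates why M should be kept small; counting the frequency per sample gives the same set in a single pass.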

Hereinafter, a description will be given of a screening method for screening poor station area collected data, taking the station area collected data in an electric power system as an example.

The original data set is shown in Table 1. The column names in the table header are: serial number, unit of the station area, station area name, commissioning date, per-household distribution transformer capacity, number of heavy-load anomalies, number of overload anomalies, number of low-voltage anomalies, critical defects, number of power outages, average load rate and acquisition success rate. The data set comprises a total of 90 data samples.

TABLE 1 Original data set

In step 1 of the invention, 9 dimensions are extracted from the original data set: commissioning date, per-household distribution transformer capacity, number of heavy-load anomalies, number of overload anomalies, number of low-voltage anomalies, critical defects, number of power outages, average load rate and acquisition success rate. Abnormal samples are then screened out for each dimension.

Table 2 shows the sample set corresponding to the abnormal value $Z_1$, obtained with the box plot outlier algorithm when the commissioning date is taken as the current dimension.

TABLE 2 Sample set corresponding to abnormal value $Z_1$

By a similar method, the sample abnormal values $Z_2$, $Z_3$, …, $Z_N$ in the other dimensions can be calculated.

In step 2, principal component analysis is performed on the 90 samples to obtain a comprehensive ranking, and the station areas in the last 20% of the comprehensive ranking are taken, giving the preferred data set in Table 3.

TABLE 3 Preferred data set

By the above method, the 18 worst-ranked station area samples are selected from the 90 samples, resulting in the preferred data set P. The intersection of the preferred data set with the sample abnormal values $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ in each dimension is then taken, giving the abnormal values $[E_1, E_2, \ldots, E_n, \ldots, E_N]$ of the preferred data set in each dimension n.

Table 4 shows the sample set corresponding to the abnormal value $E_1$. As shown in Table 4, $E_1$ is obtained by taking the intersection of $Z_1$ with P.

TABLE 4 Sample set corresponding to abnormal value $E_1$

In one embodiment of the invention, the intersection of $Z_2$ with P is empty, and the intersection of $Z_4$ with P is empty. The intersections of $Z_3$, $Z_5$, $Z_6$, $Z_7$, $Z_8$ and $Z_9$ with P are shown in Tables 5 to 10, respectively.

TABLE 5 Intersection of $Z_3$ with P

TABLE 6 Intersection of $Z_5$ with P

TABLE 7 Intersection of $Z_6$ with P

TABLE 8 Intersection of $Z_7$ with P

TABLE 9 Intersection of $Z_8$ with P

TABLE 10 Intersection of $Z_9$ with P

If the method in step 3.2 is used, the union of the abnormal values $E_n$ in all the different dimensions is taken, giving the common abnormal sample set in Table 11.

TABLE 11 Common abnormal sample set

In this embodiment, the samples corresponding to station areas in which abnormal values occur 2 or more times are selected as the members of the key abnormal sample set, resulting in the key abnormal sample set in Table 12.

TABLE 12 Key abnormal sample set

When the preset frequency is 2, the key abnormal sample set selected by the method of the present invention from the 18 samples of the preferred data set P includes samples 1 to 3, 6, 8 to 10 and 15 to 18, and the remaining 7 samples are screened out as common abnormal samples.

According to a preset rule, the station areas corresponding to samples 1 to 3, 6, 8 to 10 and 15 to 18 can be listed as station areas for priority investment, and the station areas corresponding to the other 7 samples as station areas for ordinary investment.

In a second aspect, the present invention relates to an apparatus for screening data set samples, wherein the apparatus includes a processor for implementing a method for screening data set samples as described in the first aspect of the present invention.

The applicant has described and illustrated embodiments of the present invention in detail with reference to the accompanying drawings. Those skilled in the art should understand, however, that the above embodiments are only preferred embodiments of the present invention, and that the detailed description is intended only to help the reader better understand the spirit of the invention, not to limit its scope of protection. On the contrary, any improvement or modification based on the spirit of the present invention should fall within the scope of protection of the present invention.

Claims

1. A method of screening a sample of a data set, the method comprising the steps of:

step 1, carrying out multi-dimensional extraction on the data set, and carrying out outlier screening on all samples in the data set aiming at each dimension;

step 2, ranking all samples in the data set, and obtaining a preferred data set based on the ranking result;

step 3, calculating the abnormal frequency of each sample in the preferred data set, and screening the data set samples based on a preset standard.

2. A method of screening a sample of a data set according to claim 1, wherein:

in step 1,

after multi-dimensional extraction is performed on the data set, the number of dimensions N is greater than or equal to 3;

the N dimensions are extracted based on column names shared by all samples of the data set.

3. A method of screening a sample of a data set according to claim 2, wherein:

the outlier screening method is a box plot outlier screening method or a method based on the 80/20 rule (the "two-eight law").

4. A method of screening a sample of a data set according to claim 3, wherein:

in step 2,

the ranking method comprises a principal component analysis method and an entropy weight method.

5. A method of screening a sample of a data set according to claim 4, wherein:

the preferred data set is obtained in the following manner:

selecting the top 20% of samples from the sorting result;

or, selecting the last 20% of samples from the sorting result;

alternatively, the sample is selected based on expert opinion.

6. A method of screening a sample of a data set according to claim 5, wherein:

in step 3,

the method for calculating the abnormal frequency of each sample in the preferred data set comprises the following steps:

step 3.1, based on the abnormal values $[Z_1, Z_2, \ldots, Z_n, \ldots, Z_N]$ of all samples in each dimension obtained in step 1 and the preferred data set P obtained in step 2, calculating the abnormal values $[E_1, E_2, \ldots, E_n, \ldots, E_N]$ of the preferred data set in each dimension n, where n = 1, 2, …, N;

step 3.2, taking the union of the abnormal values $E_n$ in all the different dimensions to obtain a common abnormal sample set;

step 3.3, taking the intersection of the abnormal values $E_n$ in different dimensions to obtain a key abnormal sample set.

7. A method of screening a sample of a data set according to claim 6, wherein:

in step 3.3,

based on the abnormal value frequency M specified in the preset standard, a number of dimensions equal to M is selected, and the intersection of the abnormal values under the selected dimensions is taken to obtain the key abnormal sample set.

8. A method of screening a sample of a data set according to claim 7, wherein:

the intersection of the abnormal values under the selected dimensions is taken by selecting any M dimensions from the N dimensions and, over the $C_N^M$ possible selections, taking the intersection of the abnormal values $E_n$ of the selected dimensions;

the samples included in the intersections are the samples in the key abnormal sample set.

9. A method of screening a sample of a data set according to claim 8, wherein:

the preset standard generates the specified abnormal value frequency based on the screening requirements of the data set samples.

10. An apparatus for screening samples of a data set, wherein:

the apparatus comprises a processor,

the processor being configured to implement the method of screening a sample of a data set according to any one of claims 1 to 9.