CN114996318A

CN114996318A - Automatic judgment method and system for processing mode of abnormal value of detection data

Info

Publication number: CN114996318A
Application number: CN202210815910.XA
Authority: CN
Inventors: 高仕斌; 占栋; 李想; 张金鑫; 佘夏威; 熊昊睿; 黄瀚韬; 冯中伟
Original assignee: Southwest Jiaotong University; Chengdu Tangyuan Electric Co Ltd
Current assignee: Southwest Jiaotong University; Chengdu Tangyuan Electric Co Ltd
Priority date: 2022-07-12
Filing date: 2022-07-12
Publication date: 2022-09-02
Anticipated expiration: 2042-07-12
Also published as: CN114996318B

Abstract

The invention discloses an automatic discrimination method and system for the processing mode of abnormal values in detection data. The method determines the type of each field; counts the proportion of missing values in each data field relative to the total amount of data in that field, and judges whether the field is available; if the field is available it enters the next judging stage, otherwise it does not. When a categorical field is available and contains missing values, the proportion of missing values in that field is compared with the availability threshold R₀, and the processing mode for the field's missing values is determined from the comparison result. When a numeric field is available, the processing modes for missing values and abnormal values are determined by calculating the coefficient of variation and the proportion of missing values, respectively. By combining statistics with business rules on the basis of data analysis technology, the method improves the efficiency of data analysis and reduces the burden on big data analysts and business experts.

Description

An automatic discrimination method and system for the processing mode of abnormal values in detection data

Technical Field

The invention relates to the technical field of statistics and data mining, and in particular to an automatic discrimination method and system for the processing mode of abnormal values in detection data.

Background

In existing practice, deciding how to handle abnormal values in rail transit detection data first requires a data analyst to analyze every field of the detection data one by one to obtain each field's data type and distribution. The analyst must then, with the assistance of business experts, take the business background of each data field into account before finally deciding how the abnormal values and missing values of each field are to be handled. The drawback is that when the detection data has many dimensions or fields, this increases the burden on data analysts and business experts and lowers the efficiency of data analysis. To this end, the present invention combines statistics with business rules and, based on data analysis technology, constructs an automatic discrimination system and method for the handling of abnormal values and missing values in rail transit detection data.

Summary of the Invention

To overcome the above defects in the prior art, the purpose of the present invention is to provide an automatic discrimination method, suitable for the field of rail transit, for the processing mode of abnormal values in detection data. By combining statistics with business rules and building on data analysis technology, it constructs an automatic discrimination system for the handling of abnormal values and missing values in rail transit detection data, which effectively improves the efficiency of data analysis, reduces the burden on big data analysts and business experts, and has significant safety and practical application value.

The technical scheme of the present invention is as follows:

S1. Determine the type of each field according to the business rules relevant to that field's data. The field types include deterministic fields and indeterminate fields, where deterministic fields include numeric fields, categorical fields, and timestamp fields.

Further, step S1 includes:

retrieving, from the business rule base, the business rules relevant to each field's data;

if the business rule base specifies the field type of the data field, the data field takes the type specified in the business rule;

if there is no business rule for the field's data, obtaining the data type of each non-missing value of the data field, where the data type of each non-missing value is numeric, categorical, or timestamp;

according to the number of non-missing values of each of the three data types in the field, calculating the proportion of each data type relative to the total number of non-missing values in the field, and taking the data type with the highest proportion as the field type of the field; if the proportions of the three data types are equal, the field type of the field is indeterminate.
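The type-determination step S1 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the parsing heuristics in `value_type`, and the choice to treat `None` and the empty string as missing are all assumptions. The text only states that three equal proportions yield an indeterminate field; treating any tie for the highest proportion as indeterminate is a further assumption made here.

```python
from collections import Counter
from datetime import datetime

MISSING = (None, "")  # assumption: how missing values are encoded


def value_type(value):
    """Classify one non-missing value as 'numeric', 'timestamp' or 'categorical'."""
    if isinstance(value, bool):
        return "categorical"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, datetime):
        return "timestamp"
    text = str(value).strip()
    try:
        float(text)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)  # assumption: ISO 8601 timestamps
        return "timestamp"
    except ValueError:
        return "categorical"


def infer_field_type(values, rule_type=None):
    """Return the field type: the business-rule type if given, else the majority type."""
    if rule_type is not None:
        return rule_type  # the business rule base takes precedence
    present = [v for v in values if v not in MISSING]
    counts = Counter(value_type(v) for v in present)
    shares = {t: counts.get(t, 0) for t in ("numeric", "categorical", "timestamp")}
    top = max(shares.values())
    # Assumption: any tie for the highest count (including an empty field)
    # is reported as indeterminate.
    if list(shares.values()).count(top) > 1:
        return "indeterminate"
    return max(shares, key=shares.get)
```

For example, `infer_field_type([1, 2.5, "3", "A", None])` counts three numeric values and one categorical value among the non-missing entries, so the field is treated as numeric.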

S2. Count the proportion of missing values in each field relative to the total amount of data in the field, and judge whether the field is available; if the field is available it enters the next judging stage, otherwise it does not.

Further, when the proportion R of missing values is greater than the set availability threshold R₀, the field is judged unavailable.

Further, the data of the field type determined above is analyzed: if the combined amount of data of the other two data types in the field, as a proportion of the total amount of data in the field, is greater than the availability threshold R₀, the field is unavailable; if the field is available, the data of the other two data types in the field is converted to missing values for processing.
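The availability check of step S2 might look like the sketch below. It assumes each record has already been labelled with its value type (`None` for a missing value), and the names `field_is_available` and `demote_other_types` are hypothetical. The source does not give a numeric value for R₀, so any value passed as `r0` is illustrative; returning `False` for an empty field is also an assumption.

```python
def field_is_available(type_flags, field_type, r0):
    """Apply the two availability tests against the threshold r0.

    type_flags holds one entry per record: None for a missing value,
    otherwise the value's type name ('numeric', 'categorical', 'timestamp').
    """
    total = len(type_flags)
    if total == 0:
        return False  # assumption: an empty field is not usable
    missing_ratio = sum(1 for t in type_flags if t is None) / total
    if missing_ratio > r0:
        return False  # too many missing values
    other_ratio = sum(
        1 for t in type_flags if t is not None and t != field_type
    ) / total
    return other_ratio <= r0  # too many values of the other two types


def demote_other_types(values, type_flags, field_type):
    """For an available field, turn values of the other two types into missing values."""
    return [v if t == field_type else None for v, t in zip(values, type_flags)]
```

With eight numeric values, one missing value, and one categorical value, both ratios are 0.1, so the field is available for `r0 = 0.2` and unavailable for `r0 = 0.05`.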

Further, a standard-state database of numeric fields is constructed from the data of the available numeric fields.

Further, N sets of good-quality inspection data are extracted from the historical detection data, and the detection data is aligned by detection position to obtain the standard-state database.
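One way to picture the alignment by detection position is the sketch below. The source does not describe the database layout or how positions are matched, so the structure here (a mapping from position to the values observed across the N runs) and the assumption that positions match exactly between runs are purely illustrative.

```python
from collections import defaultdict


def build_standard_state(runs):
    """Group values from N good-quality runs by detection position.

    runs is a list of runs; each run is a list of (position, value) pairs.
    Assumption: positions are directly comparable across runs.
    """
    by_position = defaultdict(list)
    for run in runs:
        for position, value in run:
            if value is not None:  # skip missing measurements
                by_position[position].append(value)
    # Return a plain dict ordered by position.
    return dict(sorted(by_position.items()))
```

In practice, positions from different inspection runs would likely need a matching tolerance or resampling before they can be grouped; that step is omitted here.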

S3. When a categorical field is available and contains missing values, compare the proportion R of missing values in the categorical field with the availability threshold R₀, and determine the processing mode for the field's missing values according to the comparison result.

Further, when the proportion R of missing values in the categorical field is less than the availability threshold R₀, the missing values are filled with the mode of the categorical field;

when the proportion R of missing values in the categorical field is greater than or equal to the availability threshold R₀, a Softmax classification model for the categorical field is constructed from the data of the other fields, and the model's classification results for the categorical field are used to fill the field's missing values.

S4. When a numeric field is available, determine the processing modes for missing values and abnormal values by calculating the coefficient of variation and the proportion of missing values, respectively.

Further, step S4 specifically includes:

S41. Calculate the ratio of the standard deviation of the numeric field to its arithmetic mean to obtain the coefficient of variation CV.

The specific calculation formula is:

$$CV = \frac{\sigma}{\mu},$$

where $\sigma$ is the standard deviation of the field data and $\mu$ is the arithmetic mean of the field data.

According to the threshold range in which the value of the coefficient of variation falls, the abnormal values of the numeric field are identified using the judgment method assigned to that threshold range.

S42. Compare the proportion R of missing values in the numeric field with the availability threshold R₀, and determine the filling method for the field's missing values according to the comparison result.

Further, step S41 includes:

when the coefficient of variation CV is less than 15%, abnormal values are identified using the standard state;

when CV is greater than or equal to 15% and less than 35%, abnormal values are identified using the isolation forest algorithm;

when CV is greater than or equal to 35% and less than 50%, abnormal values are identified using a clustering algorithm;

when CV is greater than or equal to 50%, abnormal values are identified using the 3σ method.

Selecting the judgment method according to the threshold range in which the coefficient of variation falls improves the efficiency of automatic discrimination.
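The selection rule of step S41 can be expressed as a short Python sketch. The thresholds 15%, 35%, and 50% come from the text; everything else is an assumption for illustration, including the function names, the returned method labels, the use of the population standard deviation (`statistics.pstdev`), and taking the absolute value of the mean so that CV stays non-negative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Ratio of the standard deviation to the arithmetic mean of the field data."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: population standard deviation and absolute mean.
    return pstdev(values) / abs(m)


def select_outlier_method(cv):
    """Map a CV value (as a fraction, 0.15 == 15%) to the judgment method."""
    if cv < 0.15:
        return "standard_state"
    if cv < 0.35:
        return "isolation_forest"
    if cv < 0.50:
        return "clustering"
    return "three_sigma"
```

For the values `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5 and the population standard deviation is 2, so CV is 0.4 and the clustering branch is selected.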

Further, comparing the proportion R of missing values in the numeric field with the availability threshold R₀ and determining the missing-value filling method for the numeric field according to the comparison result includes:

when R < 0.1R₀, filling the missing values with the mean of the field's non-missing data;

when 0.1R₀ ≤ R < 0.5R₀, establishing an interpolation model between the numeric field and the detection position, and filling the missing values by interpolation;

when R ≥ 0.5R₀, constructing a regression model for the numeric field from the data of the other fields, and filling the field's missing values with the regression model.
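The three-way choice of filling method, together with one possible interpolation against detection position, is sketched below. The cut-offs 0.1R₀ and 0.5R₀ follow the text; the function names are hypothetical, and the source does not say which interpolation model is used, so linear interpolation with nearest-value filling at the ends is an assumption.

```python
def select_fill_method(r, r0):
    """Choose the missing-value filling method from the missing ratio r."""
    if r < 0.1 * r0:
        return "mean"
    if r < 0.5 * r0:
        return "interpolation"
    return "regression"


def interpolate_by_position(positions, values):
    """Fill None entries by linear interpolation against detection position.

    Assumptions: positions are sorted in ascending order; gaps at either end
    are filled with the nearest known value; if no value is known, the
    entries stay None.
    """
    known = [(p, v) for p, v in zip(positions, values) if v is not None]
    filled = list(values)
    for i, (p, v) in enumerate(zip(positions, values)):
        if v is not None:
            continue
        left = [(kp, kv) for kp, kv in known if kp < p]
        right = [(kp, kv) for kp, kv in known if kp > p]
        if left and right:
            (p0, v0), (p1, v1) = left[-1], right[0]
            filled[i] = v0 + (v1 - v0) * (p - p0) / (p1 - p0)
        elif left or right:
            filled[i] = (left[-1] if left else right[0])[1]
    return filled
```

For positions `[0, 1, 2, 3]` and values `[1.0, None, 3.0, None]`, the gap at position 1 is interpolated to 2.0 and the trailing gap takes the nearest known value, 3.0.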

Compared with the prior art, the present invention has the following beneficial effects:

1. Expert experience and business rules are combined, so that determining how to handle abnormal values and missing values in detection data is automated;

2. Starting from data quality and taking data availability into account makes the discrimination results more reliable;

3. Historical detection data is fully used in constructing the numeric variables;

4. The automatic discrimination system is built in modules, which facilitates computer implementation.

Based on the above automatic discrimination method, the present invention also provides an automatic discrimination system for the processing mode of abnormal values in detection data, comprising:

a business rule discrimination module, used to set and store the business rules of each field, where the business rules include the field's data type and the field's value range or value set;

a field type automatic discrimination module, used to analyze the data types of data fields not specified in the business rules in order to determine each such field's type, the field types including deterministic fields and indeterminate fields, where deterministic fields include numeric fields, categorical fields, and timestamp fields;

a data field availability automatic discrimination module, used to assess the quality of each data field in order to judge whether the field is meaningful for analysis;

a standard-state database module, used to determine the processing modes for abnormal values and missing values of numeric fields;

a data field processing mode automatic discrimination module, used to determine the specific processing mode for abnormal values and/or missing values in each data field type.

Further, analyzing the data types of data fields not specified in the business rules includes analyzing, among the non-missing values of each data field, the proportions of numeric, categorical, and timestamp values, so as to obtain the field type of each data field.

Further, assessing the quality of each data field includes judging the degree of data confusion, the proportion of missing values, and the presence of duplicate values.

Further, if the field's data is confused and its type is indeterminate, the field is judged unavailable.

Further, when the numbers of numeric and categorical values in the field are equal, the data is judged confused; if in addition no type is specified in the business rules, the data type is indeterminate;

when the count of some single value in the field's data, as a proportion of the total number of non-missing values, exceeds a preset threshold, the data is judged to have too many duplicate values;

when the number of missing values in the field's data, as a proportion of the total amount of data, exceeds a preset threshold, the data is judged to have too many missing values and is unavailable.
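The duplicate-value and missing-value checks described above can be sketched as follows. The function name, the two threshold parameters, and the returned dictionary keys are hypothetical; the source only says that each check compares a proportion against a preset threshold without giving values.

```python
from collections import Counter


def quality_flags(values, missing_threshold, duplicate_threshold):
    """Flag a field with too many missing values or one dominant repeated value.

    None marks a missing value (assumption).
    """
    total = len(values)
    present = [v for v in values if v is not None]
    # Share of missing values among all records.
    missing_ratio = (total - len(present)) / total if total else 1.0
    # Share of the single most frequent value among non-missing records.
    top_share = (
        Counter(present).most_common(1)[0][1] / len(present) if present else 0.0
    )
    return {
        "too_many_missing": missing_ratio > missing_threshold,
        "too_many_duplicates": top_share > duplicate_threshold,
    }
```

For `[1, 1, 1, 2, None]` the missing ratio is 0.2 and the most frequent value accounts for 0.75 of the non-missing values, so with thresholds 0.3 and 0.7 only the duplicate flag is raised.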

Further, the standard-state database is constructed from the data of the available numeric fields.

Description of the Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The concept, specific embodiments, and technical effects of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawing, so that the purpose, features, and effects of the present invention can be fully understood.

Example 1

As shown in FIG. 1, this embodiment proposes an automatic discrimination method, suitable for the field of rail transit, for the processing mode of abnormal values in detection data, including the following steps:

S1. Determine the type of each data field according to the business rules relevant to that data field. The field types include deterministic fields and indeterminate fields, where deterministic fields include numeric fields, categorical fields, and timestamp fields;

S2. Count the proportion of missing values in each data field relative to the total amount of data in the field, and judge whether the field is available; if the field is available it enters the next judging stage, otherwise it does not;

S3. When a categorical field is available and contains missing values, compare the proportion R of missing values in the categorical field with N times the availability threshold R₀, and determine the processing mode for the field's missing values according to the comparison result;

Example 2

On the basis of Example 1, the present invention proposes a method for determining the data type, including:

retrieving, from the business rule base, the business rules relevant to each data field;

if the business rule base specifies the field type of the data field, the data field takes the type specified in the business rule;

if there is no business rule for the data field, obtaining the type of each non-missing value of the data field, where the data type of each non-missing value is numeric, categorical, or timestamp;

according to the amount of data of each of the three data types among the field's non-missing values, calculating the proportion of each data type relative to the total number of non-missing values in the field, and taking the data type with the highest proportion as the field type of the data field; if the proportions of the three data types are equal, the field type of the data field is indeterminate.

Example 3

On the basis of Example 2, a method for judging whether a field is available is proposed, specifically including:

when the proportion R of missing values is greater than the set availability threshold R₀, the data field is judged unavailable.

Further, the data of the field type determined above is analyzed: if the combined amount of data of the other two data types in the data field, as a proportion of the total amount of data in the field, is greater than the availability threshold R₀, the data field is unavailable; if it is available, the data of the other two data types in the data field is converted to missing values for processing.

Example 4

A standard-state database of numeric fields is constructed from the data of the available numeric fields.

Example 5

On the basis of Example 3, the processing mode for missing values of categorical fields proposed in this scheme is determined as follows:

when the proportion R of missing values in the categorical field is less than N times the availability threshold R₀, the missing values are filled with the mode of the categorical field;

when the proportion R of missing values in the categorical field is greater than or equal to N times the availability threshold R₀, a Softmax classification model for the categorical field is constructed from the non-missing data of the other fields, the categorical field is classified, and the field's missing values are filled according to the model's classification results; in this scheme N is preferably 0.1.
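The categorical rule of this example can be sketched as below, with N defaulting to the preferred value 0.1. The function names and return labels are hypothetical, `None` is assumed to mark a missing value, and the Softmax classifier itself is not implemented here; only the decision and the mode filling are shown.

```python
from collections import Counter


def select_categorical_fill(r, r0, n=0.1):
    """Choose mode filling when r < n * r0, otherwise a Softmax classifier."""
    return "mode" if r < n * r0 else "softmax_classifier"


def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to derive a mode from
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

For example, `fill_with_mode(["a", None, "b", "a"])` returns `["a", "a", "b", "a"]`.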

Example 6

On the basis of Example 3, the processing modes for missing values and abnormal values of numeric fields proposed in this scheme are determined as follows.

The specific calculation formula for the coefficient of variation CV is:

$$CV = \frac{\sigma}{\mu},$$

where $\sigma$ is the standard deviation of the field data and $\mu$ is the arithmetic mean of the field data.

S42. Compare the proportion R of missing values in the numeric field with the availability threshold R₀, and determine the filling method for the field's missing values according to the comparison result.

Further, step S41 includes:

when CV < 15%, the discrimination result is to identify abnormal values using the standard state;

when 15% ≤ CV < 35%, the discrimination result is to identify abnormal values using the isolation forest algorithm;

when 35% ≤ CV < 50%, the discrimination result is to identify abnormal values using a clustering algorithm;

when CV ≥ 50%, the discrimination result is to identify abnormal values using the 3σ method.

Selecting the judgment method according to the threshold range in which the coefficient of variation falls improves the efficiency of automatic discrimination.
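Of the four judgment methods, the 3σ method is the simplest to show in full: a value is flagged when it lies more than three standard deviations from the mean. The sketch below is illustrative only; the function name is hypothetical, and using the population standard deviation is an assumption, since the source does not say which estimator is intended.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values):
    """Return the indices of values more than 3 standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [i for i, v in enumerate(values) if abs(v - m) > 3 * s]
```

With twenty readings of 10.0 followed by one reading of 100.0, only the last index is flagged: the mean is about 14.3, the standard deviation about 19.2, and the last value deviates by about 85.7.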

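When R < 0.1R₀, the missing values are filled with the mean of the non-missing data in the numeric field;

when 0.1R₀ ≤ R < 0.5R₀, an interpolation model is established between the numeric field and the detection position, and the missing values are filled by interpolation;

when R ≥ 0.5R₀, a regression model for the numeric field is constructed from the data of the other fields, and the field's missing values are filled with the regression model.

1. Expert experience and business rules are combined, so that determining how to handle abnormal values and missing values in detection data is automated;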

Example 7

A data field type automatic discrimination module, used to analyze the data types of data fields not specified in the business rules in order to determine each such field's type, the field types including deterministic fields and indeterminate fields, where deterministic fields include numeric fields, categorical fields, and timestamp fields;

a data field processing mode automatic discrimination module, used to determine the specific processing mode for abnormal values and/or missing values in each data field type.

Further, analyzing the data types of data fields not specified in the business rules includes analyzing, among the non-missing values of each data field, the proportions of numeric, categorical, and timestamp values, so as to obtain the field type of each data field.

Further, assessing the quality of the data fields includes judging the degree of data confusion, the proportion of missing values, and the presence of duplicate values.

The embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present invention, and all such equivalents or substitutions fall within the scope defined by the claims of the present invention.

Claims

1. An automatic discrimination method for the processing mode of abnormal values in detection data, characterized in that it comprises:

determining each field type according to the business rules relevant to each field's data, the field types including deterministic fields and indeterminate fields, wherein the deterministic fields include numeric fields, categorical fields, and timestamp fields;

counting the proportion R of missing values in the field relative to the total amount of data in the field, and judging whether the field is available; if the field is available, entering the next judging stage, otherwise not entering the next judging stage;

when a categorical field is available and contains missing values, comparing the proportion R of missing values in the categorical field with the availability threshold R₀, and determining the processing mode for the missing values of the categorical field according to the comparison result;

when a numeric field is available, determining the processing modes for missing values and abnormal values by calculating the coefficient of variation and the proportion of missing values, respectively.

2. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 1, characterized in that a standard-state database of numeric fields is constructed from the data of the available numeric fields.

3. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 1, characterized in that:

if the field type is not specified in the business rule base, the data type of each non-missing value in the field is obtained, wherein the data types include numeric, categorical, and timestamp;

according to the amount of non-missing data of each of the three data types, the proportion of each data type relative to the total number of non-missing values in the field is calculated;

the field type is determined according to the proportions of the data types in the field.

4. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 3, characterized in that:

determining the field type according to the proportions of the data types in the field specifically includes:

taking the data type with the highest proportion as the type of the deterministic field;

if the proportions of the three data types are equal, the field type is an indeterminate field.

5. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 1, characterized in that:

judging whether the field is available includes:

when the proportion R of missing values is greater than the set availability threshold R₀, determining that the field is unavailable.

6. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 5, characterized in that judging whether the field is available further includes:

counting the combined amount of data of the other two data types in the deterministic field as a proportion of the total amount of data in the field;

if the proportion is greater than the set availability threshold R₀, the deterministic field is unavailable; otherwise, the deterministic field is available.

7. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 6, characterized in that,

when the deterministic field is available,

the data of the other two data types in the deterministic field is converted to missing values for processing.

8. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 1, characterized in that

determining the processing mode for the missing values of the categorical field according to the comparison result includes:

when the proportion R of missing values in the categorical field is less than N times the availability threshold R₀, filling the missing values with the mode of the categorical field;

when the proportion R of missing values in the categorical field is greater than or equal to N times the availability threshold R₀, constructing a Softmax classification model for the categorical field from the data of the other fields, and filling the missing values of the categorical field with the model's classification results for that field.

9. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 1, characterized in that determining the processing modes for missing values and abnormal values by calculating the coefficient of variation and the proportion of missing values, respectively, includes:

calculating the ratio of the standard deviation of the numeric field to its arithmetic mean to obtain the coefficient of variation CV, and, according to the threshold range in which the value of the coefficient of variation falls, identifying the abnormal values of the numeric field using the judgment method assigned to that threshold range;

comparing the proportion R of missing values in the numeric field with the availability threshold R₀, and determining the filling method for the missing values of the numeric field according to the comparison result.

10. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 9, characterized in that

identifying the abnormal values of the numeric field using the judgment method assigned to the threshold range in which the value of the coefficient of variation falls specifically includes:

when CV < 15%, identifying abnormal values using the standard state;

when 15% ≤ CV < 35%, identifying abnormal values using the isolation forest algorithm;

when 35% ≤ CV < 50%, identifying abnormal values using a clustering algorithm;

when CV ≥ 50%, identifying abnormal values using the 3σ method.

11. The automatic discrimination method for the processing mode of abnormal values in detection data according to claim 9, characterized in that:

when R < 0.1R₀, the missing values are filled with the mean of the field's non-missing data;

when 0.1R₀ ≤ R < 0.5R₀, an interpolation model is established between the numeric field and the detection position, and the missing values are filled by interpolation;

when R ≥ 0.5R₀, a regression model for the numeric field is constructed from the data of the other fields, and the missing values of the numeric field are filled with the regression model.

12. An automatic discrimination system for the processing mode of abnormal values in detection data, characterized in that it comprises a business rule discrimination module, a data field type automatic discrimination module, a data field availability automatic discrimination module, a standard-state database module, and a data field processing mode automatic discrimination module;

the business rule discrimination module is used to set and store the business rules of each field, wherein the business rules include the field's data type and the field's value range or value set;

the data field type automatic discrimination module is used to analyze the data types of data fields not specified in the business rules in order to determine each such field's type, the field types including deterministic fields and indeterminate fields, wherein deterministic fields include numeric fields, categorical fields, and timestamp fields;

the data field availability automatic discrimination module is used to assess the quality of each data field in order to judge whether the field is meaningful for analysis;

the standard-state database module is used to determine the processing modes for abnormal values and missing values of numeric fields;

the data field processing mode automatic discrimination module is used to determine the specific processing mode for abnormal values and/or missing values in each data field type.