CN114996318A - Automatic judgment method and system for processing mode of abnormal value of detection data - Google Patents
- Publication number
- CN114996318A
- Authority
- CN
- China
- Prior art keywords
- field
- data
- value
- missing
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/24564: Applying rules; Deductive queries
- G06F16/2462: Approximate or statistical queries
- G06F16/906: Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The invention relates to the fields of statistics and data mining, and in particular to a method and system for automatically determining how outliers (abnormal values) in detection data should be handled.
Background
At present, deciding how to handle outliers in rail transit detection data requires a data analyst to examine every field of the detection data one by one in order to obtain each field's data type and distribution. The analyst must then, with the help of business experts, take the business background of each field into account before finally deciding how its outliers and missing values are handled. The drawback is that when the detection data has many dimensions or fields, this approach increases the burden on data analysts and business experts and lowers the efficiency of data analysis. The present invention therefore combines statistics with business rules and, based on data analysis technology, constructs an automatic discrimination system and method for the handling of outliers and missing values in rail transit detection data.
Summary of the Invention
To overcome the above defects of the prior art, the object of the invention is to provide an automatic discrimination method, suited to the rail transit field, for the handling of outliers in detection data. By combining statistics with business rules and building on data analysis technology, it constructs an automatic discrimination system for the handling of outliers and missing values in rail transit detection data. This effectively improves the efficiency of data analysis, reduces the burden on big data analysts and business experts, and has significant safety relevance and practical value.
The technical solution of the invention is as follows:
S1. Determine the field type of each field according to the business rules associated with that field's data. Field types comprise determinate and indeterminate fields, where determinate fields comprise numeric fields, categorical fields, and timestamp fields.
Further, step S1 comprises:
retrieving, from the business rule base, the business rules associated with each field's data;
if the business rule base specifies the field type of the data field, taking the type specified in the business rule as the field type;
if no business rule exists for the field's data, obtaining the data type of each non-missing value of the data field, the data type of each non-missing value being numeric, categorical, or timestamp;
counting the non-missing values of each of the three data types, computing each type's share of the field's total non-missing values, and taking the data type with the highest share as the field type; if the three data types have equal shares, the field type is indeterminate.
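The type-inference rule of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names `value_type` and `infer_field_type`, the missing-value markers, and the parsing rules (a value that parses as a float is numeric, one that parses as an ISO date is a timestamp, anything else is categorical) are all assumptions. The sketch also treats any tie for the highest share as indeterminate, whereas the text only states the case where all three shares are equal.

```python
from collections import Counter
from datetime import datetime

MISSING = (None, "")  # assumed markers for a missing value


def value_type(value):
    """Classify one non-missing value (assumed parsing rules)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numeric"
    if isinstance(value, datetime):
        return "timestamp"
    text = str(value).strip()
    try:
        float(text)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)
        return "timestamp"
    except ValueError:
        return "categorical"


def infer_field_type(values, rule_type=None):
    """Use the business rule's type if present, else the majority type."""
    if rule_type is not None:
        return rule_type
    present = [v for v in values if v not in MISSING]
    counts = Counter(value_type(v) for v in present)
    if not counts:
        return "indeterminate"
    ranked = counts.most_common()
    # Assumption: a tie for the highest share leaves the type undecided.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "indeterminate"
    return ranked[0][0]
```

For example, `infer_field_type(["1.5", "2", "x", None])` yields `"numeric"` because two of the three non-missing values parse as numbers.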
S2. For each field, compute the ratio of the number of missing values to the field's total number of values and determine whether the field is available; an available field proceeds to the next discrimination stage, and an unavailable field does not.
Further, when the missing-value ratio R exceeds the configured availability threshold R0, the field is judged unavailable.
Further, the data of the field type determined above is analyzed: if the values belonging to the other two data types together account for a share of the field's total data greater than the availability threshold R0, the field is unavailable; if the field is available, the values of the other two data types are converted to missing values.
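The availability check of step S2 can be sketched as below. The function name `assess_field`, the use of `None` as the missing marker, and the per-value type labels passed in as a list are assumptions made for illustration; the text does not give a value for R0, so it is left as a parameter.

```python
def assess_field(values, types, field_type, r0):
    """Return (available, cleaned_values) for one field.

    values: raw values, with None marking a missing value (assumption).
    types: per-value type label, None where the value is missing.
    field_type: the type determined for the field in step S1.
    r0: availability threshold (no value is given in the text).
    """
    total = len(values)
    if total == 0:
        return False, []
    missing_ratio = sum(v is None for v in values) / total
    if missing_ratio > r0:
        return False, list(values)
    # Share of values whose type differs from the field type.
    off_type = [v is not None and t != field_type
                for v, t in zip(values, types)]
    if sum(off_type) / total > r0:
        return False, list(values)
    # Available: off-type values are converted to missing values.
    cleaned = [None if bad else v for v, bad in zip(values, off_type)]
    return True, cleaned
```

With five values of which one is missing and one is off-type, a threshold of 0.3 keeps the field and blanks the off-type value, while a threshold of 0.1 rejects the field.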
Further, a standard-state database of numeric fields is constructed from the data of the available numeric fields.
Further, N runs of good-quality inspection data are extracted from the historical detection data and aligned by detection position to obtain the standard-state database.
S3. When a categorical field is available and contains missing values, compare the field's missing-value ratio R with the availability threshold R0 and determine, from the result of the comparison, how the field's missing values are handled.
Further, when the missing-value ratio R of the categorical field is less than the availability threshold, the missing values are filled with the mode of the categorical field;
when the missing-value ratio R of the categorical field is greater than or equal to the availability threshold, a Softmax classification model for the categorical field is built from the data of the other fields, and the model's classification results are used to fill the field's missing values.
S4. When a numeric field is available, determine how its outliers and missing values are handled by computing the coefficient of variation and the missing-value ratio, respectively.
Further, step S4 specifically comprises:
S41. Compute the ratio of the standard deviation to the arithmetic mean of the numeric field to obtain the coefficient of variation CV;
The formula is:
$$CV = \frac{\sigma}{\mu}$$
where $\sigma$ is the standard deviation of the field data and $\mu$ is the arithmetic mean of the field data;
according to the threshold range in which the coefficient of variation falls, the outliers of the numeric field are identified using the detection method assigned to that range;
S42. Compare the missing-value ratio R of the numeric field with the availability threshold and determine, from the result of the comparison, how the field's missing values are filled.
Further, step S41 comprises:
when CV is less than 15%, outliers are identified using the standard state;
when CV is at least 15% and less than 35%, outliers are identified using the isolation forest algorithm;
when CV is at least 35% and less than 50%, outliers are identified using a clustering algorithm;
when CV is at least 50%, outliers are identified using the 3σ method.
Choosing the detection method according to the threshold range in which the coefficient of variation falls improves the efficiency of automatic discrimination.
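The selection rule of step S41 can be sketched as a small dispatch on the coefficient of variation. The function names and the returned labels are hypothetical. The text does not say whether the population or the sample standard deviation is meant, so the population form is assumed here, and the absolute value of the mean is used so that a negative mean does not produce a negative CV (also an assumption).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation / arithmetic mean (population std assumed)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(mu)


def select_outlier_method(cv):
    """Map a CV value (as a fraction, 0.15 = 15%) to a detection method."""
    if cv < 0.15:
        return "standard_state"
    if cv < 0.35:
        return "isolation_forest"
    if cv < 0.50:
        return "clustering"
    return "three_sigma"
```

The thresholds are the ones stated in the text; only the string labels are invented for the example.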
Further, comparing the missing-value ratio R of the numeric field with the availability threshold and determining the filling method from the result of the comparison comprises:
when [condition formula missing from the source], the missing values are filled with the mean of the field's non-missing data;
when [condition formula missing from the source], an interpolation model is built between the numeric field and the detection position, and the missing values are filled by interpolation;
when [condition formula missing from the source], a regression model for the numeric field is built from the data of the other fields, and the missing values of the numeric field are filled using the regression model.
Compared with the prior art, the invention has the following beneficial effects:
1. Expert experience and business rules are combined, so that deciding how to handle outliers and missing values in detection data is automated;
2. The decision starts from data quality and takes data availability into account, which makes the results more reliable;
3. Historical detection data is fully used when constructing numeric variables;
4. The automatic discrimination system is modular, which facilitates computer implementation.
Based on the above automatic discrimination method, the invention further provides an automatic discrimination system for the handling of outliers in detection data, comprising:
a business rule discrimination module, for setting and storing the business rules of each field, where a business rule includes the field's data type and its value range or value set;
a field type automatic discrimination module, for analyzing the data types of data fields not specified in the business rules in order to determine their field types, where field types comprise determinate and indeterminate fields and determinate fields comprise numeric fields, categorical fields, and timestamp fields;
a data field availability automatic discrimination module, for assessing the quality of each data field in order to determine whether the field is meaningful for analysis;
a standard-state database module, for determining how outliers and missing values of numeric fields are handled;
a data field handling automatic discrimination module, for determining the specific handling of outliers and/or missing values for each data field type.
Further, analyzing the data types of data fields not specified in the business rules comprises analyzing, among the non-missing values of each data field, the shares of numeric, categorical, and timestamp values in order to obtain the field type of each data field.
Further, assessing the quality of each data field comprises assessing the degree of data disorder, the share of missing values, and duplicate values.
Further, if a field's data is disordered and its type is indeterminate, the field is judged unavailable.
Further, when a field contains equal numbers of numeric and categorical values, its data is judged disordered; if, in addition, no type is specified in the business rules, the data type is indeterminate;
when the share of any single value among the field's non-missing values exceeds a preset threshold, the field is judged to have too many duplicate values;
when the share of missing values in the field's total data exceeds a preset threshold, the field is judged to have too many missing values and the data is unavailable.
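The duplicate-value and missing-value checks can be sketched as follows. The function name `quality_flags`, the flag names, and the use of `None` for missing values are assumptions, and both thresholds are parameters because the text only calls them "preset".

```python
from collections import Counter


def quality_flags(values, dup_threshold, missing_threshold):
    """Flag a field with too many missing or duplicate values.

    values: field values, with None marking a missing value (assumption).
    Both thresholds are fractions in [0, 1]; the text gives no values.
    """
    total = len(values)
    present = [v for v in values if v is not None]
    # Share of missing values among all values of the field.
    missing_share = (total - len(present)) / total if total else 1.0
    # Share of the most frequent value among the non-missing values.
    if present:
        top_share = Counter(present).most_common(1)[0][1] / len(present)
    else:
        top_share = 0.0
    return {
        "too_many_missing": missing_share > missing_threshold,
        "too_many_duplicates": top_share > dup_threshold,
    }
```

Note that the two shares use different denominators, as in the text: duplicates are measured against the non-missing values, missing values against all values.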
Further, the standard-state database is constructed from the data of the available numeric fields.
Further, N runs of good-quality inspection data are extracted from the historical detection data and aligned by detection position to obtain the standard-state database.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed Description
The concept, specific embodiments, and technical effects of the invention are described clearly and completely below with reference to the embodiments and the drawing, so that the objects, features, and effects of the invention can be fully understood.
Embodiment 1
As shown in Fig. 1, this embodiment provides an automatic discrimination method, suited to the rail transit field, for the handling of outliers in detection data, comprising the following steps:
S1. Determine the field type of each data field according to the business rules associated with that field. Field types comprise determinate and indeterminate fields, where determinate fields comprise numeric fields, categorical fields, and timestamp fields;
S2. For each data field, compute the ratio of the number of missing values to the field's total number of values and determine whether the field is available; an available field proceeds to the next discrimination stage, and an unavailable field does not;
S3. When a categorical field is available and contains missing values, compare the field's missing-value ratio R with N times the availability threshold and determine, from the result of the comparison, how the field's missing values are handled;
S4. When a numeric field is available, determine how its outliers and missing values are handled by computing the coefficient of variation and the missing-value ratio, respectively.
Embodiment 2
On the basis of Embodiment 1, the invention provides a method for determining data types, comprising:
retrieving, from the business rule base, the business rules associated with each data field;
if the business rule base specifies the field type of the data field, taking the type specified in the business rule as the field type;
if no business rule exists for the data field, obtaining the type of each non-missing value of the data field, the data type of each non-missing value being numeric, categorical, or timestamp;
counting the non-missing values of each of the three data types, computing each type's share of the field's total non-missing values, and taking the data type with the highest share as the field type of the data field; if the three data types have equal shares, the field type of the data field is indeterminate.
Embodiment 3
On the basis of Embodiment 2, a method for determining whether a field is available is provided, specifically comprising:
when the missing-value ratio R exceeds the configured availability threshold, the data field is judged unavailable.
Further, the data of the field type determined above is analyzed: if the values belonging to the other two data types together account for a share of the field's total data greater than the availability threshold, the data field is unavailable; if it is available, the values of the other two data types in the data field are converted to missing values.
Embodiment 4
A standard-state database of numeric fields is constructed from the data of the available numeric fields.
Further, N runs of good-quality inspection data are extracted from the historical detection data and aligned by detection position to obtain the standard-state database.
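One way to picture the alignment by detection position is the sketch below. It assumes each of the N runs is a list of `(position, value)` pairs and that positions from different runs are directly comparable keys (for example, mileage rounded to a common grid); the function name and this data layout are hypothetical, and the text does not specify how the aligned values are later used to identify outliers.

```python
from collections import defaultdict


def build_standard_state(runs):
    """Align N inspection runs by detection position.

    runs: list of runs, each a list of (position, value) pairs.
    Returns a dict mapping each position to the values observed
    at that position across the runs, ordered by position.
    """
    by_position = defaultdict(list)
    for run in runs:
        for position, value in run:
            if value is not None:  # skip missing readings
                by_position[position].append(value)
    return {pos: by_position[pos] for pos in sorted(by_position)}
```

For two runs over positions 0 and 1, the result holds two readings per position regardless of the order in which each run lists them.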
Embodiment 5
On the basis of Embodiment 3, the handling of missing values in categorical fields is determined as follows:
when the missing-value ratio R of the categorical field is less than N times the availability threshold, the missing values are filled with the mode of the categorical field;
when the missing-value ratio R of the categorical field is greater than or equal to N times the availability threshold, a Softmax classification model for the categorical field is built from the non-missing data of the other fields and used to classify the categorical field, and the field's missing values are filled with the model's classification results. In this solution N is preferably 0.1.
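The decision in this embodiment, together with the simpler of the two fill methods, can be sketched as below. The function names and returned labels are hypothetical, `None` is assumed to mark a missing value, and the Softmax model itself is not shown; only the choice between the two strategies and the mode fill are illustrated.

```python
from collections import Counter


def categorical_fill_strategy(values, r0, n=0.1):
    """Choose the fill method by comparing R with N times the threshold."""
    ratio = sum(v is None for v in values) / len(values)
    return "mode" if ratio < n * r0 else "softmax_model"


def fill_with_mode(values):
    """Replace missing values with the most frequent non-missing value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

With N = 0.1 as preferred in the text and an assumed threshold of 0.6, a field with 5% missing values is filled with its mode, while a field with 50% missing values would be handed to the classification model.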
Embodiment 6
On the basis of Embodiment 3, the handling of missing values and outliers in numeric fields is determined as follows:
S41. Compute the ratio of the standard deviation to the arithmetic mean of the numeric field to obtain the coefficient of variation CV;
The formula is:
$$CV = \frac{\sigma}{\mu}$$
where $\sigma$ is the standard deviation of the field data and $\mu$ is the arithmetic mean of the field data;
according to the threshold range in which the coefficient of variation falls, the outliers of the numeric field are identified using the detection method assigned to that range;
S42. Compare the missing-value ratio R of the numeric field with the availability threshold and determine, from the result of the comparison, how the field's missing values are filled.
Further, step S41 comprises:
when CV < 15%, the result of the discrimination is that outliers are identified using the standard state;
when 15% ≤ CV < 35%, the result of the discrimination is that outliers are identified using the isolation forest algorithm;
when 35% ≤ CV < 50%, the result of the discrimination is that outliers are identified using a clustering algorithm;
when CV ≥ 50%, the result of the discrimination is that outliers are identified using the 3σ method.
Choosing the detection method according to the threshold range in which the coefficient of variation falls improves the efficiency of automatic discrimination.
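Of the four detection methods named above, the 3σ method is the simplest to illustrate: a value is flagged when it lies more than three standard deviations from the mean. The sketch below assumes the population standard deviation and returns the indices of flagged values; the function name is hypothetical.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values):
    """Return indices of values farther than 3 sigma from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]
```

For twenty readings of 10.0 followed by a single reading of 100.0, only the last index is flagged.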
Further, comparing the missing-value ratio R of the numeric field with the availability threshold and determining the filling method from the result of the comparison comprises:
when [condition formula missing from the source], the missing values are filled with the mean of the non-missing data in the numeric field;
when [condition formula missing from the source], an interpolation model is built between the numeric field and the detection position, and the missing values are filled by interpolation;
when [condition formula missing from the source], a regression model for the numeric field is built from the data of the other fields, and the missing values of the numeric field are filled using the regression model.
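The interpolation option can be illustrated with a minimal sketch. The text does not say which interpolation model is used, so linear interpolation over the detection position is assumed here, with the nearest known value used at the ends of the range; the function name and the use of `None` for missing values are also assumptions.

```python
def interpolate_missing(positions, values):
    """Fill None entries by linear interpolation over detection position."""
    known = sorted((p, v) for p, v in zip(positions, values) if v is not None)
    filled = list(values)
    for i, (p, v) in enumerate(zip(positions, values)):
        if v is not None:
            continue
        # Nearest known neighbours on each side of position p.
        left = max((k for k in known if k[0] < p), default=None)
        right = min((k for k in known if k[0] > p), default=None)
        if left is not None and right is not None:
            (p0, v0), (p1, v1) = left, right
            filled[i] = v0 + (v1 - v0) * (p - p0) / (p1 - p0)
        elif left is not None:
            filled[i] = left[1]   # past the last known point
        elif right is not None:
            filled[i] = right[1]  # before the first known point
    return filled
```

A gap between two known readings is filled proportionally to its position, and a gap at either end takes the nearest known reading.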
Compared with the prior art, the invention has the following beneficial effects:
1. Expert experience and business rules are combined, so that deciding how to handle outliers and missing values in detection data is automated;
2. The decision starts from data quality and takes data availability into account, which makes the results more reliable;
3. Historical detection data is fully used when constructing numeric variables;
4. The automatic discrimination system is modular, which facilitates computer implementation.
Embodiment 7
Based on the above automatic discrimination method, the invention further provides an automatic discrimination system for the handling of outliers in detection data, comprising:
a business rule discrimination module, for setting and storing the business rules of each field, where a business rule includes the field's data type and its value range or value set;
a data field type automatic discrimination module, for analyzing the data types of data fields not specified in the business rules in order to determine their field types, where field types comprise determinate and indeterminate fields and determinate fields comprise numeric fields, categorical fields, and timestamp fields;
a data field availability automatic discrimination module, for assessing the quality of each data field in order to determine whether the field is meaningful for analysis;
a standard-state database module, for determining how outliers and missing values of numeric fields are handled;
a data field handling automatic discrimination module, for determining the specific handling of outliers and/or missing values for each data field type.
Further, analyzing the data types of data fields not specified in the business rules comprises analyzing, among the non-missing values of each data field, the shares of numeric, categorical, and timestamp values in order to obtain the field type of each data field.
Further, assessing the quality of a data field comprises assessing the degree of data disorder, the share of missing values, and duplicate values.
Further, if a field's data is disordered and its type is indeterminate, the field is judged unavailable.
Further, when a field contains equal numbers of numeric and categorical values, its data is judged disordered; if, in addition, no type is specified in the business rules, the data type is indeterminate;
when the share of any single value among the field's non-missing values exceeds a preset threshold, the field is judged to have too many duplicate values;
when the share of missing values in the field's total data exceeds a preset threshold, the field is judged to have too many missing values and the data is unavailable.
Further, the standard-state database is constructed from the data of the available numeric fields.
Further, N runs of good-quality inspection data are extracted from the historical detection data and aligned by detection position to obtain the standard-state database.
The embodiments of the invention have been described in detail above, but the invention is not limited to them. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the invention, and all such equivalents or substitutions fall within the scope defined by the claims of the invention.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210815910.XA CN114996318B (en) | 2022-07-12 | 2022-07-12 | Automatic judgment method and system for processing mode of abnormal value of detection data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210815910.XA CN114996318B (en) | 2022-07-12 | 2022-07-12 | Automatic judgment method and system for processing mode of abnormal value of detection data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114996318A true CN114996318A (en) | 2022-09-02 |
CN114996318B CN114996318B (en) | 2022-11-04 |
Family ID=83020719
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210815910.XA Active CN114996318B (en) | 2022-07-12 | 2022-07-12 | Automatic judgment method and system for processing mode of abnormal value of detection data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114996318B (en) |
CN113934716A (en) * | 2021-09-27 | 2022-01-14 | 杭州电子科技大学 | A smart campus-oriented time series data restoration method based on coefficient of variation constraints |
CN114492552A (en) * | 2020-11-12 | 2022-05-13 | 中移动信息技术有限公司 | Method, device and equipment for training broadband user authenticity judgment model |
CN114660378A (en) * | 2022-02-28 | 2022-06-24 | 成都唐源电气股份有限公司 | A Catenary Comprehensive Diagnosis Method Based on Multi-source Detection Parameters |
Also Published As
Publication number | Publication date |
---|---|
CN114996318B (en) | 2022-11-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111882446B (en) | Abnormal account detection method based on graph convolution network | |
CN114996318A (en) | Automatic judgment method and system for processing mode of abnormal value of detection data | |
CN107742127A (en) | An improved anti-stealing intelligent early warning system and method | |
CN109949152A (en) | A Personal Credit Default Prediction Method | |
CN111695823B (en) | Industrial control network flow-based anomaly evaluation method and system | |
CN107679734A (en) | It is a kind of to be used for the method and system without label data classification prediction | |
CN110866331A (en) | An Evaluation Method for Quality Defects of Power Transformer Family | |
CN115510302B (en) | Smart factory data classification method based on big data statistics | |
CN105574642A (en) | Smart grid big data-based electricity price execution checking method | |
CN117349786B (en) | Evidence fusion transformer fault diagnosis method based on data equalization | |
CN109215799B (en) | Screening method for false association signals in reported adverse drug reaction data of concomitant medications | |
CN117829994A (en) | Money laundering risk analysis method based on graph calculation | |
CN111709668A (en) | Method and device for risk identification of power grid equipment parameters based on data mining technology | |
CN117171157A (en) | Clearing data acquisition and cleaning method based on data analysis | |
CN112330095A (en) | Quality management method based on decision tree algorithm | |
CN116739645A (en) | Order abnormity supervision system based on enterprise management | |
CN112949735A (en) | Liquid hazardous chemical substance volatile concentration abnormity discovery method based on outlier data mining | |
CN114385403A (en) | Distributed collaborative fault diagnosis method based on two-layer knowledge graph architecture | |
CN118279034A (en) | Internet financial wind control report analysis method and system based on artificial intelligence | |
CN110703183A (en) | Intelligent electric energy meter fault data analysis method and system | |
CN119722282A (en) | A method and system for building a corporate credit scoring model | |
CN113657747A (en) | Enterprise safety production standardization level intelligent evaluation system | |
CN115081716A (en) | Enterprise default risk prediction method, computer equipment and storage medium | |
CN118779587B (en) | Gas use abnormality judging method based on gas user classification model | |
CN118587019B (en) | A method for inferring and identifying the time of an accident based on big data of Internet of Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||