CN115878599A

CN115878599A - Sewage industry data cleaning method

Info

Publication number: CN115878599A
Application number: CN202211320749.5A
Authority: CN
Inventors: 田志民; 牛豫海; 张自力; 马景春; 周晓萍; 滕国宝
Original assignee: Cangzhou Water Supply And Drainage Group Co ltd; Hebei Construction & Investment Water Investment Co ltd; Hebei Construction Investment Hengshui Water Affairs Co ltd; Korla Longrun Water Treatment Co ltd; Hebei Xiong'an Ruitian Technology Co ltd
Current assignee: Cangzhou Water Supply And Drainage Group Co ltd; Hebei Construction & Investment Water Investment Co ltd; Hebei Construction Investment Hengshui Water Affairs Co ltd; Korla Longrun Water Treatment Co ltd; Hebei Xiong'an Ruitian Technology Co ltd
Priority date: 2022-10-26
Filing date: 2022-10-26
Publication date: 2023-03-31

Abstract

The invention discloses a data cleaning method for the sewage industry, belonging to the field of data security. The method comprises the following steps: (1) receiving sewage industry data and carrying out risk investigation; (2) constructing an industry database and carrying out data quality detection; (3) constructing a data cleaning framework to clean the industry data; (4) detecting the operating efficiency of the server in real time and optimizing its performance. The invention improves the detection precision of the quality detection model and the efficiency of the parameter search, requires no manual parameter setting, and is simple to operate, which improves the user experience of workers. It also compresses memory at a large granularity, which improves compression efficiency, improves server response efficiency, and shortens the time needed for memory compression.

Description

Sewage industry data cleaning method

Technical Field

The invention relates to the field of data security, and in particular to a data cleaning method for the sewage industry.

Background

In order to simplify the sewage database, data cleaning in the sewage industry has become a major focus of the industry, and a data cleaning method for the sewage industry is therefore particularly important.

A search of the prior art shows that Chinese patent publication CN109783813A discloses a data cleaning method and system. That invention standardizes irregular industry data by combining word segmentation with Jaccard distance calculation, cleaning irregular enterprise industry data into the corresponding national-standard data and increasing the usability of the industry data. However, its detection precision is low, its parameters must be set manually, and its operating steps are complex. In addition, existing sewage industry data cleaning methods suffer from low server response efficiency and long memory compression times. A data cleaning method for the sewage industry is therefore provided.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides a method for cleaning data in the sewage industry.

In order to achieve the purpose, the invention adopts the following technical scheme:

The method for cleaning data in the sewage industry comprises the following specific steps:

(1) Receiving sewage industry data and carrying out risk investigation;

(2) Constructing an industry database and carrying out data quality detection;

(3) Constructing a data cleaning framework to clean the industry data;

(4) Detecting the operating efficiency of the server in real time and optimizing its performance.

As a further scheme of the invention, the risk investigation in the step (1) comprises the following specific steps:

Step one: the server receives the industry data, converts any non-binary data in it into binary data, and maps each group of industry data into a specified detection interval by Min-Max normalization;

Step two: the server then establishes communication connections with the virus database and the cloud virtual machine, analyzes each group of industry data, searches the virus database for entries matching the analysis result, and intercepts the corresponding industry data if a match is found;

Step three: if no match is found, the server uploads the relevant industry data to the cloud virtual machine for infection simulation, then performs virus analysis on the simulation result according to the infection criteria established by network virus definitions, and intercepts any industry data whose analysis result meets those criteria.
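
By way of illustration only, the Min-Max normalization used in step one can be sketched as follows. The function name `min_max_scale`, the default detection interval [0, 1] and the sample readings are assumptions made for this sketch; the specification does not fix them.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values onto [low, high] (Min-Max normalization)."""
    if not values:
        return []
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant series has no spread; map every value to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical readings, e.g. a concentration series in mg/L.
readings = [120.0, 180.0, 150.0, 240.0]
scaled = min_max_scale(readings)
print(scaled)
```

The smallest reading maps to the lower bound of the interval and the largest to the upper bound, so data groups with different units become comparable before the risk investigation.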

As a further scheme of the present invention, the data quality detection in step (2) specifically comprises the following steps:

Step I: constructing a quality detection model, training and optimizing it according to the data quality dimensions, and inputting the industry data into the model in sequence, whereupon the quality detection model classifies each group of industry data by enterprise;

Step II: performing feature dimension reduction on each group of industry data, retaining the feature parameters that can express the quality of the industry data and eliminating those with poor characterization capability, dividing the industry data into a training set and a test set, and standardizing the training set to generate training samples;

Step III: feeding the training samples to the quality detection model, setting the optimal model parameters according to the optimization result, training the quality detection model by a long-term iteration method, inputting the test set into the trained model, plotting and analyzing curves of data accuracy, universality, completeness and consistency, and marking and recording industry data that exhibits missing data, similar duplication, anomalies, logic errors or inconsistency.

As a further scheme of the present invention, the data quality dimensions in step I specifically include data specification, data integrity criteria, data duplication, data accuracy, consistency and synchronization, timeliness and availability, ease of use and maintainability, data coverage, expression quality, data decay, utility and understandability, relevance and credibility;

in step II, the specific formula for feature dimension reduction is as follows:

$$CV = \frac{\sigma}{\mu}$$

wherein σ represents the standard deviation of the characteristic parameter; μ represents the mean of the characteristic parameter; CV represents the coefficient of variation of the characteristic parameter: the larger the coefficient of variation, the more important the parameter, whereas a parameter with a small coefficient of variation is considered unimportant and is eliminated;

the specific formula for the standardization process in step II is as follows:

$$x' = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$

wherein x represents a selected characteristic parameter; mean(x) represents the mean of the characteristic parameter; std(x) represents the standard deviation of the characteristic parameter.
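
The feature dimension reduction and standardization of step II can be sketched as below. This is a minimal illustration, not the claimed implementation: the threshold of 0.1, the feature names, the use of the population standard deviation, and the use of the absolute value of the mean are all assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(column):
    """CV = sigma / mu for one feature column (population sigma assumed)."""
    mu = mean(column)
    if mu == 0:
        raise ValueError("CV is undefined for a zero-mean feature")
    # abs() keeps the CV non-negative for negative-mean features (an assumption).
    return pstdev(column) / abs(mu)


def select_features(table, threshold):
    """Keep features whose CV reaches the threshold; eliminate the rest."""
    return {name: col for name, col in table.items()
            if coefficient_of_variation(col) >= threshold}


def standardize(column):
    """z-score standardization: (x - mean(x)) / std(x)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical feature table: one column varies strongly, one barely at all.
features = {
    "cod_mg_l": [100.0, 200.0, 300.0],
    "ph": [7.0, 7.1, 6.9],
}
kept = select_features(features, threshold=0.1)
z = standardize(kept["cod_mg_l"])
print(list(kept), z)
```

With these sample values only the strongly varying column survives the CV filter, and its standardized values have zero mean.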

As a further scheme of the invention, the quality detection model training optimization in the step I comprises the following specific steps:

S1.1: the server receives the test data sets and data quality dimensions uploaded by workers, selects one of the N groups of test data as verification data, fits the remaining data into a test model, verifies the precision of the test model with the verification data, measures the detection capability of the test model by the root-mean-square error, repeats this N times, and performs parameter optimization on the resulting precision parameters;

S1.2: initializing the parameter range, setting the learning rate either from the system default or manually, dividing the data samples into subsets, and, for each group of data, selecting any one subset as the test set and the remaining subsets as the training set, training the test model on the training set, predicting the test set, and recording the root-mean-square error of the test result;

S1.3: then replacing the test set with another subset, taking the remaining subsets as the training set, and recording the root-mean-square error again, until every subset has been predicted once; selecting the parameter combination with the smallest root-mean-square error as the optimal parameters within the data interval, and replacing the original parameters of the quality detection model with them;

S1.4: recording each group of data examined by the quality detection model together with its detection result, and using these records to replace the original data in the test data set for subsequent parameter updates; at the same time evaluating the accuracy, detection rate and false alarm rate of the quality detection model in real time, and feeding the evaluation results back to workers for review.
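
Steps S1.2 and S1.3 describe a cross-validated parameter search scored by root-mean-square error. The sketch below illustrates that idea under stated assumptions: the specification does not name the model, so a one-dimensional ridge regression stands in for the test model, its penalty `lam` stands in for the parameter being searched, and the interleaved fold split and sample data are hypothetical.

```python
import math


def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of y ~ w * x with an L2 penalty lam (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def rmse(predicted, actual):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def k_fold_rmse(xs, ys, lam, k):
    """Mean RMSE over k folds; every sample is predicted exactly once."""
    errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        errors.append(rmse([w * xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(errors) / k


def best_parameter(xs, ys, candidates, k=3):
    """Return the candidate with the smallest cross-validated RMSE."""
    return min(candidates, key=lambda lam: k_fold_rmse(xs, ys, lam, k))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noise-free data, so the unpenalized fit is exact
best = best_parameter(xs, ys, candidates=[0.0, 1.0, 10.0])
print(best)
```

Because each subset serves once as the test set, the selected parameter reflects performance on all of the data rather than on a single split.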

As a further scheme of the present invention, the specific data cleaning steps of the data cleaning framework in step (3) are as follows:

S2.1: the data cleaning framework intercepts each group of marked industry data and classifies it as having missing data, similar duplication, anomalies, logic errors or inconsistency;

S2.2: for industry data with missing data, the framework ignores the affected records and removes the affected data attribute, then estimates the missing value by relevance using the system default value, the attribute mean and the mean of similar samples, and fills in the estimate;

S2.3: for similar and repeated industry data, the framework selects key attribute fields and assigns each key attribute a weight according to its importance in expressing the characteristics of the record, so that the key fields express those characteristics more accurately; it then selects an attribute-field matching-degree algorithm to perform a secondary detection on the marked similar and repeated industry data, cleans that data according to the set cleaning rules, stores any data that cannot be processed automatically in a log table, and provides a corresponding cleaning result report;

S2.4: for abnormal data, the framework clusters similar industry data, treats values falling outside the clusters as isolated points and removes them, then bins the remaining abnormal data and smooths it by the bin means;

S2.5: for industry data with logic errors, the framework calls the relevant rules from the logic definition library to process the erroneous attribute values; if no suitable processing rule exists, the data is stored in the log table for manual processing; for inconsistent industry data, the framework cleans the data by transforming, formatting or summarizing it.
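
The weighted key-field matching of step S2.3 might look like the following sketch. The specification does not name the matching-degree algorithm, so the character-level ratio from Python's `difflib.SequenceMatcher` is used as a stand-in; the field names, weights and the two thresholds (`drop_at`, `review_at`) are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Weighted mean of per-field similarities over the key attribute fields."""
    total = sum(weights.values())
    score = sum(w * field_similarity(str(rec_a[f]), str(rec_b[f]))
                for f, w in weights.items())
    return score / total


def clean_similar_duplicates(records, weights, drop_at=0.95, review_at=0.75):
    """Drop confirmed duplicates; send borderline records to a log table."""
    kept, log_table = [], []
    for rec in records:
        best = max((record_similarity(rec, k, weights) for k in kept), default=0.0)
        if best >= drop_at:
            continue  # confirmed similar duplicate: cleaned automatically
        if best >= review_at:
            log_table.append(rec)  # cannot be decided automatically
        else:
            kept.append(rec)
    return kept, log_table


weights = {"plant": 5, "date": 3, "cod": 2}  # hypothetical importance weights
a = {"plant": "Cangzhou No.1 Plant", "date": "2022-10-01", "cod": "35.2"}
b = {"plant": "cangzhou no.1 plant", "date": "2022-10-01", "cod": "35.2"}
c = {"plant": "Hengshui East Plant", "date": "2022-10-02", "cod": "41.0"}
d = {"plant": "Cangzhou No.1 Plant", "date": "2022-10-01", "cod": "n/a"}
kept, log_table = clean_similar_duplicates([a, b, c, d], weights)
print(len(kept), len(log_table))
```

Here record `b` differs from `a` only in letter case and is dropped, `c` is a distinct record and is kept, and `d` matches `a` on the heavily weighted fields but not on the measurement, so it is logged for manual review.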

As a further scheme of the present invention, the specific steps of the server performance optimization in step (4) are as follows:

P1: a performance optimization framework inside the server generates a start linked list for each port of the server, and the heads of the start linked lists are further linked in the LRU order of the ports: the framework collects the information of the port with the lowest interaction frequency, places that port's start linked list at the head of the LRU linked list, and orders the remaining lists in sequence;

P2: before a port starts, the access bits of all updated page table entries are cleared; before the port start time ends, the performance optimization framework rechecks the access bits of all pages and, once the check is complete, updates the data of each group of pages in the start linked list; the least active ports are then selected in sequence from the head of the LRU linked list, and victim pages are selected from the corresponding start linked lists until enough pages have been obtained;

P3: the selected victim pages are merged into a block and the block is marked; the compression driver is woken up to parse the marked block and obtain the physical pages belonging to it; the physical pages are copied into a buffer, a compression algorithm is called to compress the physical pages in the buffer into a compressed block, and the compressed block is stored in the performance optimization framework area.
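
Steps P1 to P3 can be illustrated in user space with the simplified sketch below. It is not kernel code and does not model access bits: an `OrderedDict` stands in for the LRU linked list of ports, a Python list stands in for each port's start linked list, and `zlib` stands in for the unnamed compression algorithm. The page size, port names and page contents are assumptions.

```python
import zlib
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


def select_victims(ports, needed):
    """Walk ports from least to most recently used and collect victim pages."""
    victims = []
    for pages in ports.values():
        while pages and len(victims) < needed:
            victims.append(pages.pop(0))
        if len(victims) >= needed:
            break
    return victims


def compress_block(victims):
    """Merge victim pages into one block and compress it in a single call."""
    block = b"".join(victims)  # plays the role of the buffer area
    return zlib.compress(block)


ports = OrderedDict()
ports["port-8080"] = [bytes([1]) * PAGE_SIZE, bytes([2]) * PAGE_SIZE]
ports["port-443"] = [bytes([3]) * PAGE_SIZE]
ports.move_to_end("port-8080")  # port-8080 was just used, so it is most recent

victims = select_victims(ports, needed=2)
compressed = compress_block(victims)
print(len(victims), len(compressed))
```

Compressing the merged block in one call, rather than page by page, is what the specification refers to as large-granularity compression.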

Compared with the prior art, the invention has the beneficial effects that:

1. Compared with conventional cleaning methods, this sewage industry data cleaning method constructs a quality detection model and calculates and selects the model parameters from the test data set and data quality dimensions uploaded by workers. It then performs feature dimension reduction on each group of industry data, retaining the feature parameters that can express the quality of the industry data and eliminating those with poor characterization capability, divides the industry data into a training set and a test set, and standardizes the training set to generate training samples. The training samples are fed to the quality detection model, the optimal model parameters are set according to the optimization result, the model is trained by a long-term iteration method, the test set is input into the trained model, and curves of data accuracy, universality, completeness and consistency are plotted. This improves the detection precision of the quality detection model and the efficiency of the parameter search, removes the need for manual parameter setting, keeps the operating process simple, and improves the user experience of workers;

2. In this sewage industry data cleaning method, a performance optimization framework inside the server generates a start linked list for each port of the server, links the heads of the start linked lists in the LRU order of the ports, and orders the start linked lists of the ports by increasing interaction frequency. Before the port start time ends, the data of each group of pages in the start linked list is updated; the least active ports are selected in sequence from the head of the LRU linked list, and victim pages are selected from the corresponding start linked lists. The selected victim pages are merged into a block and marked, the compression driver is woken up to parse the marked block and obtain the physical pages belonging to it, the physical pages are copied into a buffer, and a compression algorithm is called to compress them into a compressed block, which is stored in the performance optimization framework area. Compression is thus performed at a large granularity, which improves compression efficiency, improves server response efficiency, and shortens the time needed for memory compression.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a block diagram of a process of a sewage industry data cleaning method according to the present invention.

Detailed Description

Example 1

Referring to fig. 1, the method for cleaning the data in the sewage industry comprises the following specific steps:

(1) Receiving sewage industry data and carrying out risk investigation.

Specifically, the server receives the industry data, converts any non-binary data in it into binary data, and maps each group of industry data into a specified detection interval by Min-Max normalization. The server then establishes communication connections with the virus database and the cloud virtual machine, analyzes each group of industry data, and searches the virus database for entries matching the analysis result. If a match is found, the corresponding industry data is intercepted. If no match is found, the relevant industry data is uploaded to the cloud virtual machine for infection simulation; the server then performs virus analysis on the simulation result according to the infection criteria established by network virus definitions and intercepts any industry data whose analysis result meets those criteria.
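
The first-stage comparison against the virus database can be pictured as a signature lookup. The specification does not state how the comparison is made, so the SHA-256 digest match below is purely an assumption, as are the function names and the sample payloads; the cloud virtual machine stage is represented only by the returned label.

```python
import hashlib


def sha256_hex(payload):
    """Hex digest used here as a stand-in for a virus signature."""
    return hashlib.sha256(payload).hexdigest()


def screen(payload, known_bad):
    """Return 'intercept' on a signature hit, else 'sandbox' for simulation."""
    return "intercept" if sha256_hex(payload) in known_bad else "sandbox"


# Hypothetical signature set standing in for the virus database.
known_bad = {sha256_hex(b"malicious-sample")}

print(screen(b"malicious-sample", known_bad))
print(screen(b"ph,7.1", known_bad))
```

Data that matches a known signature is intercepted immediately, and everything else is passed on for infection simulation, which mirrors the two-stage flow described above.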

(2) Constructing an industry database and carrying out data quality detection.

Specifically, a quality detection model is constructed, trained and optimized according to the data quality dimensions, and the industry data is then input into the model in sequence, whereupon the quality detection model classifies each group of industry data by enterprise. Feature dimension reduction is performed on each group of industry data, retaining the feature parameters that can express the quality of the industry data and eliminating those with poor characterization capability. The industry data is divided into a training set and a test set, and the training set is standardized to generate training samples. The training samples are fed to the quality detection model, the optimal model parameters are set according to the optimization result, and the model is trained by a long-term iteration method. The test set is then input into the trained model, curves of data accuracy, universality, completeness and consistency are plotted and analyzed, and industry data that exhibits missing data, similar duplication, anomalies, logic errors or inconsistency is marked and recorded.

The quality detection model is trained and optimized as follows. The server receives the test data sets and data quality dimensions uploaded by workers, selects one of the N groups of test data as verification data, fits the remaining data into a test model, verifies the precision of the test model with the verification data, and measures the detection capability of the test model by the root-mean-square error; this is repeated N times, and parameter optimization is performed on the resulting precision parameters. The parameter range is initialized, the learning rate is set either from the system default or manually, and the data samples are divided into subsets. Any one subset is selected as the test set and the remaining subsets as the training set; the test model is trained on the training set, the test set is predicted, and the root-mean-square error of the test result is recorded. The test set is then replaced with another subset, the remaining subsets are taken as the training set, and the error is recorded again, until every subset has been predicted once. The parameter combination with the smallest root-mean-square error is selected as the optimal parameters within the data interval and replaces the original parameters of the quality detection model. Each group of data examined by the quality detection model is recorded together with its detection result and used to replace the original data in the test data set for subsequent parameter updates, and the accuracy, detection rate and false alarm rate of the quality detection model are evaluated in real time.
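
The accuracy, detection rate and false alarm rate mentioned above are not defined in the specification. The sketch below assumes their usual confusion-matrix definitions (detection rate as true-positive rate, false alarm rate as false-positive rate); the function name and the sample labels are hypothetical.

```python
def detection_metrics(predicted, actual):
    """Accuracy, detection rate (TPR) and false alarm rate (FPR) from boolean flags."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# True means "this record has a quality problem".
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
metrics = detection_metrics(predicted, actual)
print(metrics)
```

For these sample labels the model finds two of the three problem records and raises one false alarm among five clean records.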

In this embodiment, the data quality dimension specifically includes data specification, data integrity criteria, data duplication, data accuracy, consistency and synchronization, timeliness and availability, ease of use and maintainability, data coverage, expression quality, data decay, utility, and understandability, relevance, and credibility.

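It should be further explained that the specific formula for feature dimension reduction is as follows:

$$CV = \frac{\sigma}{\mu}$$

where σ and μ are the standard deviation and mean of the characteristic parameter and CV is its coefficient of variation.

The specific formula for the standardization process is as follows:

$$x' = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$

where x is a characteristic parameter, and mean(x) and std(x) are its mean and standard deviation.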

Example 2

Referring to fig. 1, the method for cleaning data in sewage industry comprises the following steps:

(3) Constructing a data cleaning framework to clean the industry data.

Specifically, the data cleaning framework intercepts each group of marked industry data and classifies it as having missing data, similar duplication, anomalies, logic errors or inconsistency. For industry data with missing data, the framework ignores the affected records and removes the affected data attribute, then estimates the missing value by relevance using the system default value, the attribute mean and the mean of similar samples, and fills in the estimate. For similar and repeated industry data, the framework selects key attribute fields and assigns each key attribute a weight according to its importance in expressing the characteristics of the record, so that the key fields express those characteristics more accurately; it then selects an attribute-field matching-degree algorithm to perform a secondary detection on the marked similar and repeated industry data, cleans that data according to the set cleaning rules, stores any data that cannot be processed automatically in a log table, and provides a corresponding cleaning result report. For abnormal data, the framework clusters similar industry data, treats values falling outside the clusters as isolated points and removes them, then bins the remaining abnormal data and smooths it by the bin means. For industry data with logic errors, the framework calls the relevant rules from the logic definition library to process the erroneous attribute values; if no suitable processing rule exists, the data is stored in the log table for manual processing. For inconsistent industry data, the framework cleans the data by transforming, formatting or summarizing it.
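
Two of the operations in this paragraph, filling a missing value with the attribute mean and smoothing by bin means, can be sketched as follows. The equal-depth binning, the bin size of three and the sample values are assumptions; the specification does not state how the bins are formed.

```python
from statistics import mean


def fill_missing(values, default=None):
    """Replace None with a given default, or with the attribute mean if none is given."""
    present = [v for v in values if v is not None]
    fill = default if default is not None else mean(present)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Equal-depth binning: sort, split into bins, replace each value by its bin mean.

    The result is returned in sorted order, not in the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


filled = fill_missing([4.0, None, 8.0])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
print(filled, smoothed)
```

Replacing each value with the mean of its bin removes local noise while preserving the overall level of each bin.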

(4) Detecting the operating efficiency of the server in real time and optimizing its performance.

Specifically, a performance optimization framework inside the server generates a start linked list for each port of the server, and the heads of the start linked lists are further linked in the LRU order of the ports: the framework collects the information of the port with the lowest interaction frequency, places that port's start linked list at the head of the LRU linked list, and orders the remaining lists in sequence. Before a port starts, the access bits of all updated page table entries are cleared. Before the port start time ends, the performance optimization framework rechecks the access bits of all pages and, once the check is complete, updates the data of each group of pages in the start linked list. The least active ports are then selected in sequence from the head of the LRU linked list, and victim pages are selected from the corresponding start linked lists until enough pages have been obtained. The selected victim pages are merged into a block and marked, and the compression driver is woken up to parse the marked block and obtain the physical pages belonging to it. The physical pages are copied into a buffer, a compression algorithm is called to compress them into a compressed block, and the compressed block is stored in the performance optimization framework area.

Claims

1. The method for cleaning the data in the sewage industry is characterized by comprising the following specific steps:

(1) Receiving sewage industry data and carrying out risk investigation;

(2) Constructing an industry database and carrying out data quality detection;

(3) Constructing a data cleaning framework to clean the industry data;

(4) Detecting the operating efficiency of the server in real time and optimizing its performance.

2. The sewage industry data cleaning method according to claim 1, wherein the risk investigation in step (1) comprises the following specific steps:

Step three: if no match is found, the server uploads the relevant industry data to a cloud virtual machine for infection simulation, then performs virus analysis on the simulation result according to the infection criteria established by network virus definitions, and intercepts any industry data whose analysis result meets those criteria.

3. The sewage industry data cleaning method according to claim 1, wherein the data quality detection in step (2) specifically comprises the following steps:

Step II: performing feature dimension reduction on each group of industry data, retaining the feature parameters that can express the quality of the industry data and eliminating those with poor characterization capability, dividing the industry data into a training set and a test set, and standardizing the training set to generate training samples;

Step III: feeding the training samples to the quality detection model, setting the optimal model parameters according to the optimization result, training the quality detection model by a long-term iteration method, inputting the test set into the trained model, plotting and analyzing curves of data accuracy, universality, completeness and consistency, and marking and recording industry data that exhibits missing data, similar duplication, anomalies, logic errors or inconsistency.

4. The sewage industry data cleaning method according to claim 3, wherein the data quality dimensions in step I specifically include data specification, data integrity criteria, data duplication, data accuracy, consistency and synchronization, timeliness and availability, ease of use and maintainability, data coverage, expression quality, data decay, utility and understandability, relevance and credibility;

in step II, the specific formula for feature dimension reduction is as follows:

$$CV = \frac{\sigma}{\mu}$$

wherein σ represents the standard deviation of the characteristic parameter; μ represents the mean of the characteristic parameter; CV represents the coefficient of variation of the characteristic parameter: the larger the coefficient of variation, the more important the parameter, whereas a parameter with a small coefficient of variation is considered unimportant and is eliminated;

5. The sewage industry data cleaning method according to claim 3, wherein the quality detection model training optimization in step I specifically comprises the following steps:

S1.4: recording each group of data examined by the quality detection model together with its detection result, and using these records to replace the original data in the test data set for subsequent parameter updates; at the same time evaluating the accuracy, detection rate and false alarm rate of the quality detection model in real time, and feeding the evaluation results back to workers for review.

6. The sewage industry data cleaning method according to claim 3, wherein the data cleaning framework in step (3) comprises the following specific data cleaning steps:

S2.1: the data cleaning framework intercepts each group of marked industry data and classifies it as having missing data, similar duplication, anomalies, logic errors or inconsistency;

S2.2: for industry data with missing data, the framework ignores the affected records and removes the affected data attribute, then estimates the missing value by relevance using the system default value, the attribute mean and the mean of similar samples, and fills in the estimate;

S2.3: for similar and repeated industry data, the framework selects key attribute fields and assigns each key attribute a weight according to its importance in expressing the characteristics of the record, so that the key fields express those characteristics more accurately; it then selects an attribute-field matching-degree algorithm to perform a secondary detection on the marked similar and repeated industry data, cleans that data according to the set cleaning rules, stores any data that cannot be processed automatically in a log table, and provides a corresponding cleaning result report;

S2.4: for abnormal data, the framework clusters similar industry data, treats values falling outside the clusters as isolated points and removes them, then bins the remaining abnormal data and smooths it by the bin means;

7. The sewage industry data cleaning method according to claim 1, wherein the server performance optimization in step (4) specifically comprises the following steps: