CN113934720A - Data cleaning method and equipment and computer storage medium

Data cleaning method and equipment and computer storage medium

Info

Publication number
CN113934720A
CN113934720A (application CN202111212807.8A)
Authority
CN
China
Prior art keywords
data
abnormal
cleaning
sequence
sensing
Prior art date
Legal status
Pending
Application number
CN202111212807.8A
Other languages
Chinese (zh)
Inventor
阮安邦
李飞
张晓东
魏明
陈旭明
Current Assignee
Beijing Octa Innovations Information Technology Co Ltd
Original Assignee
Beijing Octa Innovations Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Octa Innovations Information Technology Co Ltd filed Critical Beijing Octa Innovations Information Technology Co Ltd
Priority to CN202111212807.8A priority Critical patent/CN113934720A/en
Publication of CN113934720A publication Critical patent/CN113934720A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection

Abstract

The invention relates to a data cleaning method, equipment and a computer storage medium, comprising at least the following steps: acquiring sensing data to be cleaned; filtering first abnormal data in a data sequence of the sensing data through a preset data threshold; carrying out secondary verification on the first abnormal data and screening out second abnormal data generated by transmission network fluctuation; and supplementing the second abnormal data back into the processed data sequence, so that the cleaned sensing data is reconstructed and at least part of the missing data sequence is recovered. The application distinguishes abnormal data caused by external intrusion or faults from abnormal data caused by instability of energy supply parameters or network transmission parameters, so that sufficiently accurate and more comprehensive sensing data are obtained; the data loss caused by data cleaning is reduced, while the risk of communication security being damaged by abnormal intrusion is effectively reduced and the safety of the data stored in the database is guaranteed.

Description

Data cleaning method and equipment and computer storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data cleaning method and apparatus, and a computer storage medium.
Background
Data cleansing is the process of filtering and screening collected data and then rechecking and verifying it. Its purpose is to delete repeated data and correct erroneous data in the data collected by one or more collection terminals, and it can also carry out consistency checks on the collected data so as to handle invalid values and missing values in a data set. The objects of filtering and screening may include data with abnormal feature values, data with too many missing feature values, data unrelated to the application scenario, and so on. Algorithms such as Gaussian methods can be used for detecting abnormal features of data. With regard to the communication security problems of current concern, data cleaning can also screen and isolate abnormal data uploaded or captured by the acquisition terminal, so that the storage security of the existing database is protected while the uploading of abnormal data such as malicious intrusions is effectively prevented.
In modern offices and production sites, the functional requirements placed on buildings are increasing, including requirements on the space environment, requirements on production equipment and the like. To meet these requirements, various mechanical and electronic devices are installed inside buildings, and these devices also need automatic control capability so that they can maintain an optimal operating state to improve work efficiency and service quality. The embedded operating system involved in a building can acquire, monitor and control the operating-condition data of the electromechanical equipment in the whole building, and the operating system can perform information processing, data calculation, data analysis, logic judgment, image recognition and the like on the acquired raw data information, so that the efficient and safe operation and management of the various electromechanical devices are guaranteed.
Chinese patent CN112084178A discloses a data cleaning method, system, data cleaning device and readable storage medium, wherein the data cleaning method comprises: embedding a data cleaning device into industrial equipment, wherein cleaning strategies for cleaning different types of equipment data are integrated in the data cleaning device; acquiring the equipment data to be cleaned generated by the industrial equipment and transmitting the equipment data to the data cleaning device for data cleaning; the data cleaning device cleans the different types of equipment data according to a preset cleaning strategy; and exporting and storing the cleaned equipment data. In that method, data cleaning strategies for different types of equipment data are integrated in the data cleaning device and stored in the form of code blocks or configuration files, and a user only needs to configure the data cleaning device into the industrial equipment so that the equipment data can be cleaned automatically, which simplifies data cleaning and improves its efficiency. However, that patent cannot screen out data abnormalities caused by transient abnormalities of the transmission network of the industrial equipment and the like, cannot provide effective and accurate data cleaning, and may mistakenly delete data that contains no threat.
Current data cleaning schemes usually filter out, in one pass, all abnormal data information that exceeds the data threshold set by the standard. Although this can effectively reduce the risk of communication security being damaged by abnormal intrusion and effectively ensure the security of the data stored in the database, it often also filters out abnormal data caused by the unstable working state of some acquisition and transmission equipment. Although such abnormal data may exceed the existing data threshold, the data information it actually records is valuable in itself and carries no intrusion risk. Therefore, in the actual data cleaning process, it is necessary to accurately study and judge such abnormal data and effectively distinguish it from abnormal data caused by external intrusion or faults, so that sufficiently accurate and more comprehensive raw data information can be obtained for the related data system.
Furthermore, on the one hand there are differences in understanding among those skilled in the art; on the other hand, the inventor studied a large number of documents and patents when making the present invention, but space does not permit all details and contents to be listed above. This by no means implies that the present invention lacks these features of the prior art; on the contrary, the present invention already possesses all the features of the prior art, and the applicant reserves the right to add related prior art to the background.
Disclosure of Invention
Aiming at the defects of the prior art, the technical scheme of the invention provides a data cleaning method, which at least comprises the following steps:
s1: acquiring sensing data to be cleaned;
s2: filtering first abnormal data in a data sequence of the sensing data through a preset data threshold;
s3: performing secondary verification on the first abnormal data, and screening out second abnormal data generated by fluctuation of construction parameters of the data transmission channel;
s4: performing connectivity splicing on the data sequence supplementing the second abnormal data; wherein the content of the first and second substances,
the second abnormal data can be supplemented with the processed data sequence, so that the cleaned sensing data is reconstructed and at least part of the missing data sequence is recovered. The method has the advantages that abnormal data caused by instability of energy supply parameters/network transmission parameters can be screened out, the part of abnormal data exceeds the existing data threshold value, but relevant data information actually recorded by the part of abnormal data is valuable and has no invasion risk, so that the abnormal data is effectively distinguished from abnormal data caused by external invasion or faults in the data cleaning process, and the accurate and comprehensive original data information can be acquired for a relevant data system.
According to a preferred embodiment, the first abnormal data is verified by comparing whether abnormal sensing data at the same time exists in other data acquisition units on the same acquisition network or acquisition branch to which the data acquisition unit acquiring the sensing data belongs. The advantage of this is that non-threatening abnormal data is extracted and re-supplemented into the data sequence, which effectively improves the integrity and accuracy of the sensing data and facilitates information interaction between the equipment and the database.
According to a preferred embodiment, the performing of the secondary verification on the first abnormal data further includes determining whether the sensed data acquired by other data acquisition units in the same communication transmission network of the data acquisition unit corresponding to the first abnormal data at the same time is abnormal, and using the verification result as the screening condition of the second abnormal data.
According to a preferred embodiment, the data acquisition unit can upload the acquired sensing data to a sampling database of the data cleaning unit through a network transmission channel, so that the data cleaning unit obtains a data sequence corresponding to the sensing data; the data cleaning unit selectively formulates different cleaning strategies according to the service scenario and the analysis rules, and completes reconstruction of the sensing data and recovery of at least part of the missing data through the selected cleaning strategy.
According to a preferred embodiment, the step S2 of filtering the abnormal data in the data sequence based on the time domain characteristics of the sensing data at least includes:
s201: setting a data threshold value of a critical point of the variation degree of the segmentation data according to the variation of the periodically acquired sensing data on a time axis;
s202: screening out a data sequence with abnormal data according to a data threshold value, and dividing the data sequence acquired in a single period into a plurality of data segments;
s203: and screening abnormal data in at least one data segment according to the data threshold value.
According to a preferred embodiment, the data sequence acquired in a single period is divided into a plurality of data segments by a preset unit time length, wherein different data segments on the same data sequence do not overlap with each other.
According to a preferred embodiment, the rules of the data cleansing policy are defined according to the rule results of the data analysis, and the data cleaning unit executes predefined analysis and inspection rules on the data objects, reports or raises an alarm to identify abnormal data, and performs a data cleaning task on the data objects after the abnormal data has been captured.
The application also provides a data cleaning device, which at least comprises a data acquisition unit and a data cleaning unit, wherein the data acquisition unit acquires the sensing data to be cleaned and uploads it to the data cleaning unit for cleaning; the data cleaning unit filters first abnormal data in a data sequence of the sensing data through a preset data threshold, can perform secondary verification on the first abnormal data to screen out second abnormal data generated by fluctuation of the construction parameters of the data transmission channel, and performs connectivity splicing on the data sequence supplemented with the second abnormal data. The advantage of the device is that abnormal data caused by instability of energy supply parameters or network transmission parameters can be screened out: although this part of the abnormal data exceeds the existing data threshold, the data information it actually records is valuable and carries no intrusion risk, so it is effectively distinguished from abnormal data caused by external intrusion or faults during data cleaning, and sufficiently accurate and comprehensive raw data information can be obtained for the related data system.
According to a preferred embodiment, the data cleaning unit can take the first abnormal data remaining after the second abnormal data has been screened out as third abnormal data, and send information of the data acquisition unit that acquired the third abnormal data to the early warning module.
The application also provides a computer storage medium, which is used for storing program data and archives of the various sensor detection data uploaded by the data acquisition unit and processed by the data cleaning unit.
Drawings
FIG. 1 is a schematic workflow diagram of a preferred embodiment of a data cleansing method, apparatus and computer storage medium of the present invention;
FIG. 2 is a schematic diagram of a data cleansing flow of a preferred embodiment of a data cleansing method, apparatus and computer storage medium according to the present invention;
FIG. 3 is a schematic structural diagram of a data cleaning device of a preferred embodiment of the data cleaning method, apparatus and computer storage medium of the present invention.
List of reference numerals
1: a data acquisition unit; 2: a data cleaning unit; 3: an early warning module; 4: data cleaning equipment; 41: a processor; 42: a memory; 43: an input-output device; 44: a bus.
Detailed Description
The following detailed description is made with reference to the accompanying drawings.
A data cleaning device for embedded operation oriented to secure communication comprises a data acquisition unit 1 and a data cleaning unit 2. The data acquisition unit 1 may be any of various data-information acquisition terminal devices that are suited to different application scenarios and connected to a communication network. The data information collected by the data acquisition unit 1 can be transmitted to the data cleaning unit 2 for processing in a communication mode. As a data-information acquisition terminal device, the data acquisition unit 1 can acquire certain sensing data associated with it, thereby obtaining sampled data of the sensing data. The data acquisition unit 1 can upload the acquired sensing data to a sampling database of the data cleaning unit 2 through a network transmission channel, so that the data cleaning unit 2 obtains a data sequence corresponding to the sensing data. The data cleaning unit 2 can clean the data sequences stored in the sampling database, so as to obtain reliable and accurate data and provide basic data for subsequent applications or analyses of related systems.
As shown in fig. 1, the data from a plurality of different data acquisition units 1 in the same region can be stored in a classified manner at different positions of the same sampling database of the data cleaning unit 2. Preferably, the data acquisition unit 1 and the data cleaning unit can be connected through a wireless or wired data network. Preferably, the sensing data collected by the data acquisition unit 1 can be any sensing data with monitoring requirements, such as temperature, monitoring video, operating parameters, and the like. Preferably, the data acquisition unit 1 may be a sensor or an acquisition module capable of acquiring any single type of sensing data referred to above. For example, if the sensing data is temperature, the data acquisition unit may be a wearable device or an embedded device with a temperature sensor; if the sensing data is an image of a certain space, the data acquisition unit can be a camera unit capable of continuously recording video of a certain area. Preferably, a plurality of data acquisition units 1 arranged within a certain spatial range to collect various sensing data of the same object can be connected through the same transmission network, and the sensing data can be uploaded and stored in a classified manner through the same data communication channel or through different data communication channels under the same network. Preferably, the data cleaning unit 2 can be any device with data storage and data processing functions, such as a server or a computer connected to the data network and capable of receiving the data collected by the data acquisition unit 1.
Example 1
Data cleaning is the main way of improving the data quality of the system. The system provides general data cleaning modes, and the cleaning methods mainly include duplicate removal, missing-value filling, date standardization, dictionary standardization, data desensitization and the like. The user can select a corresponding mode to clean the data, or customize a data cleaning mode, according to the analysis and detection results. Data cleaning is generally performed as a job constructed by the user, and a cleaning job must include a data input source, a data output source and cleaning conversion rules. A general data cleaning method is designed and implemented for the common quality problems addressed by data cleaning. Data cleaning first requires analysing the causes of abnormal data; a cleaning strategy is formulated by combining the service scenario and the analysis rules, and the cleaning method is then executed to improve the data quality. The user can clean the data directly on the basis of the raw data, or, referring to the data analysis report, adopt a corresponding cleaning strategy for the data quality problem to remove abnormal data, thereby improving the data quality and obtaining data that meets the service requirements. Preferably, a cleaning method may be packaged into a cleaning component. Data cleaning is presented in the form of a job, and a cleaning job comprises an input component, cleaning components and an output component. The input component is mainly used for configuring the information of the cleaning object, the cleaning components encapsulate various cleaning methods, and different cleaning components are selected for different data quality problems. The output component provides a built data model into which the cleaned data is written. When data cleaning is implemented, the user can select a corresponding cleaning mode or a user-defined cleaning mode according to the analysis and detection results.
As shown in fig. 2, the data cleansing rules may be defined according to the rule results of the data analysis, and an appropriate cleaning method is selected according to those results for data cleaning. Predefined analysis and inspection rules are executed on the data objects, abnormal data is reported or alarmed, and a data cleaning task is performed on the abnormal data after it has been captured. One cleaning task, i.e. a cleaning job, comprises data input, cleaning conversion rules and data output. After the cleaning object is determined, the data cleaning job is constructed: first the data input component information is configured, then a cleaning component is selected and the cleaning conversion rules are defined, and finally the data output model is created and an output component is selected to configure the output information. The data output model table is mainly used for writing the data after cleaning; its main purposes are to prevent the source data from being overwritten and to allow later detection of whether the output object reaches the cleaning target. After the cleaning job is constructed, the predefined cleaning conversion rules are executed, the detected abnormal data is corrected, and the data quality is improved. Whether the data output object reaches the cleaning target is checked periodically, and the clean data is returned to the target data source after the cleaning target is reached.
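As an illustrative, non-limiting sketch of the job structure described above, a cleaning job could be described by a configuration of the following kind. All component names, field names and values here are assumptions for illustration only and are not an API defined by this application.

```python
# A minimal, hypothetical sketch of a cleaning job: input component,
# cleaning components with conversion rules, and an output model.
cleaning_job = {
    "input": {                      # configure information of the cleaning object
        "source": "sampling_db",
        "table": "sensor_raw",
    },
    "cleaning_components": [        # each component encapsulates one cleaning method
        {"method": "deduplicate", "key_columns": ["col_id"]},
        {"method": "fill_null", "strategy": "mean", "columns": ["temperature"]},
        {"method": "dictionary_standardize", "dictionary_table": "dict_units"},
    ],
    "output": {                     # a separate output model so the source data is never overwritten
        "target": "sensor_clean",
        "check_interval_s": 3600,   # periodically check whether the cleaning target is reached
    },
}
```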
The cleaning methods are developed in a modular way and the system can extend its cleaning components: new components can be developed and integrated into the system, and the general cleaning methods mainly include data deduplication, null-value filling, data desensitization, dictionary standardization and date standardization.
(1) Data deduplication. Information often exists in multiple records that represent the same object; duplicate data exists after the system is accessed, and sometimes records represent the same object even though individual fields in them differ. Data deduplication mainly detects similar duplicate data and removes the duplicates. The system detects and removes duplicate data based on a distributed data set, aiming to achieve an accurate deduplication effect with as few resources as possible. The data deduplication component is mainly realised with Spark RDD operators and implements partition-sorted deduplication based on the combineByKey operator provided by Spark. The data set is read and all elements are traversed; combineByKey() groups them by the key value of each element, mergeValue() accumulates identical records within the same partition during the traversal, several groups of identical record sets are formed under each key (Col_id) in a map, a shuffle is used for merging and sorting, and the values of different partitions are merged and accumulated. Finally, when the merged result set is traversed, one record from each group of duplicates is selected to realise deduplication, and the deduplicated data is written into storage.
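A rough PySpark sketch of this kind of key-based deduplication is shown below. The sample records and key names are assumptions for illustration; only one record per key is kept, which corresponds to the last step described above.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dedup-sketch")

# Hypothetical records as (col_id, record) pairs; duplicates share the same key.
records = sc.parallelize([
    ("row-001", {"name": "pump-A", "temp": 41.2}),
    ("row-001", {"name": "pump-A", "temp": 41.2}),   # duplicate record
    ("row-002", {"name": "pump-B", "temp": 39.8}),
])

deduped = records.combineByKey(
    lambda rec: rec,            # createCombiner: keep the first record seen in a partition
    lambda kept, _new: kept,    # mergeValue: later duplicates in the same partition are dropped
    lambda kept, _other: kept,  # mergeCombiners: duplicates across partitions are dropped in the shuffle
).values()

print(deduped.collect())
sc.stop()
```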
(2) Null value filling. For null values, empty strings or empty records caused by collection and processing errors or machine damage, the system handles the data-missing problem through null value filling. Null value filling uses some method to determine a reasonable estimate for the missing value in a data record and then fills it in. For null values or empty strings in data records, the system provides a variety of processing strategies: the user can fill with a constant value of the same attribute, select the mean, mode or median as a substitute for the missing value, or fill with a random value from the same attribute column.
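A minimal pandas sketch of these filling strategies is given below; the column name and values are illustrative assumptions, not data from this application.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"temperature": [41.2, None, 39.8, None, 40.5]})
col = df["temperature"]

filled_constant = col.fillna(0.0)                 # same-attribute constant value
filled_mean     = col.fillna(col.mean())          # mean as substitute for the missing value
filled_median   = col.fillna(col.median())        # median
filled_mode     = col.fillna(col.mode().iloc[0])  # mode

# Random value drawn from the existing (non-null) values of the same attribute column.
rng = np.random.default_rng(0)
random_values = pd.Series(rng.choice(col.dropna().to_numpy(), size=len(col)), index=col.index)
filled_random = col.fillna(random_values)

print(filled_mean.tolist())
```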
(3) Data desensitization. For data sharing or exchange scenarios, the system designs and implements a mechanism for hiding and protecting sensitive data information. The system mainly performs desensitization on data of numerical and character types, and the data desensitization algorithm erases sensitive data by means of replacement, so that the sensitive data is hidden.
When a data desensitization operation is carried out, the user needs to configure the data object to be desensitized, the desensitization strategy, the desensitization range and the replacement value. When the system processes the numerical type, the value is first converted into the corresponding character type and then processed uniformly as a character type. The desensitization component information is first initialised, the relevant parameters and exception parameters are processed, and replacement is then carried out according to the different desensitization strategies. Regular-expression desensitization performs expression matching on the data in the desensitization range through a regular expression and replaces the matched part with a specified character string; if the matching fails, the result is unchanged. The system's default desensitization processing is a hash method, in which the data requiring desensitization is replaced by the corresponding hash value. The user may also specify a constant value for replacement.
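A hedged sketch of the two replacement strategies described above is given below: a regular expression masks the matched part of each value, and a hash replacement is used as the default. The sample records, the pattern and the truncation of the hash are assumptions for illustration only.

```python
import hashlib
import re

records = ["user: 13812345678", "user: 13987654321", "no phone here"]

# Regular-expression desensitization: replace the matched part with a specified string;
# if nothing matches, the value is left unchanged.
masked = [re.sub(r"\d{11}", "***********", r) for r in records]

# Default (hash) desensitization: replace the sensitive value with its hash value.
def hash_desensitize(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

hashed = [hash_desensitize(r) for r in records]
print(masked)
print(hashed)
```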
(4) Standardization, namely date standardization and dictionary standardization. Date standardization, as the name implies, formats a date into its designated standard date format. The logic by which the system implements date standardization is simple: the date values in the data are standardized according to the format specified by the user. Dictionary standardization is mainly realised by defining a standardized dictionary to perform mapping replacement on specified data columns in a data table, and a standardized dictionary table must be established before dictionary standardization cleaning can be carried out. The main purpose of dictionary cleaning is to standardize, according to the dictionary values, those values that do not conform to the standard. Before dictionary cleaning, a dictionary table needs to be established; the dictionary table is defined and described uniformly using the terminology of the related business sectors of the system, public data information of the same kind with constant values, or data information recognised within the system. After the dictionary table is established, the data objects are associated, the mapping between data elements and dictionary values is configured, and finally the standardized cleaning is carried out according to the dictionary mapping rules.
When the dictionary table information is established, the dictionary table needs to contain the dictionary code, the original value information, the dictionary value information and the corresponding mapping. It should be noted that if a dictionary corresponding to the dictionary list needs to be established when the associated dictionary table is selected, it is necessary, when the cleaning task is performed, to check first whether the attribute column is associated with a dictionary table rule. When associating rules, the dictionary rule data set to be configured is selected, then a table object under that data set, and then the table field for which the dictionary rule needs to be configured.
For a created dictionary cleaning job, it must first be checked whether a dictionary table exists. If no associated dictionary table exists, a dictionary table corresponding to the column object must be established first and the dictionary rule configured. If the dictionary table exists, the dictionary table information is first queried according to the attribute column of the data table object, the data set to be cleaned is compared with the dictionary table data rule, and the standardized cleaning is then carried out according to the dictionary values. If a value in the source data set satisfies the dictionary rule, it is replaced by the dictionary value; if it does not, the original data is kept. The data is written into the target data source after cleaning is finished.
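A minimal pandas sketch of this dictionary cleaning rule is given below: values that match the dictionary are replaced by the dictionary value, and values that do not match keep the original data. The column name and dictionary contents are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "male", "F", "unknown"]})

# Hypothetical dictionary table: original value -> standardized dictionary value.
dictionary = {"M": "male", "male": "male", "F": "female", "female": "female"}

# Map values through the dictionary; values not covered by the rule keep the original data.
df["gender_clean"] = df["gender"].map(dictionary).fillna(df["gender"])
print(df)
```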
Example 2
As shown in fig. 2, the data cleansing unit 2 of the present application further relates to a data cleansing method, which includes:
s1: acquiring a data sequence of sensing data or sampling information of an acquisition terminal;
s2: filtering and screening out first abnormal data in the data sequence based on a data threshold;
s3: performing secondary verification on the first abnormal data, judging whether the data is abnormal due to energy supply fluctuation or transmission obstacle, and generating second abnormal data by generating fluctuation (transmission network fluctuation) of construction parameters of a data transmission channel for transmitting the acquired data so as to filter the first abnormal data and generate third abnormal data;
s4: and combining the second abnormal data with the data sequence of the screened first abnormal data, and splicing the data sequence without the third abnormal data, thereby obtaining the continuous data segment after uniform processing.
Preferably, inaccuracy of data acquisition may arise when the parameters supporting the acquisition operation fluctuate during acquisition, making the acquisition unstable, so that the acquired valid data contains data sequence segments that are untrustworthy but do not affect the validity of the data. Fluctuation of the construction parameters of the data transmission channel refers to instability of the wide area network or of a line in the communication network that carries the data, with repeated on-off or momentary interruptions, which forces routing protocols to be recalculated frequently. As a result, unnecessary (abnormal) data sequence fragments are mixed into the data sequence because of the unstable transmission of the sensing data; after transmission the data sequence appears to be in an untrusted state and cannot be screened by conventional means, yet the originally trustworthy data sequence in this part of the sensing data is not damaged, its availability and credibility are unchanged, and it can in substance be uploaded and stored as trusted data.
Preferably, when the sampled data is continuously varying physical data such as temperature, the actual temperature of a human body or a device changes gradually, and the rate of change is related to the actual target and its environment. For example, the cooling water temperature of an electromechanical device rises at a roughly uniform rate over time intervals, and there is no jump in the temperature data. If a jump in temperature appears in the data, it is considered an abnormality in the acquisition process, since the temperature of the acquisition object does not actually jump. Preferably, this occurs because the sampled data itself is unreasonable, and such values cannot be identified and filtered using existing noise filtering methods. Preferably, the data remaining in the uploaded data sequence after the data cleaning unit 2 filters out the abnormal data based on the time domain characteristics of the sensing data is reliable and accurate data that conforms to those time domain characteristics.
Preferably, after the third abnormal data, which presents problems or threats, is filtered out according to steps S2-S3, the remaining data sequence may have gaps on the time axis, so that the data sequence combined for a certain period of time is discontinuous and/or uneven in time. In order to conveniently select complete data sequences of a certain time length for use, the data cleaning unit 2 processes the plurality of data segments in the data sequence from which the third abnormal data has been screened out so that they are uniform in time, thereby providing reliable, continuous and temporally uniform data segments for subsequent use. Preferably, a continuous data segment refers to a data segment in the data sequence without the third abnormal data in which the time interval between all adjacent data points is smaller than a preset time-interval threshold. Preferably, the sampling period and frequency of the sensing data are adaptively adjusted according to the actual application scenario and the nature of the sensing data. This embodiment can acquire a time-domain sampled data sequence based on different acquisition environments and the physical and data characteristics of the sensing data, can remove the third abnormal data on the basis of the time domain characteristics of the sensing data, and uniformly processes in time the continuous data segments of the data sequence after the third abnormal data has been removed, so that the cleaning of the time-domain sampled data sequence is realised, reliable and accurate sampling data is finally obtained, and the accuracy of correlation analysis based on the sampling data is further improved.
Preferably, the step S2 of performing the filtering operation on the first abnormal data in the data sequence based on the time domain characteristics of the sensing data may include the following steps:
s201: setting a data threshold value of a critical point of the variation degree of the segmentation data according to the variation of the periodically acquired sensing data on a time axis;
s202: screening out a first data sequence with abnormal data according to a data threshold value, and dividing the data sequence acquired in a single period into a plurality of data segments;
s203: and screening out first abnormal data in at least one data segment according to the data threshold value.
Preferably, the data threshold reflects a temporal characteristic of the sensed data, i.e. how the sensed data changes over time. For example, taking continuously collected human body temperature as an example, the change of the human body temperature within 5 minutes does not exceed 1 degree, so with the collection period set to 5 minutes, the data threshold representing the permitted change in the data may be set to 1 degree. When the variation of the temperatures at several time points collected within a certain period differs by more than 1 degree, the body temperature data is determined to be first abnormal data.
Preferably, in step S202, based on the data threshold set in step S201, the data sequence acquired in a single period is divided into a plurality of data segments, and the unit time length of each data segment is 1/10 of the acquisition period. Preferably, when the data sequence is divided into a plurality of data segments, the data segments do not overlap one another.
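The segmentation of steps S201-S202 can be sketched roughly as follows, assuming one acquisition period is sampled uniformly and split into ten non-overlapping segments; the sample values and the threshold of 1 degree are assumptions taken from the temperature example above.

```python
import numpy as np

# One acquisition period of temperature samples (hypothetical data, one jump value at index 7).
period = np.array([36.5, 36.5, 36.6, 36.6, 36.7, 36.7, 36.8, 42.0, 36.8, 36.9,
                   36.9, 36.9, 37.0, 37.0, 37.0, 37.1, 37.1, 37.1, 37.2, 37.2])

data_threshold = 1.0        # permitted change within the period (e.g. 1 degree per 5 minutes)
n_segments = 10             # unit segment length = acquisition period / 10

# Split the period into non-overlapping data segments.
segments = np.array_split(period, n_segments)

# Flag the segments whose internal variation exceeds the data threshold.
abnormal_segments = [i for i, seg in enumerate(segments)
                     if seg.max() - seg.min() > data_threshold]
print(abnormal_segments)    # the segment containing the jump value is flagged
```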
Preferably, in step S203, the first abnormal data in the plurality of data segments is screened out using the data threshold. Preferably, for a first data segment among the data segments, the following methods can be adopted to remove the first abnormal data in it: removing data outside the data range in the first data segment; and/or removing data whose fluctuation rate is greater than a fluctuation threshold in the first data segment.
Preferably, the step of removing the data outside the data range in the first data segment may be: calculating the mean and variance of the data in the first data segment, denoted μ and σ respectively; setting the upper and lower boundaries of the data range according to the mean and variance, denoted μ+ρσ and μ-ρσ respectively; and removing the data in the first data segment larger than the upper boundary μ+ρσ and the data smaller than the lower boundary μ-ρσ, i.e. retaining only the data in the first data segment between the upper boundary μ+ρσ and the lower boundary μ-ρσ. Here ρ is a coefficient, which may depend on the application scenario and the sensed data.
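A minimal sketch of this range-based removal is given below; the segment values and the choice ρ = 2 are assumptions, and σ is computed here as the standard deviation so that the bounds μ ± ρσ are in the same units as the data.

```python
import numpy as np

def remove_out_of_range(segment: np.ndarray, rho: float = 2.0) -> np.ndarray:
    """Keep only data between the lower boundary mu - rho*sigma and the upper boundary mu + rho*sigma."""
    mu = segment.mean()
    sigma = segment.std()            # standard deviation, so mu +/- rho*sigma is in data units
    lower, upper = mu - rho * sigma, mu + rho * sigma
    return segment[(segment >= lower) & (segment <= upper)]

segment = np.array([36.8, 36.9, 36.9, 37.0, 37.0, 37.1, 42.0])
print(remove_out_of_range(segment))  # the jump value 42.0 falls outside the range and is removed
```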
Preferably, the step of removing the data whose fluctuation rate is greater than the fluctuation threshold in the first data segment may be: calculating the differential of the data in the first data segment; and removing the data in the first data segment whose differential has an absolute value greater than the differential threshold. In this alternative embodiment, the volatility of the data is represented by the differential and the fluctuation threshold by a differential threshold. The absolute values of the differentials of all data in the data sequence can be compared with the differential threshold; values above the threshold typically appear as isolated fragments. Data whose differential absolute value exceeds the differential threshold belongs to abnormally changing data, such as data from the initial acquisition stage, data from the final acquisition stage, or data where the acquisition object is lost for some reason; such data generally belongs to the first abnormal data.
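A sketch of this fluctuation-based removal follows: points reached by a first difference whose absolute value exceeds the differential threshold are treated as abnormally changing data. The segment values and the threshold are assumptions; how the differential is attributed to a point is a simplification for illustration.

```python
import numpy as np

def remove_high_fluctuation(segment: np.ndarray, diff_threshold: float = 0.5) -> np.ndarray:
    """Remove points whose differential (first difference from the previous point) exceeds the threshold."""
    diffs = np.abs(np.diff(segment))                     # len(segment) - 1 differences
    # Mark a point as abnormal when the step leading into it is larger than the threshold.
    abnormal = np.concatenate(([False], diffs > diff_threshold))
    return segment[~abnormal]

segment = np.array([36.8, 36.9, 41.8, 36.9, 37.0])
# Both the spike 41.8 and the point reached by the jump back down are flagged and removed.
print(remove_high_fluctuation(segment))
```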
Example 3
This embodiment is a further improvement of embodiment 1, and repeated contents are not described again.
The data cleaning unit 2 can acquire a time-domain sampled data sequence of specific sensing data acquired by a certain acquisition sensor. That is, the data cleaning unit 2 can acquire the data sequence formed by the data acquisition unit 1 sampling the sensing data in the time domain. Preferably, during time-domain sampling of the sensing data the data acquisition unit 1 adds timestamps to the acquired data in time-sequence order, so that all the sampled data acquired by the data cleaning unit 2 carries timestamps associated with the acquisition time points. Preferably, the data collected by the data acquisition unit 1 may also be sampled data without timestamps, obtained by starting at a specified time point and sampling the sensing data equidistantly at set time intervals. Before data cleaning, the data cleaning unit 2 can add timestamps to the sequentially uploaded sampled data according to the start time and the interval time of data acquisition, so as to obtain a time-domain sampled data sequence of the sensing data.
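A small sketch of attaching timestamps to equidistant samples from the acquisition start time and interval, as described above, is given below; the start time, interval and sample values are assumed for illustration.

```python
from datetime import datetime, timedelta

samples = [36.8, 36.9, 36.9, 37.0]          # uploaded sampled data without timestamps
start_time = datetime(2021, 10, 18, 8, 0)   # assumed acquisition start time
interval = timedelta(minutes=5)             # assumed sampling interval

# The data cleaning unit attaches a timestamp to each sequentially uploaded sample.
timestamped = [(start_time + i * interval, value) for i, value in enumerate(samples)]
for ts, value in timestamped:
    print(ts.isoformat(), value)
```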
Preferably, the acquired sensing data serves as a representation of the monitored state, so it has physical and mathematical characteristics, i.e. certain time domain characteristics: over a continuous time period the data changes gently or repeats regularly, and there are no sudden changes with an excessively large change amount. For example, a system in one area receives the operating parameters of a plurality of electromechanical devices at the same time; without external intervention, their operating states and output power parameters are generally stable, so changes in the output power parameters are not obvious.
Preferably, within the same data network there are a plurality of processing background devices that can acquire terminal device data and periodically upload the acquired device information data to the data network, and each terminal device has a specific IP address, so that the data information uploaded each time over each data channel connected to each IP has similar data content and data format. The data information uploaded over several consecutive periods also changes only slowly, and at least one data segment of the data sequence can always be matched to the data sequences of other acquisition periods, so the data cleaning unit 2 can determine that the data uploaded each time bears the working-condition signature of a terminal device in the normal operating state. When a specific data segment of a collected data sequence cannot be matched with the historical data sequences, and the other data segments of the data sequence differ obviously from the corresponding data segments of previously uploaded data sequences, it is judged that an abnormal situation occurred during acquisition of the data sequence, or the data sequence is determined to carry an intrusion threat and is screened out.
According to the content related to step S3 in embodiment 1, it is found in the actual data sorting and manual evaluation that, at the time points corresponding to part of the abnormal data sequences, the terminal device associated with the sensing device was operating normally and there was no external intrusion; the cause of the data abnormality is unstable acquisition or unstable transmission. The data segments of a data sequence that cannot be matched with historical data or data rules are therefore filtered a second time. Preferably, the data cleaning unit 2 can perform a secondary screening, as an evaluation, of the first abnormal data filtered out by the primary screening. Specifically, when the data cleaning unit 2 recognises that abnormal data has been uploaded from one IP address, it examines the data uploaded from IP addresses that are close to that IP address and belong to the same network area. The data cleaning unit 2 judges whether the same or a similar data abnormality exists as for that IP. If the same data sequence abnormality occurs in the terminal devices across the same network environment of the whole area, the uploaded data abnormality is probably caused by a network abnormality in the area, so it is judged that the data uploaded in the same area environment at that time point carries no intrusion threat, and this part of the data (second abnormal data) can be uploaded to the database and stored as normal data. Conversely, if the data collected by the other terminal devices in the same area as the terminal device producing the first abnormal data is normal, and no unmatched or unidentifiable data exists there, it is determined that the terminal device at that IP address has a fault or has suffered external intrusion causing the data abnormality, and the data (third abnormal data) uploaded by that terminal device threatens the database or risks leaking the stored data. As shown in fig. 1, the data cleaning unit 2 can send the IP address or terminal device information corresponding to the third abnormal data to the early warning module 3, so that the user can trace the source of the abnormal situation in time.
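The secondary screening described above can be sketched roughly as follows: if neighbouring acquisition units on the same network segment show the same anomaly at the same time point, the data is treated as second abnormal data (network fluctuation, kept and stored); if the anomaly is isolated to one unit, it is treated as third abnormal data and that unit's address would be reported to the early warning module. All function names, field names and the "same first three octets" rule are illustrative assumptions, not the method as claimed.

```python
def same_network_segment(ip_a: str, ip_b: str) -> bool:
    # Crude assumption: units share the first three octets of their IPv4 address.
    return ip_a.rsplit(".", 1)[0] == ip_b.rsplit(".", 1)[0]

def classify_abnormal(ip: str, timestamp: str, abnormal_flags: dict) -> str:
    """abnormal_flags maps (ip, timestamp) -> True if that unit's data failed the first screening."""
    neighbours = [other for (other, ts) in abnormal_flags
                  if ts == timestamp and other != ip and same_network_segment(other, ip)]
    if any(abnormal_flags[(other, timestamp)] for other in neighbours):
        # The same anomaly appears across the area at the same time: likely network
        # fluctuation, so this is second abnormal data and can be stored as normal data.
        return "second_abnormal"
    # Isolated anomaly: the terminal is faulty or intruded; report it to the early warning module.
    return "third_abnormal"

flags = {
    ("192.168.1.10", "2021-10-18T08:00"): True,
    ("192.168.1.11", "2021-10-18T08:00"): True,   # a neighbour is also abnormal -> network fluctuation
    ("192.168.1.12", "2021-10-18T08:00"): False,
}
print(classify_abnormal("192.168.1.10", "2021-10-18T08:00", flags))   # second_abnormal
```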
By screening the conventionally screened first abnormal data a second time, the present application can effectively retain data whose abnormality is caused by interference factors occurring in the acquisition and transmission processes, so that the information carried by these data sequences can be uploaded to the corresponding database for storage, or the data requests contained in the corresponding data sequences can be authenticated for normal data access to the designated database. Preferably, in view of the uncertainty and instability of existing Internet communication, the requirement of resuming transmission after a network interruption needs to be considered when transmitting the sensing data, so that the integrity of the data over a certain monitoring period can be maintained. In particular, when the backlog of data caused by a network-interruption communication fault is transmitted as compensation, a large amount of concurrent data is transmitted at the same time. The server cannot determine which data is reliable and which needs to be discarded, so the data must be cleaned before being sent. The data cleaning unit 2 can find, in the accumulated industrial sensing data, the obvious second abnormal data caused by time-mark and other anomalies in the data sequence, pre-process it (data cleaning) in the manner of "rebuild, recover or discard", and upload the data in batches, which prevents the accumulated data from exceeding the data volume of a single upload period, which would otherwise block the transmission channel to some extent and delay the uploading of subsequent data.
Example 4
In order to implement the data cleaning method of the above embodiment, the present application also provides a data cleaning device.
As shown in fig. 3, which is a schematic structural diagram of the data cleansing apparatus provided in the present application, the data cleansing apparatus 4 provided in the embodiment of the present application includes a processor 41, a memory 42, an input/output device 43, and a bus 44. The processor 41, the memory 42, and the input/output device 43 are preferably connected to the bus 44, respectively, and the memory 42 stores therein program data for performing data cleansing processing. Processor 41 is operative to execute program data to implement a data cleansing method. Preferably, the processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip having signal processing capabilities. The processor 41 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 41 may be any conventional processor or the like.
The application also provides a computer storage medium, which is used for storing program data and archives of the various sensor detection data uploaded by the data acquisition unit 1 and processed by the data cleaning unit. Preferably, the program data is adapted to perform data cleaning when executed by the processor.
Preferably, online monitoring data standardization scales and converts each type of inconsistent data so that it falls into a small specific interval. The standardization of monitoring data mainly comprises co-trending processing and dimensionless processing of the data, which both guarantees the operating boundaries and highlights the essential meaning of the monitoring data. Because different online monitoring data have different dimensions and dimensional units, the normalization state of the data has a certain influence on the analysis results of the online monitoring data. To reduce the influence of dimension on the monitoring data, the online monitoring data needs to be standardized. By keeping the data on the same dimensional level, the data can be analysed comprehensively to form comparable evaluation results. Based on different service scenarios, the data standardization processing adopts different processing algorithms and processing modes for different information fields. For index monitoring data obtained by automatic acquisition from monitoring equipment, a standardization algorithm can be used for automatic preprocessing based on the requirements of data analysis modelling.
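A small sketch of dimensionless processing of monitoring data is given below, using min-max scaling into a fixed small interval and z-score scaling; which algorithm is appropriate would depend on the service scenario, and the indicator values are illustrative assumptions.

```python
import numpy as np

values = np.array([220.0, 225.0, 231.0, 219.0, 228.0])   # e.g. a voltage-like monitoring indicator

# Min-max scaling: map the data into the interval [0, 1] so that indicators with
# different dimensions can be compared on the same level.
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score scaling: remove the dimension by centring and dividing by the standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max.round(3))
print(z_score.round(3))
```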
It should be noted that the above-mentioned embodiments are exemplary, and those skilled in the art, having the benefit of the present disclosure, may devise various solutions that are within the scope of the present disclosure and fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and do not limit the claims. The scope of the invention is defined by the claims and their equivalents. Throughout this document, a feature described as "preferable" is only an optional feature and should not be understood as essential; the applicant reserves the right to disclaim or delete any such preferred feature at any time.

Claims (10)

1. A method for cleaning data, comprising:
s1: acquiring sensing data to be cleaned;
s2: filtering first abnormal data in a data sequence of the sensing data through a preset data threshold;
s3: performing secondary verification on the first abnormal data, and screening out second abnormal data generated by fluctuation of construction parameters of the data transmission channel;
s4: performing connectivity splicing on the data sequence supplementing the second abnormal data; wherein the content of the first and second substances,
the second abnormal data can be supplemented with the processed data sequence, so that the cleaned sensing data is reconstructed and at least part of the missing data sequence is recovered.
2. The data cleaning method according to claim 1, wherein the first abnormal data is obtained by performing a comparison verification on whether abnormal sensing data at the same time exists in other data acquisition units (1) on the same acquisition network or acquisition branch to which the data acquisition unit (1) acquiring the sensing data belongs.
3. The data cleaning method according to claim 1, wherein the performing of the secondary verification on the first abnormal data further comprises determining whether the sensed data acquired by other data acquisition units (1) in the same communication transmission network of the data acquisition unit (1) corresponding to the first abnormal data at the same time is abnormal, and taking the verification result as the screening condition of the second abnormal data.
4. The data cleansing method according to claim 2 or 3, characterized in that the data acquisition unit (1) is capable of uploading its acquired sensing data to the sampling database of the data cleaning unit (2) via a network transmission channel, so that the data cleaning unit (2) obtains a data sequence of the sensing data;
the data cleaning unit (2) selectively formulates different cleaning strategies according to the service scenario and the analysis rules, and completes reconstruction of the sensing data and recovery of at least part of the missing data through the selected cleaning strategy.
5. The data cleansing method of claim 1, wherein the step S2 of performing a filtering operation of abnormal data in the data sequence based on the time domain characteristics of the sensed data at least comprises:
s201: setting a data threshold value of a critical point of the variation degree of the segmentation data according to the variation of the periodically acquired sensing data on a time axis;
s202: screening out a data sequence with abnormal data according to a data threshold value, and dividing the data sequence acquired in a single period into a plurality of data segments;
s203: and screening abnormal data in at least one data segment according to the data threshold value.
6. The data cleansing method of claim 5, wherein the data sequence collected in a single cycle is divided into a plurality of data segments by a predetermined unit time length, wherein different data segments on the same data sequence do not overlap each other.
7. The data cleansing method according to one of the preceding claims, characterized in that the rules of the data cleansing policy are defined according to the rule results of the data analysis, and the data cleaning unit (2) executes predefined analysis and inspection rules on the data objects, reports or raises an alarm to identify abnormal data, and performs a data cleansing task on the data objects after the abnormal data is captured.
8. A data washing device, characterized by comprising at least a data acquisition unit (1) and a data washing unit (2), wherein,
the data acquisition unit (1) acquires sensing data to be cleaned and uploads the sensing data to the data cleaning unit (2) for cleaning;
the data cleaning unit (2) filters first abnormal data in a data sequence of the sensing data through a preset data threshold, the data cleaning unit (2) can also carry out secondary verification on the first abnormal data, screen out second abnormal data generated by fluctuation of construction parameters of a data transmission channel, and carry out connectivity splicing on the data sequence for supplementing the second abnormal data.
9. The data cleaning device according to one of the preceding claims, wherein the data cleaning unit (2) is capable of sending information of the data acquisition unit (1) acquiring third abnormal data to the early warning module (3), the third abnormal data being the first abnormal data remaining after the second abnormal data has been filtered out.
10. A computer storage medium is characterized in that the computer storage medium is used for storing program data and archives of various sensor detection data uploaded by a data acquisition unit (1) and processed by a data cleaning unit (2).
CN202111212807.8A 2021-10-18 2021-10-18 Data cleaning method and equipment and computer storage medium Pending CN113934720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212807.8A CN113934720A (en) 2021-10-18 2021-10-18 Data cleaning method and equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111212807.8A CN113934720A (en) 2021-10-18 2021-10-18 Data cleaning method and equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113934720A true CN113934720A (en) 2022-01-14

Family

ID=79280166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111212807.8A Pending CN113934720A (en) 2021-10-18 2021-10-18 Data cleaning method and equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113934720A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661802A (en) * 2022-01-25 2022-06-24 桂林电子科技大学 System and method for efficiently acquiring and analyzing factory equipment data
CN114661802B (en) * 2022-01-25 2024-04-05 桂林电子科技大学 Efficient collection and analysis system and method for factory equipment data
CN115037643A (en) * 2022-03-25 2022-09-09 武汉烽火技术服务有限公司 Method and device for acquiring and labeling network health state data
CN115037643B (en) * 2022-03-25 2023-05-30 武汉烽火技术服务有限公司 Method and device for collecting and labeling network health state data
CN114679500A (en) * 2022-05-30 2022-06-28 深圳市明珞锋科技有限责任公司 Acceleration type information transmission system for merging repeated information
CN115580545A (en) * 2022-12-09 2023-01-06 中用科技有限公司 Internet of things communication method for improving data transmission efficiency
CN115580545B (en) * 2022-12-09 2023-04-07 中用科技有限公司 Internet of things communication method for improving data transmission efficiency

Similar Documents

Publication Publication Date Title
CN113934720A (en) Data cleaning method and equipment and computer storage medium
CA2689219C (en) Method and system for state encoding
CN116209963A (en) Fault diagnosis and solution recommendation method, device, system and storage medium
CN111027615B (en) Middleware fault early warning method and system based on machine learning
CN109918196B (en) System resource allocation method, device, computer equipment and storage medium
CN113157524B (en) Big data based exception problem solving method, system, equipment and storage medium
CN112751711B (en) Alarm information processing method and device, storage medium and electronic equipment
CN112416872A (en) Cloud platform log management system based on big data
KR101960755B1 (en) Method and apparatus of generating unacquired power data
CN115935286A (en) Abnormal point detection method, device and terminal for railway bearing state monitoring data
CN111371647A (en) Data center monitoring data preprocessing method and device
CN114968959A (en) Log processing method, log processing device and storage medium
CN113938306B (en) Trusted authentication method and system based on data cleaning rule
CN111555895B (en) Method, device, storage medium and computer equipment for analyzing website faults
CN116986246A (en) Intelligent inspection system and method for coal conveying belt
CN112800061A (en) Data storage method, device, server and storage medium
CN113535458B (en) Abnormal false alarm processing method and device, storage medium and terminal
CN110515365B (en) Industrial control system abnormal behavior analysis method based on process mining
CN115150294A (en) Data analysis method, equipment and medium for monitoring Internet of things equipment
CN112416896A (en) Data abnormity warning method and device, storage medium and electronic device
CN113746862A (en) Abnormal flow detection method, device and equipment based on machine learning
CN112699005A (en) Server hardware fault monitoring method, electronic equipment and storage medium
CN113626236A (en) Fault diagnosis method, device, equipment and medium for distributed file system
CN113407520A (en) Power network safety data cleaning system and method based on machine learning
CN117422938B (en) Dam slope concrete structure anomaly analysis method based on three-dimensional analysis platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination