CN116383190A

CN116383190A - Intelligent cleaning method and system for massive big data

Info

Publication number: CN116383190A
Application number: CN202310537830.7A
Authority: CN
Inventors: 贾庆佳
Original assignee: Qingdao Off Site Market Clearing Center Co ltd
Current assignee: Qingdao Off Site Market Clearing Center Co ltd
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-07-04
Anticipated expiration: 2043-05-15
Also published as: CN116383190B

Abstract

The invention relates to the technical field of data processing, in particular to an intelligent cleaning method and system for massive big data, wherein the method comprises the following steps: acquiring a time sequence subsequence corresponding to massive big data; obtaining a morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences; obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results. The invention can obtain more accurate data cleaning results of time sequence data.

Description

Intelligent cleaning method and system for massive big data

Technical Field

The invention relates to the technical field of data processing, in particular to an intelligent cleaning method and system for massive big data.

Background

Data cleaning is an important step in data processing: it improves the quality and accuracy of data, ensures that the data are reliable and valid, and supports the decision making and business applications of a clearing center. During data cleaning, outlier detection must be performed on massive big data. Because the data volume is large, the collected data are sliced into data blocks to improve processing efficiency, and outlier detection is performed on each data block by parallel computing to obtain the outliers of each block. Finally, the outliers scattered across the different data blocks are gathered and counted by outlier aggregation to obtain a global outlier detection result.

When the collected data is segmented, the existing classification algorithm only considers the data difference between time sequences, but does not consider the difference change between time sequences with different lengths, so that the data classification result is inaccurate, and the accuracy of the data cleaning result of each data block obtained by classification is low.

Disclosure of Invention

In order to solve the technical problem of lower accuracy of data cleaning results of each data block obtained by classification, the invention aims to provide a method and a system for intelligently cleaning massive big data, and the adopted technical scheme is as follows:

acquiring a time sequence subsequence corresponding to massive big data;

obtaining a morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences;

obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results.

Preferably, the method for obtaining the morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences specifically includes:

for any two time sequence subsequences, obtaining a change rate corresponding to the data according to the difference between every two adjacent data in the time sequence subsequences; and taking the absolute value of the difference value between all the change rate average values corresponding to the two time sequence subsequences as a morphological similarity measurement index between any two time sequence subsequences.

Preferably, the obtaining the importance index corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences specifically includes:

acquiring the absolute value of the difference of the length between every two time sequence subsequences and recording the absolute value as the length difference of the time sequence subsequences;

for any two time sequence subsequences, the ratio between the length difference corresponding to the two time sequence subsequences and the maximum value in all the length differences is used as an importance index corresponding to any two time sequence subsequences.

Preferably, obtaining the state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence specifically includes:

acquiring the average chaining distance corresponding to the data in the time sequence subsequence by using the COF (connectivity-based outlier factor) detection algorithm, and taking a normalized value of the average chaining distance of all the data in the time sequence subsequence as the state factor of the time sequence subsequence.

Preferably, the obtaining the difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor specifically includes:

and acquiring the DTW distance between any two time sequence subsequences, acquiring the difference between state factors corresponding to the two time sequence subsequences, and acquiring a difference index between the time sequence subsequences according to the DTW distance, the difference between the state factors, the morphological similarity measurement index and the importance index.

Preferably, the method for calculating the difference index specifically includes:

[formula image omitted in source]

wherein the quantities in the formula are: the difference index between the time sequence subsequence u and the time sequence subsequence v; the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v; the DTW distance between the time sequence subsequence u and the time sequence subsequence v; the morphological similarity measurement index between the time sequence subsequence u and the time sequence subsequence v; the state factor corresponding to the time sequence subsequence u; and the state factor corresponding to the time sequence subsequence v.

Preferably, the time sequence subsequence corresponding to the massive big data is specifically: and dividing the time sequence formed by the massive big data to obtain a time sequence sub-sequence.

Preferably, the dividing the time sequence formed by the massive big data to obtain the time sequence subsequence specifically includes:

acquiring an initial segmentation point corresponding to the time sequence by using the FLOSS algorithm; in the neighborhood of the initial segmentation point, marking any data point in the neighborhood as a target data point, and respectively acquiring the number of data points which are located on the left side and on the right side of the target data point and belong to the neighborhood of the target data point; taking the larger of the number of data points on the left side of the target data point and the number of data points on the right side of the target data point as a numerator, taking the total number of data points contained in the neighborhood of the target data point as a denominator, and taking the normalized value of the ratio of the numerator to the denominator as the neighborhood direction singleness of the target data point; and recording the data point corresponding to the maximum value of the neighborhood direction singleness in the neighborhood of the initial segmentation point as a final segmentation point, and segmenting the time sequence by using the final segmentation point to obtain a time sequence subsequence.

Preferably, the step of performing data cleaning on the massive big data according to the classification result specifically includes:

and taking the local outlier factor of the data point in each category in the classification result as the abnormal degree of the data point, and removing the data point corresponding to the abnormal degree larger than the preset degree threshold.
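The cleaning step above can be illustrated with a minimal sketch. It assumes one-dimensional data values, a simplified local outlier factor that uses exactly k nearest neighbors (ties broken by position), and a small constant added to avoid division by zero. The function names, the value of k, and the threshold are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def local_outlier_factors(values: Sequence[float], k: int) -> List[float]:
    """Simplified LOF for 1-D values; requires len(values) > k."""
    n = len(values)
    neighbors: List[List[int]] = []
    k_dist: List[float] = []
    for i in range(n):
        # Sort all other points by absolute value difference to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(values[i] - values[j]))
        nearest = order[:k]
        neighbors.append(nearest)
        # k-distance: distance to the k-th nearest neighbor.
        k_dist.append(abs(values[i] - values[nearest[-1]]))

    lrd: List[float] = []
    for i in range(n):
        # Reachability distance of i from each neighbor j.
        reach = [max(k_dist[j], abs(values[i] - values[j])) for j in neighbors[i]]
        # 1e-12 is an assumption that guards against duplicate values.
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-12))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


def clean_cluster(values: Sequence[float], k: int, threshold: float) -> List[float]:
    """Keep only the points whose LOF does not exceed the preset threshold."""
    scores = local_outlier_factors(values, k)
    return [x for x, s in zip(values, scores) if s <= threshold]
```

For example, `clean_cluster([1.0, 1.1, 0.9, 1.05, 0.95, 10.0], k=2, threshold=2.0)` drops the value 10.0, whose local density is far lower than that of its neighbors, and keeps the other five values.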

The invention also provides an intelligent cleaning system for the mass big data, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the steps of the intelligent cleaning method for the mass big data when being executed by the processor.

The embodiment of the invention has at least the following beneficial effects:

firstly, according to the data difference between time sequence subsequences corresponding to massive big data, a morphological similarity measurement index between the time sequence subsequences is obtained, and the morphological similarity of the data difference condition between two time sequence subsequences is reflected by using the morphological similarity measurement index; meanwhile, when the difference index is calculated later, the situation that the two time sequence sub-sequences have the length difference is considered, namely, the importance index corresponding to the two time sequence sub-sequences is obtained according to the length difference; furthermore, considering that the overall morphology between the time sequence subsequences is similar, but the difference of the outlier states also occurs between the time sequence subsequences, further obtaining the state factors of the time sequence subsequences according to the abnormal condition of the data in the time sequence subsequences, and reflecting the outlier states of the whole data in the time sequence subsequences by using the state factors; finally, the measurement distance between the time sequence subsequences, the morphological similarity of the data difference condition between the time sequence subsequences, the length difference between the time sequence subsequences and the outlier state between the time sequence subsequences are combined to obtain a difference index, namely the difference index between the time sequence subsequences, the time sequence subsequences are classified according to the difference index, and a more accurate time sequence data classification result can be obtained. And data cleaning is carried out on massive big data according to the accurate classification result, so that the accurate data cleaning result can be obtained, and the data cleaning efficiency is improved. 
Finally, the method and the device can improve the accuracy of the time sequence data classification result and the data cleaning efficiency and accuracy.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for intelligently cleaning massive big data.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of a specific implementation, structure, characteristics and effects of the method and system for intelligent cleaning of massive big data according to the invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The specific scheme of the intelligent cleaning method and system for massive big data provided by the invention is described below with reference to the drawings.

The main purpose of the invention is as follows: time sequence subsequences are obtained with the FLOSS algorithm and then classified; the clusters obtained by classification are passed to computing nodes as data blocks for parallel computation of local outlier factors, thereby completing the data cleaning. In this process, to avoid wasting the computing power of the parallel computing nodes, and to ensure that the parallel computation achieves the best computational efficiency and local outlier factor accuracy when data are transferred between computing nodes, the data volumes of the data blocks need to be similar.

The specific scene aimed by the invention is as follows: in the process of cleaning massive big data, the massive big data is required to be divided into data blocks in order to ensure the data cleaning efficiency due to overlarge data quantity, and high-efficiency processing is performed through parallel computing nodes.

Example 1:

referring to fig. 1, a method flowchart of an intelligent cleaning method for massive big data according to an embodiment of the present invention is shown, and the method includes the following steps:

step one, acquiring a time sequence sub-sequence corresponding to massive big data.

To ensure the accuracy and reliability of transactions and settlements and to provide higher-quality services to customers, the clearing house needs to analyze and process transaction data from the financial market, for example for risk early warning, transaction verification, and settlement processing. For the financial assets of the customers served by the institution, real-time financial market data, such as trading data for securities, futures, foreign exchange, and bonds, must be analyzed and processed.

In this embodiment, massive big data during financial transaction is taken as an example to describe, namely, a data interface is established between a clearing data center and a financial transaction data source, financial transaction data corresponding to a client of the clearing center is collected, and time sequence data is formed according to a time stamp of the transaction data and recorded as a time sequence.

Before the data blocks for parallel computation are acquired, the time sequence formed by the massive big data must first be divided to obtain time sequence subsequences. The existing FLOSS (Fast Low-cost Online Semantic Segmentation) algorithm is commonly used for segmenting time series data because of its versatility and efficiency. Patent document CN113780295A discloses a method for dividing time series data by using the FLOSS algorithm.

When a time series is divided by the FLOSS algorithm, a Corrected Arc Crossings (CAC) sequence is acquired, and segmentation points are then conventionally selected by setting a threshold. In the present scenario, massive big data are divided into data blocks and the blocks are passed to parallel computing nodes for outlier detection. To maximize the computing efficiency of the parallel nodes and to optimize the outlier detection result, three conditions need to be met: the time lengths of the data blocks should be relatively similar; the division into data blocks should not cause errors in the outlier detection of data points near the segmentation points; and data transmission between the parallel computing nodes should be reduced to keep efficiency as high as possible.

When the parallel nodes calculate local outlier factors on the data blocks obtained by clustering, each data point in a block needs the data points of its neighborhood. If the time series is divided inaccurately, the data processed by one parallel node require data held by other parallel nodes, so data must be transmitted between nodes. For massive big data, this means a large amount of data transmission between the parallel nodes, which reduces the data processing efficiency.

Based on this, initial segmentation points are obtained from the long time series data with the FLOSS algorithm, the final accurate segmentation point is determined from the unidirectionality of the K-distance neighborhoods of the data points in a local range around each initial segmentation point, and the subsequences are obtained by segmentation. In this embodiment, for the relevant parameters of the FLOSS algorithm, the number of segmentation points is set to numregions = N/50, where N represents the total number of data points contained in the time sequence formed by the massive big data and numregions represents the number of segmentation points; the exclusion zone range L is set to 20. The implementer can set these values according to the specific implementation scenario.

After all the initial segmentation points have been obtained by the conventional FLOSS algorithm, the final accurate segmentation points are determined. The K-distance neighborhood of a data point o is determined by the K data points nearest to o in the time sequence. To reduce the data transmission needed when local outlier factors are calculated in parallel, the K-distance neighborhood data of a data point should lie in the same subsequence as the data point itself, while the segmentation still follows the basic principle of the FLOSS algorithm. Therefore, within the K-distance neighborhood of each initial segmentation point, the data point whose K-distance neighborhood has the highest unidirectionality is selected as the final segmentation point.

It should be noted that the number of data points located on the same side within the K-distance neighborhood of a data point characterizes the unidirectionality of that neighborhood. In this embodiment, the K-distance neighborhood refers to the neighborhood range formed by the K data points having the smallest data difference from the data point. In this embodiment, the number K of data points in the neighborhood range is 20, and the implementer can set this value according to the specific implementation scenario.

Based on this, in the neighborhood of the initial segmentation point, any data point in the neighborhood is marked as a target data point, and the numbers of data points which are located on the left side and on the right side of the target data point and belong to the neighborhood of the target data point are acquired respectively; the larger of the number of data points on the left side and the number of data points on the right side of the target data point is taken as a numerator, the total number of data points contained in the neighborhood of the target data point is taken as a denominator, and the normalized value of the ratio of the numerator to the denominator is taken as the neighborhood direction singleness of the target data point.

In this embodiment, the data point i in the K-distance neighborhood of the initial segmentation point a is taken as the target data point, and its neighborhood direction singleness is expressed as:

$$S_i^a = \mathrm{Norm}\left(\frac{\max\left(n_i^{\mathrm{right}},\, n_i^{\mathrm{left}}\right)}{K}\right)$$

wherein $S_i^a$ represents the neighborhood direction singleness of the data point i in the K-distance neighborhood of the initial segmentation point a, namely the neighborhood direction singleness of the target data point; $n_i^{\mathrm{right}}$ represents the number of data points located to the right of data point i and belonging to the K-distance neighborhood of data point i; $n_i^{\mathrm{left}}$ represents the number of data points located to the left of data point i and belonging to the K-distance neighborhood of data point i; K represents the total number of data points contained within the neighborhood of data point i; max() represents the maximum function; and Norm() represents the normalization function.

The ratio $\max\left(n_i^{\mathrm{right}},\, n_i^{\mathrm{left}}\right)/K$ represents the proportion of the neighborhood of data point i that lies on the more populated side. The larger the ratio, the more data points in the neighborhood of data point i lie on the same side, the larger the corresponding neighborhood direction singleness, and the greater the unidirectionality of data point i within its K-distance neighborhood.

The neighborhood direction singleness of a data point represents how one-sided the K-distance neighborhood of the data point is. The larger its value, the more data points in the neighborhood range lie on the same side of the data point, and the greater the unidirectionality of the data point within its K-distance neighborhood. The smaller its value, the fewer data points in the neighborhood range lie on the same side, and the smaller the unidirectionality of the data point within its K-distance neighborhood.

According to the above method, the neighborhood direction singleness is obtained for every data point within the K-distance neighborhood range of each initial segmentation point, and the data points in the neighborhood are screened by this value: the data point corresponding to the maximum neighborhood direction singleness in the neighborhood of the initial segmentation point is recorded as the final segmentation point, and the time sequence is segmented at the final segmentation points to obtain the time sequence subsequences.
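The refinement of a segmentation point described above can be sketched as follows. This is only an illustration under stated assumptions: the K-distance neighborhood is taken as the K points with the smallest value difference (ties broken by distance in position), the Norm() step is omitted because the ratio already lies in (0, 1], and the initial segmentation point is supplied by the caller rather than computed by FLOSS. All function names are hypothetical.

```python
from typing import List, Sequence


def k_neighborhood(series: Sequence[float], i: int, k: int) -> List[int]:
    """Indices of the k points whose values differ least from series[i]."""
    others = [j for j in range(len(series)) if j != i]
    # Assumption: ties in value difference are broken by positional distance.
    others.sort(key=lambda j: (abs(series[j] - series[i]), abs(j - i)))
    return others[:k]


def direction_singleness(series: Sequence[float], i: int, k: int) -> float:
    """max(left count, right count) / k for the neighborhood of point i."""
    neighbors = k_neighborhood(series, i, k)
    left = sum(1 for j in neighbors if j < i)
    right = len(neighbors) - left
    return max(left, right) / k


def refine_split_point(series: Sequence[float], initial: int, k: int) -> int:
    """Pick the point in the neighborhood of `initial` with the largest singleness."""
    candidates = k_neighborhood(series, initial, k)
    return max(candidates, key=lambda i: direction_singleness(series, i, k))
```

With `series = [10, 11, 9, 10, 50, 51, 49, 50]`, `k = 3` and an initial segmentation point at index 4, the candidates are indices 5, 6 and 7. Index 7 has all three of its neighbors on its left, so its singleness is 1.0 and it is chosen as the final segmentation point.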

The final segmentation point is selected according to the unidirectionality of the K-distance neighborhoods of the data points within the neighborhood range of each initial segmentation point. Compared with segmenting long time series data at the segmentation points of the conventional FLOSS algorithm, this reduces data transmission between parallel nodes during parallel computation and improves the data processing efficiency.

Step two, obtaining morphological similarity measurement indexes between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; and obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences.

After all the time sequence subsequences corresponding to the financial transaction big data to be cleaned have been obtained, the following must be considered. Segmenting the time sequence formed by the massive big data at the final segmentation points lets the parallel nodes exchange less data during computation, but the lengths of the resulting time sequence subsequences may differ greatly. The distance between time sequence subsequences therefore needs to be optimized when the subsequences are clustered.

When the time sequence subsequences are clustered, distances must be measured between subsequences of different lengths, so distances between long and short sequences are calculated. When the DTW distance is calculated, the different lengths may distort the distance measured between time sequence subsequences. The distance measure between subsequences therefore needs to be optimized during clustering, and the average length of the time sequence subsequences within each category needs to be constrained, so that the categories obtained by clustering contain similar amounts of data and each parallel node has a similar computational load.

In conventional K-means clustering, distance is measured by the Euclidean distance between data points; when time sequences are clustered, distance is measured by the DTW distance between the sequences. The distance measure between time sequences must also take the length difference between the sequences into account: the larger the length difference between two sequences, the more weight the similarity of their data distributions should receive. The distance between time sequence subsequences during clustering therefore has two components, namely the data distribution similarity and the DTW distance, and the relative weight of the two is determined by the length difference.
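For reference, the DTW distance between two sequences of different lengths can be computed with the classic dynamic-programming recurrence. The sketch below is the textbook formulation with an absolute-difference local cost and no window constraint; the patent does not state which DTW variant it uses, so these choices and the function name are assumptions.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW distance with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]
```

Because one point may be matched to several points of the other sequence, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0 even though the lengths differ, which is why the length difference has to be accounted for separately.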

Based on the above, a morphological similarity measurement index between any two time sequence subsequences is obtained according to the difference between the data in the two subsequences. Specifically, for any two time sequence subsequences, the change rate corresponding to each data is obtained according to the difference between every two adjacent data in the subsequence; the absolute value of the difference between the mean values of all the change rates of the two subsequences is taken as the morphological similarity measurement index between them, expressed by the formula:

$$X_{u,v} = \left| \frac{1}{N_u} \sum_{m=1}^{N_u} r_{u,m} - \frac{1}{N_v} \sum_{n=1}^{N_v} r_{v,n} \right|$$

wherein $X_{u,v}$ represents the morphological similarity measurement index between the time sequence subsequence u and the time sequence subsequence v; $r_{u,m}$ represents the change rate corresponding to the m-th data in the time sequence subsequence u; $N_u$ represents the length of the time sequence subsequence u, i.e. the total amount of data contained in the time sequence subsequence u; $r_{v,n}$ represents the change rate corresponding to the n-th data in the time sequence subsequence v; and $N_v$ represents the length of the time sequence subsequence v, i.e. the total amount of data contained in the time sequence subsequence v.

In this embodiment, the change rate corresponding to the m-th data in the time sequence subsequence u may be obtained by calculating the difference between the m-th data and the (m-1)-th data and taking the ratio of this difference to the (m-1)-th data as the change rate corresponding to the m-th data. It should be noted that when the change rate corresponding to the first data is calculated, the value of the preceding data defaults to 0.

In other embodiments, the change rate corresponding to the m-th data in the time sequence subsequence u may also be obtained as follows. Because the time sequence subsequence is a time series, there is a time interval between every two adjacent data, that is, between the m-th data and the (m-1)-th data in the time sequence subsequence u. On this basis, the time interval between the m-th data and the (m-1)-th data is obtained, the difference between the m-th data and the (m-1)-th data is calculated, and the ratio of the difference to the time interval is taken as the change rate corresponding to the m-th data.

The change rate corresponding to the n-th data in the time sequence subsequence v is obtained in the same way as the change rate corresponding to the m-th data in the time sequence subsequence u.

The absolute difference between the mean change rates represents the difference between the data change conditions of the time sequence subsequence u and the time sequence subsequence v. The larger this difference, the less similar the data distributions of the two time sequence subsequences, that is, the less similar their morphology, and the larger the corresponding morphological similarity measurement index. The smaller this difference, the more similar the data distributions of the two time sequence subsequences, that is, the more similar their morphology, and the smaller the corresponding morphological similarity measurement index.
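A minimal sketch of the morphological similarity measurement index, using the ratio-based change rate of this embodiment, is given below. One point is an assumption of the sketch rather than of the patent: a change rate whose preceding value is 0 (which includes the first data, whose preceding value defaults to 0) is set to 0.0 to avoid division by zero. The function names are hypothetical.

```python
from typing import List, Sequence


def change_rates(seq: Sequence[float]) -> List[float]:
    """Change rate of each data: (x_m - x_{m-1}) / x_{m-1}."""
    rates: List[float] = []
    prev = 0.0  # the value preceding the first data defaults to 0
    for x in seq:
        # Assumption: the rate is 0.0 whenever the preceding value is 0.
        rates.append((x - prev) / prev if prev != 0 else 0.0)
        prev = x
    return rates


def morphological_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Absolute difference between the mean change rates of u and v."""
    mean_u = sum(change_rates(u)) / len(u)
    mean_v = sum(change_rates(v)) / len(v)
    return abs(mean_u - mean_v)
```

For `u = [1.0, 2.0, 4.0]` the change rates are `[0.0, 1.0, 1.0]` with mean 2/3, and for `v = [2.0, 4.0, 8.0, 16.0]` the mean is 0.75, so the index is about 0.083; both sequences double at every step, and the small index reflects their similar morphology despite the different lengths.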

When time sequence subsequences differ in length, the last data point of the shorter subsequence is matched to several data points of the longer subsequence when the DTW distance is calculated; this phenomenon is caused by the length difference between the subsequences. When the data distributions of two time sequence subsequences are more similar, the measurement distance between them needs to be reduced, so that subsequences with similar distributions but a large length difference can be grouped into one class during clustering.

When the measurement distance between time sequence subsequences is obtained, the importance of the morphological similarity is obtained by normalizing the length differences between subsequences by their maximum value; this importance balances the contributions of the DTW distance and the morphological similarity. In other words, the larger the length difference between two time sequence subsequences, the more important the measure of morphological similarity between them.

Based on the importance index, the importance index corresponding to any two time sequence subsequences is obtained according to the length difference between any two time sequence subsequences, and specifically, the absolute value of the difference of the length between every two time sequence subsequences is obtained and recorded as the length difference of the time sequence subsequences; for any two time sequence subsequences, the ratio between the length difference corresponding to the two time sequence subsequences and the maximum value in all the length differences is used as an importance index corresponding to any two time sequence subsequences, and the importance index is expressed as follows by a formula:

$W_{u,v} = \dfrac{\left| L_u - L_v \right|}{\max\left( \left| L_u - L_v \right| \right)}$

where $W_{u,v}$ represents the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v, $L_u$ represents the length of the time sequence subsequence u, $L_v$ represents the length of the time sequence subsequence v, $\max(\cdot)$ represents the maximum function, $\left| L_u - L_v \right|$ represents the length difference, and $\max\left( \left| L_u - L_v \right| \right)$ represents the maximum of the length differences corresponding to all pairs of time sequence subsequences.

The larger the length difference, the larger the corresponding importance index value, and the more important the similarity of the data distribution between the time sequence subsequence u and the time sequence subsequence v, namely the measure of morphological similarity between the two time sequence subsequences.

The importance index characterizes how much the similarity of the data distributions of the time sequence subsequences matters for the measurement. The larger the importance index, the more attention needs to be paid to the data distributions of the time sequence subsequences, namely their morphological similarity. The smaller the importance index, the less attention needs to be paid to the morphological similarity, and the more the measurement distance is determined directly by the DTW distance.
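The importance index described above can be sketched as follows. The function name `importance_matrix` and the sample lengths are illustrative, and returning zeros when all subsequences have the same length is an assumption, since the text does not cover that case.

```python
def importance_matrix(lengths):
    """Pairwise importance index: |L_u - L_v| divided by the largest length difference."""
    n = len(lengths)
    diffs = [[abs(lengths[u] - lengths[v]) for v in range(n)] for u in range(n)]
    max_diff = max(max(row) for row in diffs)
    if max_diff == 0:
        # Assumption: with equal lengths everywhere, no pair needs extra weight.
        return [[0.0] * n for _ in range(n)]
    return [[d / max_diff for d in row] for row in diffs]


lengths = [100, 120, 180]  # hypothetical subsequence lengths
W = importance_matrix(lengths)
print(W[0][1], W[1][2], W[0][2])  # 0.25 0.75 1.0
```

The pair with the largest length difference receives an importance index of 1, so for that pair the morphological similarity carries the most weight.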

Step three, obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results.

It should be noted that the morphological similarity measurement index measures the overall change between time sequence subsequences. Even when the overall morphology of two time sequence subsequences is similar, their outlier states may differ. That is, although the classification measurement distance between the time sequence subsequences is small, the outlier degrees of the data within them may differ: the overall value of the local outlier factor is higher in a time sequence subsequence with a larger outlier degree and lower in one with a smaller outlier degree. When abnormal data points are evaluated for two time sequence subsequences that differ in this way, the lower local outlier factors may be overlooked, so that abnormal data points in the time sequence subsequence with the lower overall local outlier factor go undetected and the data cleaning result becomes less accurate.

Based on this, the state factor of a time sequence subsequence is obtained according to the abnormal condition of the data in it. Specifically, in this embodiment, the average link distance corresponding to the data in the time sequence subsequence is obtained using the COF (connectivity-based outlier factor) detection algorithm, and the normalized value of the average link distance of all the data in the time sequence subsequence is used as the state factor of the time sequence subsequence.

Obtaining the average link distance corresponding to the data in a time sequence subsequence with the COF outlier factor detection algorithm is a well-known technique, and the specific procedure is not described in this embodiment. The average link distance of a data point reflects how abnormal the data point is, that is, its outlier state: the larger the average link distance, the more likely the data point is an outlier. The state factor of a time sequence subsequence therefore characterizes the degree of the outlier state of the data in the subsequence as a whole. The larger the state factor, the greater the overall outlier degree of the data in the time sequence subsequence, and the higher the overall value of the local outlier factors calculated for its data. The smaller the state factor, the smaller the overall outlier degree of the data in the time sequence subsequence, and the lower the overall value of the local outlier factors calculated for its data.
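Since the embodiment leaves the computation to the known COF algorithm, the following is only a sketch of one way a state factor might be computed. It follows the commonly cited definition of the average chaining (link) distance for one-dimensional values. Three points are assumptions not stated in the text: the per-point distances are averaged over the subsequence, the averages are min-max normalized across subsequences, and the neighbourhood size is `k=3`. All function names are illustrative.

```python
def knn_indices(values, p, k):
    """Indices of the k nearest neighbours of values[p] by absolute difference."""
    others = [i for i in range(len(values)) if i != p]
    others.sort(key=lambda i: (abs(values[i] - values[p]), i))
    return others[:k]


def average_chaining_distance(values, p, k):
    """ac-dist of point p over its k nearest neighbours plus p itself."""
    remaining = set(knn_indices(values, p, k))
    chain = [p]
    r = len(remaining) + 1
    if r < 2:
        return 0.0
    total = 0.0
    for i in range(1, r):
        # Next link: the remaining point closest to any point already in the chain.
        cost, nxt = min(
            (min(abs(values[c] - values[q]) for c in chain), q) for q in remaining
        )
        # Earlier links get larger weights, as in the COF definition.
        total += 2.0 * (r - i) / (r * (r - 1)) * cost
        chain.append(nxt)
        remaining.remove(nxt)
    return total


def state_factors(subsequences, k=3):
    """Assumed state factor: min-max normalized mean ac-dist per subsequence."""
    means = []
    for seq in subsequences:
        ac = [average_chaining_distance(seq, p, k) for p in range(len(seq))]
        means.append(sum(ac) / len(ac))
    lo, hi = min(means), max(means)
    if hi == lo:
        return [0.0] * len(means)
    return [(m - lo) / (hi - lo) for m in means]


subsequences = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 50.0],  # contains one clear outlier
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
]
factors = state_factors(subsequences)
print(factors)  # the second subsequence receives the largest state factor
```

A production system would more likely take the average chaining distance from an existing COF implementation rather than from this quadratic-time sketch.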

Further, the closer the state factors of two time sequence subsequences, the closer the local outlier factors calculated for the data in the two time sequence subsequences. The larger the difference between the state factors of two time sequence subsequences, the larger the difference between the local outlier factors calculated for their data, so that abnormal data in the time sequence subsequence with the smaller state factor may be overlooked and errors are introduced into the data cleaning process.

Based on this, in addition to the similarity of the data distributions of the time sequence subsequences, the overall outlier state of the data in each time sequence subsequence is measured through its state factor. This avoids overlooking abnormal data in time sequence subsequences with lower values when abnormal data points are evaluated for time sequence subsequences with different outlier degrees, so that abnormal data can be detected more accurately during data cleaning.

Further, a difference index between the time sequence subsequences is obtained according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor. Specifically, the DTW distance between any two time sequence subsequences is obtained, together with the difference between the state factors corresponding to the two time sequence subsequences; the difference index between the time sequence subsequences is then obtained according to the DTW distance, the difference between the state factors, the morphological similarity measurement index and the importance index, expressed by the formula:

where the terms of the formula include the DTW distance between the time sequence subsequence u and the time sequence subsequence v, the state factor corresponding to the time sequence subsequence v, and the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v.

The larger the importance index, the larger the weight of the morphological similarity measurement index, and the more attention needs to be paid to the data distributions of the time sequence subsequences, namely their morphological similarity. The smaller the importance index, the larger the weight corresponding to the DTW distance, the less attention needs to be paid to the morphological similarity, and the more the measurement distance is determined directly by the DTW distance.

The larger the difference between the state factors of the time sequence subsequence u and the time sequence subsequence v, the larger the difference that appears between the data of the two time sequence subsequences when local outlier factors are calculated. Accordingly, the larger the corresponding DTW distance, and the larger the finally calculated measurement distance, namely the difference index between the two time sequence subsequences.

The DTW distance between time sequence subsequences is limited by their length difference. The measurement distance used in the clustering process is therefore obtained from the morphological similarity measurement index and the DTW distance, and the difference between the overall outlier states of the data in the time sequence subsequences is further taken into account. This finally yields a measurement distance that characterizes the time sequence subsequences more accurately, namely the difference index between the time sequence subsequences.

Then, the difference index between every two time sequence subsequences is obtained according to the above method, the time sequence subsequences are classified according to the difference indexes to obtain a plurality of categories, and the time sequence subsequences in each category form one data block for parallel computation. In this embodiment, the K-means clustering algorithm is used to classify the time sequence subsequences according to the difference index. The difference index serves as the measure of similarity between time sequence subsequences: the similarity is inversely related to the difference index, so the larger the similarity, the smaller the difference index. The number of clusters in the K-means clustering algorithm is chosen by the implementer according to the number of parallel computing nodes, for example equal to the number of parallel computing nodes or an integer multiple of it. The implementer may also select another suitable clustering algorithm according to the specific implementation scenario.
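As an illustration of grouping subsequences from pairwise difference indexes alone, the sketch below uses a simple k-medoids procedure. This is a substitution and not the K-means step of the embodiment: K-means needs coordinates to average, whereas a medoid-based method works directly on a precomputed difference matrix. The function name `k_medoids`, the deterministic initialization with the first `k` items, and the sample matrix are all assumptions.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items from a symmetric pairwise difference matrix (illustrative sketch)."""
    n = len(dist)
    medoids = list(range(k))  # assumption: start from the first k items
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes the summed difference to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Hypothetical difference indexes for four subsequences: 0 and 1 are alike, 2 and 3 are alike.
difference = [
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 9.0, 9.0],
    [9.0, 9.0, 0.0, 1.0],
    [8.0, 9.0, 1.0, 0.0],
]
labels, medoids = k_medoids(difference, k=2)  # e.g. k = number of parallel computing nodes
print(labels)  # [1, 1, 0, 0]
```

Each resulting label identifies one data block; all subsequences with the same label would be sent to the same parallel computing node.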

It should be noted that parallel computing refers to executing a plurality of instructions at the same time in order to increase computing speed. Here, all data blocks are transferred to the parallel computing nodes and the data in each block is processed in parallel on its node, which increases the efficiency of data cleaning.

And cleaning the mass big data according to the classification result, specifically, taking the local outlier factor of the data point in each class in the classification result as the abnormal degree of the data point, and eliminating the data point corresponding to the abnormal degree larger than the preset degree threshold.

In this embodiment, a COF outlier detection algorithm is used to obtain a local outlier factor corresponding to each data point, and the algorithm is a well-known technique and will not be described herein.

The larger the local outlier factor corresponding to the data point is, the more likely the data is abnormal, and the greater the corresponding degree of abnormality is, the more the data needs to be removed. The smaller the local outlier factor corresponding to the data point is, the more likely the data is normal data, and the smaller the corresponding abnormality degree is.

Therefore, when the abnormality degree of a data point is greater than the degree threshold, the data point is likely to be abnormal data and needs to be removed. The average value of the adjacent data points in the time sequence subsequence can then be used as the fitted value of the data point; the number of adjacent data points can be set according to the specific implementation scenario. In this embodiment, the degree threshold is 0.7, and the implementer may set the degree threshold according to the specific implementation scenario.
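The removal and refill step can be sketched as below. The sketch assumes the abnormality degrees are already available and scaled so that they are comparable with the 0.7 threshold; the function name `clean_subsequence`, the `window` parameter for the number of adjacent points on each side, and the choice to skip other flagged points when averaging are assumptions.

```python
def clean_subsequence(values, abnormality, threshold=0.7, window=2):
    """Replace points whose abnormality degree exceeds the threshold
    with the mean of nearby points that were not flagged."""
    cleaned = list(values)
    for i, score in enumerate(abnormality):
        if score <= threshold:
            continue
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbours = [
            values[j]
            for j in range(lo, hi)
            if j != i and abnormality[j] <= threshold
        ]
        if neighbours:  # leave the point unchanged if no usable neighbour exists
            cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned


values = [10.0, 11.0, 95.0, 12.0, 13.0]
abnormality = [0.1, 0.2, 0.9, 0.1, 0.3]  # hypothetical abnormality degrees
print(clean_subsequence(values, abnormality))  # [10.0, 11.0, 11.5, 12.0, 13.0]
```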

In summary, in the embodiment of the invention the time sequence subsequences are clustered using the calculated difference index, and time sequence subsequences with similar data distribution changes are placed in the same category to form a data block. Compared with data blocks formed without classifying the time sequence subsequences, this keeps the data distribution within the data block on each parallel computing node in a relatively narrow fluctuation range, which avoids deviations when the local outlier factors of the data are calculated on the node and improves the accuracy of data cleaning.

In the method for segmenting the time sequence into subsequences, the amount of data that would be transmitted between parallel computing nodes is evaluated within the K-distance neighborhood of each initial segmentation point: the neighborhood direction singleness of each data point in that neighborhood is obtained, and the final segmentation point is selected accordingly. This reduces data transmission between parallel computing nodes during parallel computation and improves data processing efficiency.

Further, when the classification measurement distance between time sequence subsequences is obtained, it is optimized using the morphological similarity between the time sequence subsequences and the importance of that similarity. Compared with the traditional DTW distance, a classification measurement distance can thus be calculated for time sequence subsequences with a large length difference, and the morphological similarity is used to reduce it, so that time sequence subsequences with a large length difference can still be grouped into one class according to their morphological similarity during clustering.

Finally, the classification measurement distance optimized by the morphological similarity is further optimized using the overall outlier state of the data in each time sequence subsequence. Even when the morphological similarity between time sequence subsequences is high, differences in their outlier states are therefore reflected in the distance, which avoids false detection of abnormal data caused by outlier state differences between time sequence subsequences in the same data block when a parallel computing node measures the abnormality degree of the data.

Example 2:

This embodiment provides an intelligent cleaning system for massive big data, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the intelligent cleaning method for massive big data. Since Example 1 has already described the intelligent cleaning method for massive big data in detail, it is not repeated here.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the scope of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims

1. The intelligent cleaning method for the massive big data is characterized by comprising the following steps of:

acquiring a time sequence subsequence corresponding to massive big data;

2. The method for intelligently cleaning massive big data according to claim 1, wherein the step of obtaining the morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences is specifically as follows:

3. The method for intelligently cleaning massive big data according to claim 1, wherein the obtaining the importance index corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences specifically comprises:

4. The intelligent cleaning method for massive big data according to claim 1, wherein the state factor for obtaining the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence is specifically:

5. The method for intelligently cleaning massive big data according to claim 1, wherein the obtaining the difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor specifically comprises:

6. The intelligent cleaning method for massive big data according to claim 5, wherein the calculating method of the difference index is specifically as follows:

wherein one term of the formula represents the state factor corresponding to the time sequence subsequence v.

7. The method for intelligently cleaning massive big data according to claim 1, wherein the time sequence subsequence corresponding to the massive big data is specifically: and dividing the time sequence formed by the massive big data to obtain a time sequence sub-sequence.

8. The method for intelligently cleaning massive big data according to claim 7, wherein the dividing the time sequence formed by massive big data to obtain the time sequence subsequence specifically comprises:

acquiring an initial segmentation point corresponding to the time sequence by using a flow algorithm;

in the neighborhood of the initial segmentation point, marking any data point in the neighborhood as a target data point, and respectively acquiring the number of data points which are positioned at the left side and the right side of the target data point and belong to the neighborhood of the target data point;

taking the larger value of the number of data points corresponding to the left side of the target data point and the number of data points corresponding to the right side of the target data point as a numerator, taking the total number of the data points contained in the neighborhood of the target data point as a denominator, and taking the normalized value of the ratio of the numerator to the denominator as the neighborhood direction singleness of the target data point;

and recording the data point corresponding to the maximum value of the neighborhood direction singleness in the neighborhood of the initial segmentation point as a final segmentation point, and segmenting the time sequence by using the final segmentation point to obtain a time sequence subsequence.

9. The method for intelligently cleaning massive big data according to claim 1, wherein the step of cleaning the massive big data according to the classification result is specifically as follows:

10. A mass big data intelligent cleaning system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of a mass big data intelligent cleaning method according to any of claims 1-9.