CN116383190A - Intelligent cleaning method and system for massive big data - Google Patents

Intelligent cleaning method and system for massive big data

Info

Publication number
CN116383190A
Authority
CN
China
Prior art keywords
time sequence
data
subsequences
subsequence
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310537830.7A
Other languages
Chinese (zh)
Other versions
CN116383190B (en
Inventor
贾庆佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Off Site Market Clearing Center Co ltd
Original Assignee
Qingdao Off Site Market Clearing Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Off Site Market Clearing Center Co ltd
Priority to CN202310537830.7A
Publication of CN116383190A
Application granted
Publication of CN116383190B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Preliminary Treatment Of Fibers (AREA)

Abstract

The invention relates to the technical field of data processing, in particular to an intelligent cleaning method and system for massive big data, wherein the method comprises the following steps: acquiring the time sequence subsequences corresponding to massive big data; obtaining a morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in the two time sequence subsequences; obtaining an importance index corresponding to any two time sequence subsequences according to the length difference between the two time sequence subsequences; obtaining a state factor of each time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference index, and performing data cleaning on the massive big data according to the classification result. The invention can obtain more accurate data cleaning results for time sequence data.

Description

Intelligent cleaning method and system for massive big data
Technical Field
The invention relates to the technical field of data processing, in particular to an intelligent cleaning method and system for massive big data.
Background
Data cleaning is an important step in data processing: it improves the quality and accuracy of data, ensures that the data are reliable and valid, and supports the decision making and business applications of a clearing center. During data cleaning, outlier detection must be performed on massive big data. Because the data volume is large, the collected data are sliced into data blocks to improve processing efficiency, and outlier detection is performed on each data block by parallel computing to obtain the outliers of each block. Finally, the outliers dispersed across the different data blocks are gathered and counted through outlier aggregation to obtain a global outlier detection result.
When the collected data are sliced, existing classification algorithms consider only the data differences between time sequences and ignore how those differences change between time sequences of different lengths. The data classification result is therefore inaccurate, and the data cleaning result obtained for each classified data block has low accuracy.
Disclosure of Invention
In order to solve the technical problem that the data cleaning result of each data block obtained by classification has low accuracy, the invention aims to provide an intelligent cleaning method and system for massive big data. The adopted technical scheme is as follows:
acquiring a time sequence subsequence corresponding to massive big data;
obtaining a morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences;
obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results.
Preferably, the method for obtaining the morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences specifically includes:
for any two time sequence subsequences, obtaining the change rate corresponding to each datum according to the difference between every two adjacent data in each time sequence subsequence; and taking the absolute value of the difference between the mean change rates of the two time sequence subsequences as the morphological similarity measurement index between the two time sequence subsequences.
Preferably, the obtaining the importance index corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences specifically includes:
acquiring the absolute value of the difference in length between every two time sequence subsequences and recording it as the length difference of those two time sequence subsequences;
for any two time sequence subsequences, taking the ratio of their length difference to the maximum of all the length differences as the importance index corresponding to the two time sequence subsequences.
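As a minimal illustration of the importance index just described, the following Python sketch computes the ratio for every pair of subsequences. The function name `importance_indices` and the handling of the case where all subsequences have equal length (every pair receives 0) are assumptions made for illustration; the invention does not specify them.

```python
from itertools import combinations


def importance_indices(lengths):
    """Map each pair (u, v) to |len_u - len_v| / max length difference."""
    diffs = {
        (u, v): abs(lengths[u] - lengths[v])
        for u, v in combinations(range(len(lengths)), 2)
    }
    max_diff = max(diffs.values(), default=0)
    if max_diff == 0:
        # Assumption: with no length differences at all, every pair gets 0.
        return {pair: 0.0 for pair in diffs}
    return {pair: d / max_diff for pair, d in diffs.items()}


# Three subsequences of lengths 50, 60 and 90.
weights = importance_indices([50, 60, 90])
# weights == {(0, 1): 0.25, (0, 2): 1.0, (1, 2): 0.75}
```

The pair with the largest length difference always receives an importance index of 1, and all other pairs are scaled relative to it.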
Preferably, obtaining the state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence specifically includes:
acquiring the average chaining distance corresponding to each datum in the time sequence subsequence by using the COF (connectivity-based outlier factor) detection algorithm, and taking the normalized value of the average chaining distances of all the data in the time sequence subsequence as the state factor of the time sequence subsequence.
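The state factor can be sketched in Python as follows. This is only an illustration under several assumptions that the text does not fix: the distance between two data is the absolute difference of their values, the average chaining distance follows the usual COF definition (a greedy nearest-point chain over the k nearest neighbours, with earlier links weighted more heavily), the per-datum distances are averaged within each subsequence, and the normalization is a min-max scaling across subsequences. The function names and the value of `k` are hypothetical.

```python
def average_chaining_distance(values, idx, k):
    """COF-style average chaining distance of values[idx] over its k nearest neighbours."""
    p = values[idx]
    # Indices of the k nearest neighbours of p by absolute value difference.
    nearest = sorted(
        (j for j in range(len(values)) if j != idx),
        key=lambda j: abs(values[j] - p),
    )[:k]
    chain = [p]                              # points already linked, starting at p
    remaining = [values[j] for j in nearest]
    r = len(remaining)                       # number of links in the chain
    total = 0.0
    for i in range(1, r + 1):
        # Next link: the remaining point closest to any point already in the chain.
        dist, q = min((min(abs(x - c) for c in chain), x) for x in remaining)
        remaining.remove(q)
        chain.append(q)
        # Earlier links get larger weights; the weights sum to 1.
        total += 2 * (r + 1 - i) / (r * (r + 1)) * dist
    return total


def state_factors(subsequences, k=20):
    """Assumed normalization: min-max scale the mean chaining distance per subsequence."""
    means = []
    for seq in subsequences:
        ac = [average_chaining_distance(seq, i, k) for i in range(len(seq))]
        means.append(sum(ac) / len(ac))
    lo, hi = min(means), max(means)
    if hi == lo:
        return [0.0] * len(means)
    return [(m - lo) / (hi - lo) for m in means]


smooth = [1.0, 1.1, 1.2, 1.3, 1.4]
spiky = [1.0, 1.1, 9.0, 1.2, 1.3]
factors = state_factors([smooth, spiky], k=2)  # the spiky subsequence gets the larger factor
```

Under these assumptions a subsequence containing isolated values receives a state factor close to 1, while a smooth subsequence receives a value close to 0.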
Preferably, the obtaining the difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor specifically includes:
acquiring the DTW (dynamic time warping) distance between any two time sequence subsequences, acquiring the difference between the state factors corresponding to the two time sequence subsequences, and obtaining the difference index between the two time sequence subsequences according to the DTW distance, the difference between the state factors, the morphological similarity measurement index and the importance index.
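For reference, the DTW distance used above can be computed with the classic dynamic-programming recurrence. The sketch below is a plain textbook version, not necessarily the variant used by the invention: it assumes one-dimensional data, uses the absolute difference as the local cost, and applies no warping-window constraint. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences of any lengths."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an earlier element of b as well
                cost[i][j - 1],      # b[j-1] is matched to an earlier element of a as well
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Sequences of different lengths with the same shape align at zero cost.
d = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # 0.0
```

Because DTW can align sequences of unequal length, it applies directly to the time sequence subsequences produced by segmentation, which generally differ in length.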
Preferably, the method for calculating the difference index specifically includes:
[Formula images Figure SMS_2 and Figure SMS_6 are not reproduced in the text.]
wherein the terms of the formula denote, in order: the difference index between the time sequence subsequence u and the time sequence subsequence v; the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v; the DTW distance between the time sequence subsequence u and the time sequence subsequence v; the morphological similarity measurement index between the time sequence subsequence u and the time sequence subsequence v; the state factor corresponding to the time sequence subsequence u; and the state factor corresponding to the time sequence subsequence v.
Preferably, the time sequence subsequences corresponding to the massive big data are obtained specifically by dividing the time sequence formed by the massive big data.
Preferably, the dividing the time sequence formed by the massive big data to obtain the time sequence subsequence specifically includes:
acquiring initial segmentation points of the time sequence by using the FLOSS algorithm; within the neighborhood of an initial segmentation point, marking any data point in the neighborhood as a target data point, and acquiring the numbers of data points that belong to the neighborhood of the target data point and lie on its left side and on its right side, respectively; taking the larger of the number of data points on the left side and the number of data points on the right side of the target data point as a numerator, taking the total number of data points contained in the neighborhood of the target data point as a denominator, and taking the normalized value of the ratio of the numerator to the denominator as the neighborhood direction singleness of the target data point; and recording the data point with the maximum neighborhood direction singleness within the neighborhood of the initial segmentation point as the final segmentation point, and segmenting the time sequence at the final segmentation points to obtain the time sequence subsequences.
Preferably, the step of performing data cleaning on the massive big data according to the classification result specifically includes:
taking the local outlier factor of each data point within each category of the classification result as the abnormal degree of that data point, and removing every data point whose abnormal degree is greater than a preset degree threshold.
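The cleaning step for one category can be sketched as follows. This is a simplified local outlier factor (LOF) computation written for illustration only: it assumes one-dimensional data with absolute difference as the distance, uses exactly k neighbours per point (ties in the k-distance are not expanded), and requires at least two data points. The function names, `k=20` and the threshold value 1.5 are hypothetical; the text only states that the threshold is preset.

```python
def local_outlier_factors(values, k):
    """Simplified LOF for 1-D values, using exactly k neighbours per point."""
    n = len(values)
    k = min(k, n - 1)  # assumes n >= 2

    def dist(i, j):
        return abs(values[i] - values[j])

    neighbours, k_dist = [], []
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist(i, j))[:k]
        neighbours.append(nearest)
        k_dist.append(dist(i, nearest[-1]))  # distance to the k-th nearest neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist(i, j)) for j in neighbours[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


def clean_category(values, k=20, threshold=1.5):
    """Drop every data point whose LOF exceeds the (hypothetical) threshold."""
    scores = local_outlier_factors(values, k)
    return [v for v, s in zip(values, scores) if s <= threshold]


category = [10.0, 10.1, 10.2, 9.9, 10.05, 50.0]
cleaned = clean_category(category, k=3)  # the isolated value 50.0 is removed
```

In the scheme described above, a function such as `clean_category` would be applied independently to each category, which is what allows the categories to be processed on separate parallel computing nodes.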
The invention also provides an intelligent cleaning system for the mass big data, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the steps of the intelligent cleaning method for the mass big data when being executed by the processor.
The embodiment of the invention has at least the following beneficial effects:
First, a morphological similarity measurement index between time sequence subsequences is obtained from the data differences between the time sequence subsequences corresponding to the massive big data; this index reflects how similar in shape the data variation of two time sequence subsequences is. Second, because two time sequence subsequences may differ in length, an importance index corresponding to the two subsequences is obtained from their length difference for use in the later calculation of the difference index. Third, because time sequence subsequences whose overall shapes are similar may still differ in their outlier states, a state factor is obtained for each time sequence subsequence from the abnormal condition of its data; the state factor reflects the outlier state of the data in the subsequence as a whole. Finally, the measured distance between time sequence subsequences, the morphological similarity of their data variation, their length difference and their outlier states are combined into a difference index between the time sequence subsequences, and classifying the subsequences according to this difference index yields a more accurate time sequence data classification result. Cleaning the massive big data according to this classification result yields a more accurate data cleaning result and improves data cleaning efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for intelligently cleaning massive big data.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of a specific implementation, structure, characteristics and effects of the method and system for intelligent cleaning of massive big data according to the invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
A specific scheme of the intelligent cleaning method and system for massive big data provided by the invention is described below with reference to the accompanying drawings.
The main purpose of the invention is as follows: time sequence subsequences are obtained by the FLOSS algorithm and then classified; the clusters obtained by classification are passed to computing nodes as the data blocks for parallel computation of local outlier factors, which completes the data cleaning. In this process, the data volumes of the data blocks need to be similar, both to avoid wasting the computing power of the parallel computing nodes and to ensure that the parallel computation achieves the best computing efficiency and the most accurate local outlier factors when data are transmitted between computing nodes.
The specific scenario addressed by the invention is as follows: when massive big data are cleaned, the data volume is too large to process as a whole, so to ensure data cleaning efficiency the data must be divided into data blocks that are processed efficiently by parallel computing nodes.
Example 1:
referring to fig. 1, a method flowchart of an intelligent cleaning method for massive big data according to an embodiment of the present invention is shown, and the method includes the following steps:
step one, acquiring a time sequence sub-sequence corresponding to massive big data.
To ensure the accuracy and reliability of transactions and settlements and to provide higher-quality services to customers, the clearing house needs to analyze and process transaction data in the financial market, for example for risk early warning, transaction verification and settlement processing. For the financial assets of the customers served by the institution, real-time financial market data, such as transaction data for securities, futures, foreign exchange and bonds, must be analyzed and processed.
This embodiment is described using the massive big data generated during financial transactions as an example: a data interface is established between the clearing data center and the financial transaction data source, the financial transaction data corresponding to the clients of the clearing center are collected, and time sequence data are formed according to the timestamps of the transaction data and recorded as a time sequence.
Before the data blocks for parallel computation are acquired, the time sequence formed by the massive big data must first be divided into time sequence subsequences. For the segmentation of time series data, the existing FLOSS (Fast Low-cost Online Semantic Segmentation) algorithm is commonly used because of its versatility and efficiency. Patent document publication No. CN113780295A discloses a method for dividing time series data by using the FLOSS algorithm.
When a time sequence is divided by the FLOSS algorithm, a Corrected Arc Crossings (CAC) sequence is acquired first, and segmentation points are then conventionally selected by setting a threshold. In the present scenario, massive big data are divided and the resulting data blocks are passed to parallel computing nodes for outlier detection. To maximize the computing efficiency of the parallel computing nodes and optimize the outlier detection effect, the time sequence lengths of the data blocks need to be relatively similar; the division must not introduce outlier detection errors for data points near the segmentation points; and data transmission between the parallel computing nodes should be reduced to keep efficiency as high as possible.
When the parallel nodes compute local outlier factors on the data blocks obtained by clustering, each datum in a data block needs the data points of its neighborhood. If the time sequence is divided inaccurately, the data being processed on one parallel node require data held by other parallel nodes, so data must be transmitted between the nodes. For massive big data this means more transmission between parallel nodes and lower data processing efficiency.
Based on this, initial segmentation points are obtained from the long time sequence by the FLOSS algorithm, the final accurate segmentation points are determined from the unidirectionality of the K-distance neighborhoods of the data points within the local range of each initial segmentation point, and the subsequences are obtained by segmentation. In the embodiment of the present invention, the relevant parameters of the FLOSS algorithm are set as follows: the number of segmentation points is numregions = N/50, where N represents the total number of data points contained in the time sequence formed by the massive big data and numregions represents the number of segmentation points, and the exclusion zone range L is set to 20. The implementer can adjust these values according to the specific implementation scenario.
After all the initial segmentation points are obtained by the conventional FLOSS algorithm, the final segmentation points are refined. The K-distance neighborhood of a data point o is determined by the K data points nearest to o in the time sequence. To ensure that little data transmission is required when local outlier factors are computed in parallel, the K-distance neighborhood of a data point should lie within the same subsequence as the data point itself, while the segmentation of the subsequences still follows the basic principle of the FLOSS algorithm. Therefore, within the K-distance neighborhood of each initial segmentation point, the data point whose K-distance neighborhood has the highest unidirectionality is selected as the final segmentation point.
It should be noted that the unidirectionality of the K-distance neighborhood of a data point is characterized by the number of data points located on the same side within that neighborhood. In this embodiment, the K-distance neighborhood refers to the neighborhood range formed by the K data points having the smallest data difference from the data point. The number K of data points in the neighborhood range is set to 20 in this embodiment, and the implementer can set it according to the specific implementation scenario.
Based on this, within the neighborhood of an initial segmentation point, any data point in the neighborhood is marked as a target data point, and the numbers of data points that belong to the neighborhood of the target data point and lie on its left side and on its right side are acquired respectively. The larger of the two numbers is taken as the numerator, the total number of data points contained in the neighborhood of the target data point is taken as the denominator, and the normalized value of the ratio of the numerator to the denominator is taken as the neighborhood direction singleness of the target data point.
In this embodiment, taking the data point i in the K-distance neighborhood of the initial segmentation point a as the target data point, the neighborhood direction singleness is expressed as:
$$F_{a,i} = \mathrm{Norm}\left(\frac{\max\left(n_i^{\mathrm{right}},\, n_i^{\mathrm{left}}\right)}{K}\right)$$
wherein $F_{a,i}$ represents the neighborhood direction singleness of the data point i in the K-distance neighborhood of the initial segmentation point a, namely the neighborhood direction singleness of the target data point; $n_i^{\mathrm{right}}$ represents the number of data points located to the right of the data point i and belonging to the K-distance neighborhood of the data point i; $n_i^{\mathrm{left}}$ represents the number of data points located to the left of the data point i and belonging to the K-distance neighborhood of the data point i; K represents the total number of data points contained within the neighborhood of the data point i; max() represents the maximum function; and Norm() represents the normalization function.
The ratio $\max(n_i^{\mathrm{right}}, n_i^{\mathrm{left}})/K$ represents the proportion of the data points in the neighborhood of the data point i that lie on the majority side. The larger this ratio, the more data points in the neighborhood of the data point i lie on the same side, the larger the corresponding neighborhood direction singleness, and the greater the unidirectionality of the data point i within its K-distance neighborhood.
The neighborhood direction singleness of a data point characterizes how one-sided its K-distance neighborhood is. The larger the value, the more data points in the neighborhood lie on the same side of the data point and the greater the unidirectionality of the data point within its K-distance neighborhood; the smaller the value, the fewer data points lie on the same side and the smaller the unidirectionality.
According to the above method, the neighborhood direction singleness is obtained for every data point within the K-distance neighborhood of each initial segmentation point. The data points in the neighborhood are screened by this value: the data point with the maximum neighborhood direction singleness in the neighborhood of the initial segmentation point is recorded as the final segmentation point, and the time sequence is segmented at the final segmentation points to obtain the time sequence subsequences.
Because the final segmentation point is selected according to the unidirectionality of the data points within the K-distance neighborhood of each initial segmentation point, segmenting the long time sequence in this way, compared with using the segmentation points of the conventional FLOSS algorithm, reduces data transmission between parallel nodes during parallel computation and improves data processing efficiency.
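The refinement of an initial segmentation point can be sketched in Python as below. The sketch assumes that the K-distance neighborhood consists of the K data points whose values differ least from the target data point, and that "left" and "right" refer to earlier and later positions in the time sequence. The Norm() step is omitted because any monotone normalization leaves the position of the maximum unchanged, and ties are resolved in favour of the candidate closest in value to the initial point. These choices and all function names are assumptions made for illustration.

```python
def k_neighbourhood(values, i, k):
    """Indices of the k data points whose values differ least from values[i]."""
    others = [j for j in range(len(values)) if j != i]
    return sorted(others, key=lambda j: abs(values[j] - values[i]))[:k]


def direction_singleness(values, i, k):
    """max(left count, right count) / K for the neighbourhood of point i."""
    neighbours = k_neighbourhood(values, i, k)
    left = sum(1 for j in neighbours if j < i)
    right = len(neighbours) - left
    return max(left, right) / len(neighbours)


def refine_split_point(values, initial, k):
    """Pick the candidate in the neighbourhood of `initial` with maximal singleness."""
    candidates = k_neighbourhood(values, initial, k)
    # max() keeps the first maximum, i.e. the candidate closest in value on ties.
    return max(candidates, key=lambda i: direction_singleness(values, i, k))


# A level shift occurs at index 4; the initial guess (index 5) is slightly late.
series = [10, 11, 12, 13, 50, 51, 52, 53]
final_point = refine_split_point(series, initial=5, k=3)  # 4
```

In this example the neighbors of index 4 all lie to its right, so cutting the time sequence there keeps the point and its whole neighborhood in the same subsequence, which is the property the method relies on to reduce data transmission between nodes.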
Step two, obtaining morphological similarity measurement indexes between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; and obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences.
After all the time sequence subsequences corresponding to the financial transaction big data to be cleaned are obtained, note that although the segmentation by final segmentation points in the embodiment of the invention lets the parallel nodes transmit less data between nodes during computation, the lengths of the resulting time sequence subsequences may still differ greatly. The distance between time sequence subsequences therefore needs to be optimized when the subsequences are clustered.
When the time sequence subsequences are clustered, distances must be measured between subsequences of different lengths, so distances between long and short sequences have to be calculated. When the DTW distance is calculated, the difference in length may bias the distance measured between subsequences. The distance measurement of the subsequences therefore needs to be optimized during clustering, and the average length of the time sequence subsequences within each category needs to be constrained, so that the data volumes of the categories obtained by clustering are similar and the computational load of each parallel node is similar.
In conventional K-means clustering, distance is measured by the Euclidean distance between data points; when time sequences are clustered, distance is measured by the DTW distance between time sequences. The distance measurement between time sequences needs to take the length difference of the sequences into account during clustering: the larger the length difference between two sequences, the more weight the data distribution similarity of the two sequences should receive. The distance between time sequence subsequences during clustering is therefore considered in two aspects, namely data distribution similarity and DTW distance, and the relative weight of the two is determined by the length difference.
Based on the above, the morphological similarity measurement index between any two time sequence subsequences is obtained according to the difference between the data in the two subsequences. Specifically, for any two time sequence subsequences, the change rate corresponding to each datum is obtained according to the difference between every two adjacent data in each subsequence, and the absolute value of the difference between the mean change rates of the two subsequences is taken as the morphological similarity measurement index between them, expressed by the formula:
$$X_{u,v} = \left| \frac{1}{N_u} \sum_{m=1}^{N_u} r_{u,m} - \frac{1}{N_v} \sum_{n=1}^{N_v} r_{v,n} \right|$$
wherein $X_{u,v}$ represents the morphological similarity measurement index of the time sequence subsequence u and the time sequence subsequence v; $r_{u,m}$ represents the change rate corresponding to the mth data in the time sequence subsequence u; $N_u$ represents the length of the time sequence subsequence u, i.e. the total amount of data contained in the time sequence subsequence u; $r_{v,n}$ represents the change rate corresponding to the nth data in the time sequence subsequence v; and $N_v$ represents the length of the time sequence subsequence v, i.e. the total amount of data contained in the time sequence subsequence v.
In this embodiment, the change rate corresponding to the mth data in the time sequence subsequence u may be obtained by calculating the difference between the mth data and the (m-1)th data and taking the ratio of this difference to the (m-1)th data as the change rate corresponding to the mth data. It is to be noted that when the change rate corresponding to the first data is calculated, the value of the preceding data is taken as 0 by default.
In other embodiments, the change rate corresponding to the mth data in the time sequence subsequence u may also be obtained as follows. Since the time sequence subsequence is a time sequence, there is a time interval between two adjacent data, that is, between the mth data and the (m-1)th data in the time sequence subsequence u. The time interval between the mth data and the (m-1)th data is obtained, the difference between the mth data and the (m-1)th data is calculated, and the ratio of the difference to the time interval is taken as the change rate corresponding to the mth data.
The change rate corresponding to any data in any time sequence subsequence is obtained in the same way as the change rate corresponding to the mth data in the time sequence subsequence u.
The absolute value of the difference between the mean change rates represents the difference between the data change conditions of the time sequence subsequence u and the time sequence subsequence v. The larger this difference, the more dissimilar the data distributions of the two time sequence subsequences, i.e. the more dissimilar their morphology, and the larger the corresponding morphological similarity measurement index. The smaller this difference, the more similar the data distributions of the two time sequence subsequences, i.e. the more similar their morphology, and the smaller the corresponding morphological similarity measurement index.
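The computation described above can be sketched as follows. This is only an illustration under stated assumptions: the names `change_rates` and `morphological_similarity` are hypothetical, and because a preceding value of 0 leaves the ratio for the first data undefined, the sketch assigns a change rate of 0.0 to the first data and to any point whose denominator is 0.

```python
def change_rates(values, times=None):
    """Change rate of each data in a non-empty subsequence (hypothetical helper).

    Without `times`, the rate is the difference to the previous data divided
    by the previous data; with `times`, it is divided by the time interval.
    Assumption: the first data, and any data whose denominator is 0, gets 0.0.
    """
    rates = [0.0]
    for m in range(1, len(values)):
        diff = values[m] - values[m - 1]
        denom = (times[m] - times[m - 1]) if times is not None else values[m - 1]
        rates.append(diff / denom if denom != 0 else 0.0)
    return rates


def morphological_similarity(u, v, times_u=None, times_v=None):
    """Absolute difference between the mean change rates of u and v."""
    rates_u = change_rates(u, times_u)
    rates_v = change_rates(v, times_v)
    return abs(sum(rates_u) / len(rates_u) - sum(rates_v) / len(rates_v))
```

A value near 0 indicates that the two subsequences change at a similar average rate, regardless of their lengths.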
When two time sequence subsequences differ in length, the last data point of the shorter subsequence is matched against the data points of the longer subsequence when the DTW distance is calculated, and this matching is caused solely by the length difference. When the data distributions of two time sequence subsequences are more similar, the measurement distance between them needs to be reduced, so that subsequences with similar distributions but a large length difference can be grouped into one category in the clustering process.
When the measurement distance between time sequence subsequences is obtained, the importance degree of the morphological similarity is obtained by normalizing the length differences of all pairs of subsequences by their maximum value; this importance degree balances the contributions of the DTW distance and the morphological similarity. In other words, the larger the length difference between two time sequence subsequences, the more important the measure of morphological similarity between them.
Based on this, an importance index corresponding to any two time sequence subsequences is obtained according to the length difference between them. Specifically, the absolute value of the difference in length between every two time sequence subsequences is obtained and recorded as the length difference; for any two time sequence subsequences, the ratio of their length difference to the maximum of all length differences is taken as the importance index corresponding to the two subsequences, expressed by the formula:
$$Z_{u,v} = \frac{\left| N_u - N_v \right|}{\max\left( \left| N_u - N_v \right| \right)}$$
wherein $Z_{u,v}$ represents the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v; $N_u$ represents the length of the time sequence subsequence u; $N_v$ represents the length of the time sequence subsequence v; max() represents the function taking the maximum value; $\left| N_u - N_v \right|$ represents the length difference; and $\max\left( \left| N_u - N_v \right| \right)$ represents the maximum of the length differences over all pairs of time sequence subsequences.
The length difference between the time sequence subsequence u and the time sequence subsequence v determines the importance index: the larger the length difference, the larger the corresponding importance index, and the more important the similarity of the data distributions of u and v, i.e. the measure of morphological similarity between the two time sequence subsequences.
The importance index characterizes how much weight the similarity of the data distributions of the time sequence subsequences should receive. The larger the importance index, the more attention needs to be paid to the data distributions of the subsequences, i.e. to their morphological similarity. The smaller the importance index, the less attention needs to be paid to the morphological similarity, and the measurement distance relies more directly on the DTW distance.
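The importance index can be sketched as below for all pairs of subsequences at once. The function name `importance_indices` and the returned dictionary layout are assumptions; the case in which every subsequence has the same length (maximum length difference of 0) is not addressed in the text, and the sketch returns 0.0 for every pair in that case.

```python
from itertools import combinations


def importance_indices(lengths):
    """Map each pair (u, v), u < v, to its importance index (hypothetical helper).

    The index is the pair's absolute length difference divided by the
    largest absolute length difference over all pairs.
    """
    diffs = {
        (u, v): abs(lengths[u] - lengths[v])
        for u, v in combinations(range(len(lengths)), 2)
    }
    max_diff = max(diffs.values(), default=0)
    if max_diff == 0:
        # Assumption: with identical lengths, morphology gets no extra weight.
        return {pair: 0.0 for pair in diffs}
    return {pair: d / max_diff for pair, d in diffs.items()}
```

For lengths 10, 20 and 40, the pair with the largest length difference (10 and 40) receives an importance index of 1, and the other pairs receive proportionally smaller values.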
Step three, obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results.
It should be noted that the morphological similarity measurement index measures the overall change state of the time sequence subsequences; even when the overall morphology of two subsequences is similar, their outlier states may still differ. That is, although the classification distance between two time sequence subsequences may be small, the outlier degrees of their data may differ: a subsequence with a larger outlier degree has higher overall local outlier factor values, and a subsequence with a smaller outlier degree has lower overall local outlier factor values. When abnormal data points are evaluated over two subsequences whose outlier degrees differ, the lower local outlier factor values may be ignored, so that abnormal data points in the subsequence with lower overall local outlier factor values may not be detected, which makes the data cleaning result less accurate.
Based on this, the state factor of the time sequence subsequence is obtained according to the abnormal condition of the data in the time sequence subsequence, specifically, in this embodiment, the average link distance corresponding to the data in the time sequence subsequence is obtained by using a COF outlier factor detection algorithm, and the normalized value of the average link distance of all the data in the time sequence subsequence is used as the state factor of the time sequence subsequence.
Obtaining the average link distance corresponding to the data in the time sequence subsequence by the COF outlier factor detection algorithm is a well-known technique, and the specific method is not described in this embodiment. The average link distance corresponding to a data reflects its abnormal condition and its outlier state: the larger the average link distance, the more likely the data is an outlier data point. The state factor of the time sequence subsequence therefore characterizes the outlier degree of the data in the subsequence as a whole. The larger the state factor, the larger the overall outlier degree of the data in the subsequence, and the higher the overall values of the local outlier factors calculated for its data. The smaller the state factor, the smaller the overall outlier degree of the data in the subsequence, and the lower the overall values of the local outlier factors calculated for its data.
Further, the closer the state factors of two time sequence subsequences, the closer the local outlier factors calculated for the data in the two subsequences. The larger the difference between the state factors of two time sequence subsequences, the larger the difference between the local outlier factors calculated for their data, so that abnormal data in the subsequence with the smaller state factor may be ignored and errors are introduced in the data cleaning process.
Based on this, on the premise that the similarity of the data distributions of the time sequence subsequences is considered, the overall outlier state of the data in each subsequence is further measured by its state factor. This avoids ignoring abnormal data in subsequences with lower values when abnormal data points are evaluated over subsequences with different outlier degrees, so that abnormal data can be detected more accurately in the data cleaning process.
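A minimal sketch of the state factor follows. It treats each data as a one-dimensional value, computes the average link (chaining) distance used by COF over the k nearest neighbours of each data, averages it over the subsequence, and scales the result by the largest mean across subsequences. The function names, the default k = 3, the absolute-difference distance and the max-scaling normalization are all assumptions, since the embodiment does not state them.

```python
def average_chaining_distances(values, k):
    """COF average chaining distance of every data in a 1-D sequence (sketch)."""
    n = len(values)
    result = []
    for p in range(n):
        # k nearest neighbours of data p by absolute value difference
        others = sorted((i for i in range(n) if i != p),
                        key=lambda i: abs(values[i] - values[p]))
        remaining = others[:k]
        r = len(remaining) + 1          # size of the neighbourhood including p
        chain = [p]
        ac = 0.0
        for i in range(1, r):
            # next step of the set-based nearest path: the remaining neighbour
            # closest to any data already on the chain
            nxt, cost = min(
                ((q, min(abs(values[q] - values[c]) for c in chain)) for q in remaining),
                key=lambda t: t[1],
            )
            # earlier links receive larger weights, as in the COF definition
            ac += 2 * (r - i) / (r * (r - 1)) * cost
            chain.append(nxt)
            remaining.remove(nxt)
        result.append(ac)
    return result


def state_factors(subsequences, k=3):
    """Mean average chaining distance per subsequence, scaled by the maximum."""
    means = []
    for seq in subsequences:
        ac = average_chaining_distances(seq, k)
        means.append(sum(ac) / len(ac))
    top = max(means)
    # Assumption: normalization divides by the largest mean across subsequences.
    return [m / top if top > 0 else 0.0 for m in means]
```

Two subsequences with close state factors would then yield outlier scores on a comparable scale when they are evaluated together in one data block.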
Further, a difference index between the time sequence subsequences is obtained according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor. Specifically, the DTW distance between any two time sequence subsequences is obtained, the difference between the state factors corresponding to the two time sequence subsequences is obtained, and the difference index between the time sequence subsequences is obtained according to the DTW distance, the difference between the state factors, the morphological similarity measurement index and the importance index, expressed by the formula:
[Formula images Figure SMS_30 and Figure SMS_34 are not reproduced in this text.]
wherein the formula combines the following quantities: the difference index between the time sequence subsequence u and the time sequence subsequence v; the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v; the DTW distance between the time sequence subsequence u and the time sequence subsequence v; the morphological similarity measurement index of the time sequence subsequence u and the time sequence subsequence v; the state factor corresponding to the time sequence subsequence u; and the state factor corresponding to the time sequence subsequence v.
The larger the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v, the larger the weight of the morphological similarity measurement index between the subsequences; that is, the more attention needs to be paid to the data distributions of the subsequences, i.e. to their morphological similarity. The smaller the importance index, the larger the weight of the DTW distance and the less attention needs to be paid to the morphological similarity, so that the measurement distance relies more directly on the DTW distance.
The absolute value of the difference between the state factors of the time sequence subsequence u and the time sequence subsequence v represents the difference of their state factors. The larger this difference, the larger the difference between the local outlier factors calculated for the data in the two subsequences, the larger the corresponding DTW distance term, and the larger the finally calculated measurement distance, i.e. the larger the difference index between the two time sequence subsequences.
Because the DTW distance between time sequence subsequences is limited by their length difference, the measurement distance used in the clustering process is obtained from both the morphological similarity measurement index and the DTW distance, and the difference between the overall outlier states of the data in the subsequences is further taken into account. The result is a difference index that characterizes the measurement distance between time sequence subsequences more accurately.
The difference indexes between all pairs of time sequence subsequences are then obtained by the above method, the time sequence subsequences are classified according to the difference indexes to obtain a plurality of categories, and the time sequence subsequences in each category form one data block for parallel computing. In this embodiment, the K-means clustering algorithm is used to classify the time sequence subsequences according to the difference index. The difference index serves as the measure of similarity between time sequence subsequences: the similarity is inversely related to the difference index, so the larger the similarity, the smaller the difference index. The number of cluster categories in the K-means clustering algorithm is selected by the implementer according to the number of parallel computing nodes; for example, the number of cluster categories is set equal to the number of parallel computing nodes, or to an integer multiple of the number of parallel computing nodes. The implementer may also select another suitable clustering algorithm for the classification according to the specific implementation scenario.
It should be noted that parallel computing refers to an operation that can execute a plurality of instructions at a time, so as to increase the computing speed, that is, all data blocks are transferred into parallel computing nodes, and related data is computed in parallel in each computing node, so as to increase the efficiency of data cleaning.
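The classification into data blocks can be sketched as follows. K-means requires centroids, which a pairwise difference index does not directly provide, so this sketch swaps in a simple k-medoids procedure operating on a precomputed matrix of difference indexes; that substitution, the function names and the deterministic initialization are assumptions and not the algorithm stated in the embodiment.

```python
def choose_cluster_count(num_nodes, multiple=1):
    """Cluster count as an integer multiple of the parallel node count."""
    return num_nodes * multiple


def k_medoids(diff, k, max_iter=100):
    """Cluster items from a symmetric matrix of difference indexes (sketch).

    Returns one cluster label per subsequence; each cluster forms a data block.
    """
    n = len(diff)
    medoids = list(range(k))  # assumption: start from the first k subsequences
    labels = [0] * n
    for _ in range(max_iter):
        # assignment step: each subsequence joins its closest medoid
        labels = [min(range(k), key=lambda c: diff[i][medoids[c]]) for i in range(n)]
        # update step: the new medoid minimizes the total difference in its cluster
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                new_medoids.append(min(members, key=lambda i: sum(diff[i][j] for j in members)))
            else:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels
```

With four parallel computing nodes, `choose_cluster_count(4)` gives four categories and `choose_cluster_count(4, 2)` gives eight, matching the two options described above.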
And cleaning the mass big data according to the classification result, specifically, taking the local outlier factor of the data point in each class in the classification result as the abnormal degree of the data point, and eliminating the data point corresponding to the abnormal degree larger than the preset degree threshold.
In this embodiment, a COF outlier detection algorithm is used to obtain a local outlier factor corresponding to each data point, and the algorithm is a well-known technique and will not be described herein.
The larger the local outlier factor corresponding to the data point is, the more likely the data is abnormal, and the greater the corresponding degree of abnormality is, the more the data needs to be removed. The smaller the local outlier factor corresponding to the data point is, the more likely the data is normal data, and the smaller the corresponding abnormality degree is.
Therefore, when the abnormality degree of a data point is greater than the degree threshold, the data point is more likely to be abnormal data and needs to be removed. The average value of the adjacent data points in the time sequence subsequence can be used as the fitting value of the removed data point, and the number of adjacent data points can be set according to the specific implementation scenario. In this embodiment, the value of the degree threshold is 0.7, and the implementer may set the value of the degree threshold according to the specific implementation scenario.
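The removal and fitting step can be sketched as below, assuming the abnormality degrees have already been computed and are on a scale comparable to the 0.7 threshold. The function name `clean_subsequence` and the `window` parameter (number of adjacent data points on each side) are hypothetical.

```python
def clean_subsequence(values, abnormality, threshold=0.7, window=1):
    """Replace data whose abnormality degree exceeds the threshold (sketch).

    Each flagged data is replaced by the mean of the unflagged data within
    `window` positions on either side. Returns the cleaned list and the
    indices of the flagged data.
    """
    cleaned = list(values)
    flagged = [i for i, a in enumerate(abnormality) if a > threshold]
    flagged_set = set(flagged)
    for i in flagged:
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbours = [values[j] for j in range(lo, hi) if j not in flagged_set]
        if neighbours:
            cleaned[i] = sum(neighbours) / len(neighbours)
        # Assumption: with no unflagged neighbour in range, the value is left as is.
    return cleaned, flagged
```

Excluding other flagged data from the average keeps one abnormal value from contaminating the fitting value of its neighbour.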
In summary, the time sequence subsequences are clustered by the difference index obtained by calculation in the embodiment of the invention, the time sequence subsequences with more similar data distribution change are divided into the same category to form the data blocks, and compared with the data blocks formed by not classifying the time sequence subsequences, the data distribution in the data blocks in each parallel computing node can be ensured to be in a relatively close fluctuation range, so that deviation in the process of calculating local outliers of the data in the parallel computing node is avoided, and the accuracy in the data cleaning process is improved.
In the method for partitioning the time sequence into subsequences, the amount of data transmitted among the parallel computing nodes during parallel computing is evaluated within the K-distance neighborhood of each initial partitioning point: the unidirectional degree of each data point in that neighborhood is obtained and used to select the final partitioning point. This reduces data transmission among the parallel nodes during parallel computing and improves data processing efficiency.
Further, when the classification measurement distance between time sequence subsequences is obtained, it is optimized by the morphological similarity between the subsequences and by the importance degree of that similarity. Compared with the conventional DTW distance, the classification measurement distance between subsequences with a large length difference is reduced according to their morphological similarity, so that subsequences with a large length difference can still be grouped into one category in the clustering process when their morphology is similar.
Finally, the classification measurement distance optimized by the morphological similarity is further optimized by the overall outlier state of the data in the time sequence subsequences. Even when the morphological similarity between subsequences is high, subsequences whose outlier states differ are kept apart, so that when the parallel computing nodes measure the abnormality degree of the data, false detection of abnormal data caused by outlier state differences between subsequences in the same data block is avoided.
Example 2:
This embodiment provides an intelligent cleaning system for massive big data, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the intelligent cleaning method for massive big data. Since embodiment 1 has already described the intelligent cleaning method for massive big data in detail, it is not described again here.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the scope of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (10)

1. The intelligent cleaning method for the massive big data is characterized by comprising the following steps of:
acquiring a time sequence subsequence corresponding to massive big data;
obtaining a morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences; obtaining importance indexes corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences;
obtaining a state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence; obtaining a difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor; and classifying the time sequence subsequences according to the difference indexes, and cleaning the data of massive big data according to the classification results.
2. The method for intelligently cleaning massive big data according to claim 1, wherein the step of obtaining the morphological similarity measurement index between any two time sequence subsequences according to the difference between the data in any two time sequence subsequences is specifically as follows:
for any two time sequence subsequences, obtaining a change rate corresponding to the data according to the difference between every two adjacent data in the time sequence subsequences; and taking the absolute value of the difference value between all the change rate average values corresponding to the two time sequence subsequences as a morphological similarity measurement index between any two time sequence subsequences.
3. The method for intelligently cleaning massive big data according to claim 1, wherein the obtaining the importance index corresponding to any two time sequence subsequences according to the length difference between any two time sequence subsequences specifically comprises:
acquiring the absolute value of the difference of the length between every two time sequence subsequences and recording the absolute value as the length difference of the time sequence subsequences;
for any two time sequence subsequences, the ratio between the length difference corresponding to the two time sequence subsequences and the maximum value in all the length differences is used as an importance index corresponding to any two time sequence subsequences.
4. The intelligent cleaning method for massive big data according to claim 1, wherein obtaining the state factor of the time sequence subsequence according to the abnormal condition of the data in the time sequence subsequence is specifically:
and acquiring the average link distance corresponding to the data in the time sequence subsequence by using a COF outlier factor detection algorithm, and taking a normalized value of the average link distance of all the data in the time sequence subsequence as a state factor of the time sequence subsequence.
5. The method for intelligently cleaning massive big data according to claim 1, wherein the obtaining the difference index between the time sequence subsequences according to the distance between the time sequence subsequences, the morphological similarity measurement index, the importance index and the state factor specifically comprises:
and acquiring the DTW distance between any two time sequence subsequences, acquiring the difference between state factors corresponding to the two time sequence subsequences, and acquiring a difference index between the time sequence subsequences according to the DTW distance, the difference between the state factors, the morphological similarity measurement index and the importance index.
6. The intelligent cleaning method for massive big data according to claim 5, wherein the calculating method of the difference index is specifically as follows:
[Formula images Figure QLYQS_2 and Figure QLYQS_5 are not reproduced in this text.]
wherein the formula combines the following quantities: the difference index between the time sequence subsequence u and the time sequence subsequence v; the importance index corresponding to the time sequence subsequence u and the time sequence subsequence v; the DTW distance between the time sequence subsequence u and the time sequence subsequence v; the morphological similarity measurement index of the time sequence subsequence u and the time sequence subsequence v; the state factor corresponding to the time sequence subsequence u; and the state factor corresponding to the time sequence subsequence v.
7. The method for intelligently cleaning massive big data according to claim 1, wherein the time sequence subsequence corresponding to the massive big data is specifically: and dividing the time sequence formed by the massive big data to obtain a time sequence sub-sequence.
8. The method for intelligently cleaning massive big data according to claim 7, wherein the dividing the time sequence formed by massive big data to obtain the time sequence subsequence specifically comprises:
acquiring an initial segmentation point corresponding to the time sequence by using a flow algorithm;
in the neighborhood of the initial dividing point, marking any data point in the neighborhood as a target data point, and respectively acquiring the number of data points which are positioned at the left side and the right side of the target data point and belong to the neighborhood of the target data;
taking the larger value of the number of data points corresponding to the left side of the target data point and the number of data points corresponding to the right side of the target data point as a numerator, taking the total number of the data points contained in the neighborhood of the target data point as a denominator, and taking the normalized value of the ratio of the numerator to the denominator as a single degree of the neighborhood direction of the target data point;
and recording the data point corresponding to the maximum value of the neighborhood direction singleness in the neighborhood of the initial segmentation point as a final segmentation point, and segmenting the time sequence by using the final segmentation point to obtain a time sequence subsequence.
9. The method for intelligently cleaning massive big data according to claim 1, wherein the step of cleaning the massive big data according to the classification result is specifically as follows:
and taking the local outlier factor of the data point in each category in the classification result as the abnormal degree of the data point, and removing the data point corresponding to the abnormal degree larger than the preset degree threshold.
10. A mass big data intelligent cleaning system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of a mass big data intelligent cleaning method according to any of claims 1-9.
CN202310537830.7A 2023-05-15 2023-05-15 Intelligent cleaning method and system for massive financial transaction big data Active CN116383190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310537830.7A CN116383190B (en) 2023-05-15 2023-05-15 Intelligent cleaning method and system for massive financial transaction big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310537830.7A CN116383190B (en) 2023-05-15 2023-05-15 Intelligent cleaning method and system for massive financial transaction big data

Publications (2)

Publication Number Publication Date
CN116383190A true CN116383190A (en) 2023-07-04
CN116383190B CN116383190B (en) 2023-08-25

Family

ID=86964207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310537830.7A Active CN116383190B (en) 2023-05-15 2023-05-15 Intelligent cleaning method and system for massive financial transaction big data

Country Status (1)

Country Link
CN (1) CN116383190B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612641A (en) * 2023-07-19 2023-08-18 天津中德应用技术大学 Vehicle queue control data processing method based on intelligent network connection
CN116703485A (en) * 2023-08-04 2023-09-05 山东创亿智慧信息科技发展有限责任公司 Advertisement accurate marketing method and system based on big data
CN117422345A (en) * 2023-12-18 2024-01-19 泰安金冠宏食品科技有限公司 Oil-residue separation quality assessment method and system
CN117556108A (en) * 2024-01-12 2024-02-13 泰安金冠宏食品科技有限公司 Abnormal detection method for oil-residue separation efficiency based on data analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966017A (en) * 2021-03-01 2021-06-15 北京青萌数海科技有限公司 Abnormal subsequence detection method with indefinite length in time sequence
US20210216386A1 (en) * 2018-07-23 2021-07-15 Mitsubishi Electric Corporation Time-sequential data diagnosis device, additional learning method, and recording medium
CN113705726A (en) * 2021-09-15 2021-11-26 北京沃东天骏信息技术有限公司 Traffic classification method and device, electronic equipment and computer readable medium
WO2021238455A1 (en) * 2020-05-29 2021-12-02 中兴通讯股份有限公司 Data processing method and device, and computer-readable storage medium
CN115982611A (en) * 2023-03-14 2023-04-18 北京易能中网技术有限公司 Clustering algorithm-based power user energy characteristic analysis method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210216386A1 (en) * 2018-07-23 2021-07-15 Mitsubishi Electric Corporation Time-sequential data diagnosis device, additional learning method, and recording medium
WO2021238455A1 (en) * 2020-05-29 2021-12-02 中兴通讯股份有限公司 Data processing method and device, and computer-readable storage medium
CN112966017A (en) * 2021-03-01 2021-06-15 北京青萌数海科技有限公司 Abnormal subsequence detection method with indefinite length in time sequence
CN113705726A (en) * 2021-09-15 2021-11-26 北京沃东天骏信息技术有限公司 Traffic classification method and device, electronic equipment and computer readable medium
CN115982611A (en) * 2023-03-14 2023-04-18 北京易能中网技术有限公司 Clustering algorithm-based power user energy characteristic analysis method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
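Xin Wang et al.: "Detecting anomalies in symbolic sequence dataset", 《IEEE》, pages 443 - 447 *
展鹏 (Zhan Peng): "Research on key technologies of anomaly detection based on time series mining", 《China Doctoral Dissertations Full-text Database, Basic Sciences》, no. 04, pages 002 - 65 *
曹洋洋 (Cao Yangyang) et al.: "Similarity measure based on morphological distance and adaptive weights", 《Application Research of Computers》, vol. 35, no. 09, pages 2638 - 2642 *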

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612641A (en) * 2023-07-19 2023-08-18 天津中德应用技术大学 Vehicle queue control data processing method based on intelligent network connection
CN116612641B (en) * 2023-07-19 2023-09-22 天津中德应用技术大学 Vehicle queue control data processing method based on intelligent network connection
CN116703485A (en) * 2023-08-04 2023-09-05 山东创亿智慧信息科技发展有限责任公司 Advertisement accurate marketing method and system based on big data
CN116703485B (en) * 2023-08-04 2023-10-20 山东创亿智慧信息科技发展有限责任公司 Advertisement accurate marketing method and system based on big data
CN117422345A (en) * 2023-12-18 2024-01-19 泰安金冠宏食品科技有限公司 Oil-residue separation quality assessment method and system
CN117422345B (en) * 2023-12-18 2024-03-12 泰安金冠宏食品科技有限公司 Oil-residue separation quality assessment method and system
CN117556108A (en) * 2024-01-12 2024-02-13 泰安金冠宏食品科技有限公司 Abnormal detection method for oil-residue separation efficiency based on data analysis
CN117556108B (en) * 2024-01-12 2024-03-26 泰安金冠宏食品科技有限公司 Abnormal detection method for oil-residue separation efficiency based on data analysis

Also Published As

Publication number Publication date
CN116383190B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN116383190B (en) Intelligent cleaning method and system for massive financial transaction big data
US9454902B2 (en) Performing-time-series based predictions with projection thresholds using secondary time-series-based information stream
US20200151628A1 (en) Adaptive Fraud Detection
CN107742127A (en) Improved anti-electricity-theft intelligent early-warning system and method
CN115577275A (en) Time sequence data anomaly monitoring system and method based on LOF and isolated forest
CN109934301B (en) Power load cluster analysis method, device and equipment
WO2022237088A1 (en) Root cause locating method, electronic device, and storage medium
CN116739645A (en) Order abnormity supervision system based on enterprise management
CN117608499B (en) Intelligent traffic data optimal storage method based on Internet of things
CN111191720A (en) Service scene identification method and device and electronic equipment
CN116258864B (en) Village planning construction big data management system
CN116257651B (en) Intelligent monitoring system for abnormal sound of through channel cab apron
CN116151950B (en) Intelligent banking outlet scheduling management method, system and storage medium
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN115082135B (en) Method, device, equipment and medium for identifying online time difference
CN116295506A (en) Method, device, equipment and medium for predicting vehicle remaining mileage
CN116719714A (en) Training method and corresponding device for screening model of test case
CN114722098A (en) Typical load curve identification method based on normal cloud model and density clustering algorithm
CN113852629B (en) Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium
KR20220123845A (en) Method and device for measuring similarity between time series data
CN117235651B (en) Enterprise information data optimization management system based on Internet of things
CN113535527A (en) Load shedding method and system for real-time flow data predictive analysis
CN117196831B (en) Financial service-oriented risk prediction method and system
CN116070150B (en) Abnormality monitoring method based on operation parameters of breathing machine
Ghoniemy et al. Robust scoring and ranking of object tracking techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant