CN115185932A - Data processing method and device - Google Patents


Info

Publication number
CN115185932A
Authority
CN
China
Prior art keywords
data
abnormal
data points
time
numerical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210663208.6A
Other languages
Chinese (zh)
Inventor
宋韶旭
赵东明
贺文迪
龚怿焜
王建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202210663208.6A
Publication of CN115185932A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data

Abstract

The embodiments of the application provide a data processing method and device. The method comprises: acquiring time series data to be processed from a terminal device, wherein the time series data to be processed comprises N data points and N is an integer greater than 1; and determining, from the time series data to be processed, abnormal data points that meet a preset condition and marking the abnormal data points. The preset condition is used to screen out data points exhibiting any one or more of the following anomalies: a missing timestamp, a null value, a time interval between adjacent data points that does not satisfy the acquisition interval condition, an abnormal value distribution, an abnormal value change speed distribution, or an abnormal value change acceleration distribution. The method gives effective feedback on the usability of the time series data and improves the accuracy of time series analysis or mining results.

Description

Data processing method and device
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a data processing method and device.
Background
With the new wave of information technology and the growing demand for industrial upgrading, the industrial Internet of Things is expanding rapidly and, correspondingly, generating ever more time series data.
Time series data is data recorded in time order; it represents how the value of an index changes across acquisition times. In practical application scenarios, the various devices in the industrial Internet of Things can be monitored by analyzing or mining time series data.
However, in the prior art, the time series data used for analysis and mining often contains abnormal data, so that some assumptions made about the data deviate from reality and the accuracy of the analysis or mining results is often low.
Disclosure of Invention
The embodiment of the application provides a data processing method and device, so that the usability of time sequence data is effectively fed back, and the accuracy of time sequence data analysis or mining results is improved.
In a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring time sequence data to be processed from terminal equipment, wherein the time sequence data to be processed comprises N data points, and N is an integer greater than 1;
determining, from the time series data to be processed, abnormal data points that meet a preset condition and marking the abnormal data points; the preset condition is used to screen out data points exhibiting any one or more of the following anomalies: a missing timestamp, a null value, a time interval between adjacent data points that does not satisfy the acquisition interval condition, an abnormal value distribution, an abnormal value change speed distribution, or an abnormal value change acceleration distribution.
In a possible implementation manner, determining an abnormal data point meeting a preset condition and marking the abnormal data point according to the time series data to be processed includes:
marking data points with missing timestamps in the time sequence data to be processed and data points with numerical values of null values as integrity abnormal data, and obtaining a first group of processed time sequence data except the integrity abnormal data in the time sequence data to be processed;
calculating the time interval of any two adjacent data points in the first group of processed time series data;
marking data points in the first group of processed time series data whose time intervals do not satisfy the acquisition interval condition as integrity abnormal data, timeliness abnormal data or consistency abnormal data; the integrity abnormal data comprises adjacent data points whose time interval exceeds L times the standard acquisition interval; the timeliness abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which at least one time interval exceeding L times the standard acquisition interval exists within a time window; and the consistency abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which no time interval exceeding L times the standard acquisition interval exists within the time window, wherein L is a number greater than or equal to 2, and Q is a number greater than 0 and less than or equal to 1/2.
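The first two steps above (setting aside incomplete points, then measuring adjacent time intervals) can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the `(timestamp, value)` tuple layout, the use of `None` for a missing field, and the function names are all assumptions, and the points are assumed to arrive in timestamp order.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (timestamp, value); None stands for a missing field.
Point = Tuple[Optional[float], Optional[float]]


def split_incomplete(points: List[Point]) -> Tuple[List[int], List[Tuple[float, float]]]:
    """Return (indices marked as integrity abnormal, remaining complete points)."""
    flagged: List[int] = []
    kept: List[Tuple[float, float]] = []
    for i, (ts, val) in enumerate(points):
        if ts is None or val is None:
            flagged.append(i)        # missing timestamp or null value
        else:
            kept.append((ts, val))   # part of the first group of processed data
    return flagged, kept


def adjacent_intervals(kept: List[Tuple[float, float]]) -> List[float]:
    """Time interval between each pair of adjacent data points."""
    return [b[0] - a[0] for a, b in zip(kept, kept[1:])]
```

The resulting interval list is what the acquisition interval condition is then checked against.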
In one possible implementation, the marking, as integrity abnormal data, timeliness abnormal data, or consistency abnormal data, a data point in the first set of processed time series data for which the time interval does not satisfy the acquisition interval condition includes:
marking adjacent data points with the time interval larger than L times of the standard acquisition interval as integrity abnormal data;
when redundant data points whose time interval is less than Q times the standard acquisition interval are found, searching the time window of the redundant data points for adjacent data points whose time interval exceeds L times the standard acquisition interval;
if adjacent data points with the time interval exceeding L times of the standard acquisition interval exist, the redundant data points are moved to the position between the adjacent data points with the time interval exceeding L times of the standard acquisition interval, and the redundant data points are marked as timeliness abnormal data;
if there are no adjacent data points within the time window having a time interval that exceeds L times the standard acquisition interval, the redundant data points are marked as consistency abnormal data.
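The classification rule above can be sketched as follows. Several details are assumptions made only for illustration: the later point of a too-short interval is treated as the redundant point, a large gap counts as being "within the time window" when either of its endpoints lies within `window` of the redundant point, the default window of five standard intervals is arbitrary, and the step that moves a redundant point into the gap is left out.

```python
from typing import Dict, List, Optional


def classify_intervals(ts: List[float], std: float, L: float = 2.0,
                       Q: float = 0.5, window: Optional[float] = None) -> Dict[int, str]:
    """Label point indices by interval anomaly; ts must be in ascending order."""
    if window is None:
        window = 5 * std  # hypothetical default window width
    labels: Dict[int, str] = {}
    # Indices j where the gap between ts[j] and ts[j + 1] exceeds L * std.
    big = [j for j in range(len(ts) - 1) if ts[j + 1] - ts[j] > L * std]
    for j in big:
        labels[j] = "integrity"
        labels[j + 1] = "integrity"
    for i in range(1, len(ts)):
        if ts[i] - ts[i - 1] < Q * std:  # redundant point (assumed: the later one)
            near_gap = any(abs(ts[j] - ts[i]) <= window or
                           abs(ts[j + 1] - ts[i]) <= window for j in big)
            labels[i] = "timeliness" if near_gap else "consistency"
    return labels
```

In this sketch a point that both ends a large gap and is redundant keeps only the later label; a fuller implementation might record both.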
In one possible implementation manner, the method further includes:
repairing the first group of processed time series data to obtain a second group of processed time series data, wherein the time interval of any two adjacent data points in the second group of processed time series data satisfies the acquisition interval condition;
and marking the validity abnormal data in the second group of processed time sequence data according to the distribution of the second group of processed time sequence data.
In one possible implementation, the repairing the first set of processed time series data includes:
and performing time stamp repairing on the data points marked with the timeliness abnormal data and the data points marked with the consistency abnormal data in the first group of processed time series data, and performing interpolation repairing on the data points marked with the integrity abnormal data in the first group of processed time series data.
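The interpolation half of the repair step might look like the sketch below. The text does not say which interpolation is used, so linear interpolation on an evenly divided gap is an assumption here, as are the function name and the `(timestamp, value)` tuple layout; timestamp repair is not shown.

```python
from typing import List, Tuple


def fill_gaps(points: List[Tuple[float, float]], std: float,
              L: float = 2.0) -> List[Tuple[float, float]]:
    """Insert linearly interpolated points into gaps longer than L * std."""
    if not points:
        return []
    out = [points[0]]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        gap = t1 - t0
        if gap > L * std:
            n = round(gap / std)  # number of standard intervals the gap spans
            for k in range(1, n):
                # k-th evenly spaced point between the two neighbours
                out.append((t0 + k * gap / n, v0 + (v1 - v0) * k / n))
        out.append((t1, v1))
    return out
```

For example, with a standard interval of 10, the gap between `(10, 1.0)` and `(40, 4.0)` is filled with points at timestamps 20 and 30.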
In one possible implementation manner, marking validity abnormal data in the second set of processed time series data according to the distribution of the second set of processed time series data includes:
calculating the numerical value distribution, the numerical value change speed distribution and/or the numerical value change acceleration distribution of data points in the second group of processed time sequence data;
and marking data points with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution in the second group of processed time series data.
In one possible implementation, the marking data points of the second set of processed time series data with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution includes:
marking data points in the second group of processed time series data for which the absolute value of the difference between the value and the value average exceeds K times the value standard deviation as data points with abnormal value distribution;
marking data points in the second group of processed time series data for which the absolute value of the difference between the value change speed and the value change speed average exceeds K times the value change speed standard deviation as data points with abnormal value change speed distribution;
and marking data points in the second group of processed time series data for which the absolute value of the difference between the value change acceleration and the value change acceleration average exceeds K times the value change acceleration standard deviation as data points with abnormal value change acceleration distribution.
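The K-times-standard-deviation test above applies the same rule to three series. The sketch below assumes that speed is the first finite difference of the values over time and that acceleration is the finite difference of the speeds; the function names and the default `k=3.0` are illustrative. Timestamps are assumed to be strictly increasing, which the repaired data should satisfy.

```python
from typing import List, Sequence


def rates(ts: Sequence[float], xs: Sequence[float]) -> List[float]:
    """Finite-difference rate of change of xs with respect to ts."""
    return [(xs[i + 1] - xs[i]) / (ts[i + 1] - ts[i]) for i in range(len(xs) - 1)]


def k_sigma_outliers(xs: Sequence[float], mu: float, sigma: float,
                     k: float = 3.0) -> List[int]:
    """Indices whose distance from the mean mu exceeds k standard deviations."""
    return [i for i, x in enumerate(xs) if abs(x - mu) > k * sigma]


# Usage: speeds come from the values, accelerations from the speeds.
ts = [0, 1, 2, 4]
values = [0, 2, 4, 4]
speeds = rates(ts, values)      # [2.0, 2.0, 0.0]
accels = rates(ts[1:], speeds)  # [0.0, -1.0]
```

Each of the three series is then passed to `k_sigma_outliers` with its own mean and standard deviation.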
In one possible implementation, the method further includes:
reading sample time sequence data from the terminal equipment, wherein the sample time sequence data comprises M data points;
calculating the approximate median of the time intervals among the M data points to obtain a standard acquisition interval;
and calculating the numerical average value, the numerical standard deviation, the numerical change speed average value, the numerical change speed standard deviation, the numerical change acceleration average value and the numerical change acceleration standard deviation of the M data points.
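The baseline statistics described above can be sketched with Python's standard `statistics` module. Three things here are assumptions rather than details from the text: the exact median stands in for the "approximate median", the population standard deviation (`pstdev`) is used, and speed and acceleration are taken as finite differences. The sample needs at least three points so that the acceleration series is not empty.

```python
from statistics import mean, median, pstdev
from typing import Dict, List, Sequence


def _rates(ts: Sequence[float], xs: Sequence[float]) -> List[float]:
    """Finite-difference rate of change of xs with respect to ts."""
    return [(xs[i + 1] - xs[i]) / (ts[i + 1] - ts[i]) for i in range(len(xs) - 1)]


def baseline(ts: Sequence[float], values: Sequence[float]) -> Dict[str, object]:
    """Standard acquisition interval plus (mean, std) of value, speed, acceleration."""
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    speeds = _rates(ts, values)
    accels = _rates(ts[1:], speeds)
    return {
        "std_interval": median(intervals),  # exact median used in this sketch
        "value": (mean(values), pstdev(values)),
        "speed": (mean(speeds), pstdev(speeds)),
        "accel": (mean(accels), pstdev(accels)),
    }
```

Taking the median rather than the mean keeps the standard interval from being skewed by a few very long or very short gaps in the sample.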
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring to-be-processed time sequence data from the terminal equipment, wherein the to-be-processed time sequence data comprises N data points, and N is an integer greater than 1;
the first determining module is used for determining, from the time series data to be processed, abnormal data points that meet a preset condition and marking the abnormal data points; the preset condition is used to screen out data points exhibiting any one or more of the following anomalies: a missing timestamp, a null value, a time interval between adjacent data points that does not satisfy the acquisition interval condition, an abnormal value distribution, an abnormal value change speed distribution, or an abnormal value change acceleration distribution.
In a possible implementation manner, the first determining module is specifically configured to:
marking data points with missing timestamps in the time sequence data to be processed and data points with numerical values of null values as integrity abnormal data, and obtaining a first group of processed time sequence data except the integrity abnormal data in the time sequence data to be processed;
calculating the time interval of any two adjacent data points in the first group of processed time sequence data;
marking data points in the first group of processed time series data whose time intervals do not satisfy the acquisition interval condition as integrity abnormal data, timeliness abnormal data or consistency abnormal data; the integrity abnormal data comprises adjacent data points whose time interval exceeds L times the standard acquisition interval; the timeliness abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which at least one time interval exceeding L times the standard acquisition interval exists within a time window; and the consistency abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which no time interval exceeding L times the standard acquisition interval exists within the time window, wherein L is a number greater than or equal to 2, and Q is a number greater than 0 and less than or equal to 1/2.
In a possible implementation manner, the first determining module is further specifically configured to:
marking adjacent data points with the time interval larger than L times of the standard acquisition interval as integrity abnormal data;
when redundant data points whose time interval is less than Q times the standard acquisition interval are found, searching the time window of the redundant data points for adjacent data points whose time interval exceeds L times the standard acquisition interval;
if adjacent data points with time intervals exceeding L times of the standard acquisition interval exist, the redundant data points are moved to the position between the adjacent data points with time intervals exceeding L times of the standard acquisition interval, and the redundant data points are marked as timeliness abnormal data;
if there are no adjacent data points within the time window having a time interval that exceeds L times the standard acquisition interval, the redundant data points are marked as consistency abnormal data.
In one possible implementation manner, the apparatus further includes:
a repair module, used for repairing the first group of processed time series data to obtain a second group of processed time series data, wherein the time interval of any two adjacent data points in the second group of processed time series data satisfies the acquisition interval condition; and
a marking module, used for marking the validity abnormal data in the second group of processed time series data according to the distribution of the second group of processed time series data.
In a possible implementation manner, the repair module is specifically configured to:
and performing time stamp repairing on the data points marked with the timeliness abnormal data and the data points marked with the consistency abnormal data in the first group of processed time sequence data, and performing interpolation repairing on the data points marked with the integrity abnormal data in the first group of processed time sequence data.
In one possible implementation, the marking module is specifically configured to:
calculating the numerical value distribution, the numerical value change speed distribution and/or the numerical value change acceleration distribution of data points in the second group of processed time sequence data;
and marking data points with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution in the second group of processed time series data.
In a possible implementation manner, the marking module is further specifically configured to:
marking data points in the second group of processed time series data for which the absolute value of the difference between the value and the value average exceeds K times the value standard deviation as data points with abnormal value distribution;
marking data points in the second group of processed time series data for which the absolute value of the difference between the value change speed and the value change speed average exceeds K times the value change speed standard deviation as data points with abnormal value change speed distribution;
and marking data points in the second group of processed time series data for which the absolute value of the difference between the value change acceleration and the value change acceleration average exceeds K times the value change acceleration standard deviation as data points with abnormal value change acceleration distribution.
In one possible implementation manner, the apparatus further includes:
the reading module is used for reading sample time sequence data from the terminal equipment, and the sample time sequence data comprises M data points;
the first calculation module is used for calculating the approximate median of the time intervals among the M data points to obtain a standard acquisition interval;
and the second calculation module is used for calculating the numerical average value, the numerical standard deviation, the numerical change speed average value, the numerical change speed standard deviation, the numerical change acceleration average value and the numerical change acceleration standard deviation of the M data points.
In one possible implementation manner, the apparatus further includes:
and the second determining module is used for calculating the integrity index, the consistency index, the timeliness index, the validity index and/or the overall data quality index of the time sequence data to be processed according to the respective proportion of the data points marked with the integrity abnormal data, the consistency abnormal data, the timeliness abnormal data and/or the validity abnormal data in the time sequence data to be processed.
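One way to turn the marked proportions into indexes is sketched below. The exact formulas are not given in this passage, so two assumptions are made purely for illustration: each index is one minus the fraction of data points marked with that anomaly type, and the overall index is the unweighted mean of the four. The function and key names are hypothetical.

```python
from typing import Dict

DIMENSIONS = ("integrity", "consistency", "timeliness", "validity")


def quality_indices(n_total: int, counts: Dict[str, int]) -> Dict[str, float]:
    """Per-dimension index = 1 - flagged / total (assumed formula)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    idx = {d: 1.0 - counts.get(d, 0) / n_total for d in DIMENSIONS}
    # Assumed aggregation: unweighted mean of the four indexes.
    idx["overall"] = sum(idx[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return idx
```

With 100 data points of which 10 are marked integrity abnormal and 5 validity abnormal, this gives an integrity index of 0.9, a validity index of 0.95, and 1.0 for the other two dimensions.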
In a third aspect, an embodiment of the present application provides a data processing apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the data processing method as in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, a data processing method as in the first aspect or any one of the possible implementation manners of the first aspect is implemented.
In the embodiments of the application, time series data to be processed is acquired from a terminal device, wherein the time series data to be processed comprises N data points and N is an integer greater than 1; abnormal data points that meet a preset condition are determined from the time series data to be processed and are marked; the preset condition is used to screen out data points exhibiting any one or more of the following anomalies: a missing timestamp, a null value, a time interval between adjacent data points that does not satisfy the acquisition interval condition, an abnormal value distribution, an abnormal value change speed distribution, or an abnormal value change acceleration distribution. The quality of the time series data is thus evaluated on the basis of multiple types of abnormal data, which gives effective feedback on the usability of the time series data and improves the accuracy of time series analysis or mining results.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of a scenario in which an embodiment of the present application is applied;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of anomaly data provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of anomaly data provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing flow provided in an embodiment of the present application;
FIG. 6 is a schematic flow diagram of a system for analyzing the quality of Internet of Things time series data according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
The above figures show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any manner, but rather to explain the concept of the application to those skilled in the art by reference to specific embodiments.
Detailed Description
In the embodiments of the present application, the words "first", "second", and the like are used to distinguish between identical or similar items having substantially the same function and effect. For example, the first chip and the second chip are distinguished only as different chips, without limiting their order. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or order, nor do they indicate importance.
It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "for example" is intended to present related concepts in a concrete fashion.
In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, wherein A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein each of a, b, and c may be single or multiple.
With the new wave of information technology, the growing demand for industrial upgrading, and policies driving the informatization, digitization and intelligent transformation of the domestic manufacturing industry, the scale of the industrial Internet of Things is increasing rapidly. At the very front end of the industrial Internet of Things, monitoring and evaluating the operating condition of equipment often generates large volumes of time series data.
Time series data is a sequence of data recorded in time order; every item in the same sequence must be measured on the same basis so that the items are comparable. Time series data may be period data or point-in-time data. Compared with traditional relational data, such data has the following characteristics: time-dimension features such as acquisition time and order matter greatly to the value of the data; real-time monitoring continuously delivers new data points, so the data volume keeps growing over time; and the information density is low, so effective information can only be obtained by analyzing the monitored indexes over a long time span using big data techniques.
Time series data is often stored for long periods for offline data analysis. In industrial enterprise application scenarios, for example, it may be used to analyze faults to identify the main equipment faults; to analyze capacity to determine how to optimize configuration and improve production efficiency; to analyze energy consumption to determine how to reduce production costs; and to analyze potential safety hazards to reduce fault duration. It is also used for centralized monitoring of smart electricity meters, power grid and power generation equipment and the like in the power industry; for real-time monitoring of oil wells, transport pipelines, transport fleets and the like in the petrochemical industry; for monitoring real-time road conditions, intersection and checkpoint traffic data and the like in smart cities and parks; for monitoring transaction records, access records, ATMs, POS machines and the like in the financial industry; for monitoring building access control, vehicle management, manhole covers, electronic fences and the like in intelligent security; and for monitoring fire fighting, crowd gathering, hazardous chemicals, structural health, elevators and the like in emergency response.
However, in practical application scenarios, time series data often has various quality problems. Factors such as the reliability of the acquisition terminal and network transmission delay affect the usability of the data and disrupt data analysis. For example, if some data points are lost because of network conditions, the incomplete data makes it very difficult to learn data patterns and summarize rules; and when a terminal sensor malfunctions and returns erroneous data, the real operating condition of the equipment cannot be obtained.
For example, when monitoring the operating condition of a vehicle engine, data such as rotational speed, exhaust and fuel consumption should be acquired synchronously at fixed times, but it is often difficult to guarantee that all of the sensors are fully reliable. Erroneous data from some sensors can prevent data mining from finding the correct fuel consumption curve, and data loss or delay in some sensors at a given moment can prevent the data points from reflecting the relationships among the sensors, which reduces the value of the data as a whole.
Therefore, in existing analysis and mining based on time series data, the data often contains abnormal data that cannot be discovered and evaluated in advance, so that some assumptions made about the data deviate from reality and the accuracy of the analysis or mining results is often low.
In view of this, the embodiment of the present application provides a data processing method, which calculates four data quality indexes of integrity, consistency, timeliness and validity of time-series data through statistical analysis, so as to effectively reflect the availability of the whole data.
Fig. 1 shows a schematic view of a scenario to which an embodiment of the present application is applied. As shown in fig. 1, one or more terminal devices in the industrial internet of things send their own time series data to a server, and the server may process the time series data based on the data processing method provided in the embodiment of the present application.
Illustratively, the data processing method may include: acquiring time series data to be processed from a terminal device, wherein the time series data to be processed comprises N data points and N is an integer greater than 1; and determining, from the time series data to be processed, abnormal data points that meet a preset condition and marking the abnormal data points, wherein the preset condition is used to screen out data points exhibiting any one or more of the following anomalies: a missing timestamp, a null value, a time interval between adjacent data points that does not satisfy the acquisition interval condition, an abnormal value distribution, an abnormal value change speed distribution, or an abnormal value change acceleration distribution. The quality of the time series data is thus evaluated on the basis of multiple types of abnormal data, which gives effective feedback on the usability of the time series data and improves the accuracy of time series analysis or mining results.
The technical solutions of the embodiments of the present application are described in detail by specific examples below. The following embodiments may be combined with each other or implemented independently, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the method may include:
S201, acquiring time series data to be processed from the terminal device, wherein the time series data to be processed comprises N data points, and N is an integer greater than 1.
The to-be-processed time sequence data can be data collected in the industrial internet of things, such as environment detection data, equipment maintenance data, system log record data and the like. The data points in the pending time series data may include values and time stamps.
In a possible implementation, the data acquisition device may acquire the to-be-processed time series data from the terminal device periodically or based on an actual situation, for example, the actual situation may be a power failure restart, a fault, and the like. The data acquisition equipment can send the acquired to-be-processed time sequence data to a server cluster or a single computer combined with the cloud storage platform.
Illustratively, a sensor of the terminal device periodically acquires operation data of the terminal device and sends the operation data to the cloud server in real time.
It should be noted that the embodiment of the application can be applied to any industrial internet of things scenario in which data are acquired and analyzed at a fixed frequency and the values approximately follow a normal distribution, such as monitoring of ambient temperature and wind speed, quality inspection of manufactured products, and the like.
S202, determining, according to the to-be-processed time series data, abnormal data points that meet a preset condition and marking them; the preset condition is used to screen out data points with any one or more of the following anomalies: a timestamp missing anomaly, a null-value anomaly, a time interval between adjacent data points that does not meet the acquisition interval condition, a numerical distribution anomaly, a numerical change speed distribution anomaly, or a numerical change acceleration distribution anomaly.
The acquisition interval condition may be an allowable range for the time interval between two adjacent data points, for example between half the standard acquisition interval and twice the standard acquisition interval. A timestamp missing anomaly may be a data point that has a value but no timestamp. A null-value anomaly may be a data point that has a timestamp but no value. A time interval that does not meet the acquisition interval condition may be a time interval between adjacent data points that exceeds twice the standard acquisition interval or is less than half the standard acquisition interval. A numerical distribution anomaly may be a value whose distance from the mean of the value distribution exceeds K times the standard deviation of the values. A numerical change speed distribution anomaly may be a numerical change speed whose distance from the mean of its distribution exceeds K times the standard deviation of the numerical change speed. A numerical change acceleration distribution anomaly may be a numerical change acceleration whose distance from the mean of its distribution exceeds K times the standard deviation of the numerical change acceleration.
In a possible implementation, data points in the to-be-processed time series data that have a timestamp missing anomaly, a null-value anomaly, a time interval that does not meet the acquisition interval condition, a numerical distribution anomaly, a numerical change speed distribution anomaly or a numerical change acceleration distribution anomaly are screened out, and the screened abnormal points are marked.
In the embodiment of the application, to-be-processed time series data is acquired from a terminal device, where the to-be-processed time series data includes N data points and N is an integer greater than 1; abnormal data points that meet a preset condition are determined according to the to-be-processed time series data and marked, and the preset condition is used to screen out data points with any one or more of the following anomalies: a timestamp missing anomaly, a null-value anomaly, a time interval between adjacent data points that does not meet the acquisition interval condition, a numerical distribution anomaly, a numerical change speed distribution anomaly, or a numerical change acceleration distribution anomaly. The quality of the time series data is thus evaluated across multiple types of abnormal data, which gives effective feedback on the usability of the time series data and improves the accuracy of time series analysis or mining results.
Optionally, on the basis of the embodiment corresponding to fig. 2, in a possible implementation, determining abnormal data points that meet the preset condition according to the to-be-processed time series data and marking them includes: marking data points with a missing timestamp and data points whose value is null in the to-be-processed time series data as integrity abnormal data, and obtaining a first group of processed time series data, which is the to-be-processed time series data excluding the integrity abnormal data;
calculating the time interval between any two adjacent data points in the first group of processed time series data;
marking data points in the first group of processed time series data whose time intervals do not meet the acquisition interval condition as integrity abnormal data, timeliness abnormal data or consistency abnormal data; the integrity abnormal data includes adjacent data points whose time interval exceeds L times the standard acquisition interval; the timeliness abnormal data includes data points whose time interval is less than Q times the standard acquisition interval when at least one time interval exceeding L times the standard acquisition interval exists within the time window; and the consistency abnormal data includes data points whose time interval is less than Q times the standard acquisition interval when no time interval exceeding L times the standard acquisition interval exists within the time window, where L is a number greater than or equal to 2, and Q is a number greater than 0 and less than or equal to 1/2.
For example, the standard acquisition interval may be an average of the time intervals between adjacent data points in the time series data, or the standard acquisition interval may be an approximate median of the time intervals between adjacent data points in the time series data.
In a possible implementation, the timestamp and the value of each data point in the to-be-processed time series data are parsed, data points with a missing timestamp and data points whose value is null are screened out, and both kinds of data points are marked as integrity abnormal data. The time interval between any two adjacent data points in the first group of processed time series data is calculated. Data points or time periods in the first group of processed time series data whose time intervals do not meet the acquisition interval condition are marked as integrity abnormal data, timeliness abnormal data or consistency abnormal data.
Illustratively, the timestamp and the numerical value of each data point in the to-be-processed time series data are analyzed, the data point missing the timestamp and the data point with the numerical value being a null value are marked as integrity abnormal data, and a first set of processed time series data except the integrity abnormal data in the to-be-processed time series data is obtained.
A time interval between any current data point and a previous data point in the first set of processed time series data is calculated.
If the time interval between the current data point and the previous data point is more than 2 times the standard acquisition interval, it is determined that data points are missing within that interval, possibly because the server failed to acquire part of the data owing to equipment failure, network packet loss or other factors, and the current data point is marked as integrity abnormal data.
If the time interval between the current data point and the previous data point is less than half the standard acquisition interval, and a time interval between adjacent data points that is more than 2 times the standard acquisition interval exists within the time window (for example, among the 20 data points before and after the current data point), it is determined that the current data point is delayed or out of order, and the current data point is marked as timeliness abnormal data.
If the time interval between the current data point and the previous data point is less than half the standard acquisition interval, and no time interval between adjacent data points that is more than 2 times the standard acquisition interval exists within the time window (for example, among the 20 data points before and after the current data point), it is determined that the current data point is a redundant point, and the current data point is marked as consistency abnormal data.
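The three interval rules above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `classify_intervals`, the label strings and the assumption that timestamps are plain numbers listed in arrival order are all hypothetical, and the defaults L = 2, Q = 1/2 and a window of 20 points follow the example values in the text.

```python
from typing import List, Optional


def classify_intervals(
    timestamps: List[float],
    standard_interval: float,
    L: float = 2.0,
    Q: float = 0.5,
    window: int = 20,
) -> List[Optional[str]]:
    """Label each point by the gap to its predecessor (None means no interval anomaly)."""
    n = len(timestamps)
    # gaps[i] is the interval between point i and point i - 1; gaps[0] is undefined.
    gaps = [0.0] + [timestamps[i] - timestamps[i - 1] for i in range(1, n)]
    labels: List[Optional[str]] = [None] * n
    for i in range(1, n):
        if gaps[i] > L * standard_interval:
            # Points are missing between the previous point and this one.
            labels[i] = "integrity"
        elif gaps[i] < Q * standard_interval:
            # Look for an over-long gap among the neighbouring points.
            lo, hi = max(1, i - window), min(n - 1, i + window)
            has_long_gap = any(
                gaps[j] > L * standard_interval for j in range(lo, hi + 1)
            )
            labels[i] = "timeliness" if has_long_gap else "consistency"
    return labels


# A short gap next to a long gap is treated as a delayed point.
delayed = classify_intervals([0, 10, 20, 50, 52, 60], standard_interval=10)
# A short gap with no long gap nearby is treated as a redundant point.
redundant = classify_intervals([0, 10, 20, 21, 30, 40], standard_interval=10)
```

With these inputs, `delayed` labels the fourth point as "integrity" and the fifth as "timeliness", while `redundant` labels the fourth point as "consistency".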
In the embodiment of the application, the timestamp and the value of each data point in the to-be-processed time series data are checked for completeness, so that the first group of processed time series data consists of analyzable data. Based on whether each data point is incomplete and on the comparison between the time interval of two adjacent data points and the acquisition interval condition, the corresponding abnormal data points or time periods are marked as integrity abnormal data, consistency abnormal data or timeliness abnormal data. This prepares for the subsequent update of the data quality indexes and makes it easy to trace back and recheck abnormal data points, thereby assisting data mining work.
Optionally, marking the data points in the first group of processed time series data whose time intervals do not meet the acquisition interval condition as integrity abnormal data, timeliness abnormal data or consistency abnormal data includes:
marking adjacent data points with the time interval larger than L times of the standard acquisition interval as integrity abnormal data;
when obtaining the redundant data points with the time interval less than Q times of the standard acquisition interval, searching whether adjacent data points with the time interval exceeding L times of the standard acquisition interval exist in a time window of the redundant data points;
if adjacent data points with time intervals exceeding L times of the standard acquisition interval exist, the redundant data points are moved to the position between the adjacent data points with time intervals exceeding L times of the standard acquisition interval, and the redundant data points are marked as timeliness abnormal data;
if no adjacent data points with a time interval exceeding L times the standard acquisition interval exist within the time window, the redundant data points are marked as consistency abnormal data.
Redundant data points can be understood as data points with a time interval less than Q times the standard acquisition interval, or as data points with the same time stamp.
Exemplarily, fig. 3 shows an abnormal data diagram provided in an embodiment of the present application. As shown in fig. 3, there is a data point missing abnormality between time 10. As shown in fig. 3, the value of null exception occurring at time 10. Redundant data points appear at time 10. And a data point missing exception exists between the time 10.
In the embodiment of the application, based on the comparison between the time interval and the collection interval condition in the first group of processed time sequence data, the integrity abnormal data, the timeliness abnormal data or the consistency abnormal data in the first group of processed time sequence data are marked, and preparation is made for subsequent data quality index updating.
Optionally, the first group of processed time series data is repaired to obtain a second group of processed time series data, wherein the time interval between any two adjacent data points in the second group of processed time series data meets the acquisition interval condition;
and marking the validity abnormal data in the second group of processed time sequence data according to the distribution of the second group of processed time sequence data.
In the embodiment of the application, the first group of processed time series data is repaired to obtain the second group of processed time series data, in which the time interval between any two adjacent data points meets the acquisition interval condition, so that the validity of the values can be evaluated. The evaluation marks the validity abnormal data in the second group of processed time series data according to the distribution of that data, which may include the distribution of the values, the distribution of the numerical change speed and the distribution of the numerical change acceleration.
In the embodiment of the application, the first group of processed time series data is repaired, and the validity abnormal data is marked in the second group of processed time series data obtained after the repair. Performing the validity evaluation only after the first group of processed time series data has been reasonably repaired improves the quality of the to-be-processed time series data on the one hand and the reliability of the validity evaluation on the other, so that the usability of the time series data is effectively fed back and the accuracy of time series analysis or mining results is improved.
Optionally, the repairing the first group of processed time series data includes:
performing timestamp repair on the data points marked as timeliness abnormal data and the data points marked as consistency abnormal data in the first group of processed time series data, and performing interpolation repair on the data points marked as integrity abnormal data in the first group of processed time series data.
In a possible implementation, the data points marked as timeliness abnormal data in the first group of processed time series data are moved to fill the positions where data points are missing, so that the time period corresponding to the data point missing anomaly is restored to the standard acquisition interval. The data points marked as integrity abnormal data in the first group of processed time series data are repaired by interpolation; the method used may be multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation and the like.
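As one illustration of the interpolation repair, the sketch below fills over-long gaps with Lagrange interpolation, one of the methods named above. The function names, the choice of two supporting points on each side of the gap and the rule of inserting one point per standard acquisition interval are assumptions made for this example; the sketch also assumes that the supporting points have distinct timestamps, that is, redundant points have already been repaired.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def lagrange_interpolate(support: List[Point], t: float) -> float:
    """Evaluate the Lagrange polynomial through the support points at time t."""
    total = 0.0
    for i, (ti, vi) in enumerate(support):
        weight = 1.0
        for j, (tj, _) in enumerate(support):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += vi * weight
    return total


def fill_gaps(
    points: List[Point],
    standard_interval: float,
    L: float = 2.0,
    neighbours: int = 2,
) -> List[Point]:
    """Insert interpolated points wherever a gap exceeds L standard intervals."""
    repaired: List[Point] = []
    for i, (t, v) in enumerate(points):
        if i > 0:
            prev_t = points[i - 1][0]
            if t - prev_t > L * standard_interval:
                # Use up to `neighbours` points on each side of the gap.
                support = points[max(0, i - neighbours): i + neighbours]
                missing_t = prev_t + standard_interval
                # Stop before getting closer than half an interval to the next point.
                while missing_t < t - 0.5 * standard_interval:
                    value = lagrange_interpolate(support, missing_t)
                    repaired.append((missing_t, value))
                    missing_t += standard_interval
        repaired.append((t, v))
    return repaired


# Two points are missing between t = 10 and t = 40.
series = [(0.0, 0.0), (10.0, 10.0), (40.0, 40.0), (50.0, 50.0)]
filled = fill_gaps(series, standard_interval=10.0)
```

Because the sample series is linear, the two inserted points land at approximately (20, 20) and (30, 30). A high-degree polynomial can oscillate on noisy data, which is one reason to keep the number of supporting points small.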
In the embodiment of the application, the different kinds of abnormal data in the first group of processed time series data are repaired in a targeted manner, which improves the repair quality of the first group of processed time series data, allows the validity of the values to be evaluated subsequently, and effectively feeds back the usability of the time series data.
Optionally, marking validity abnormal data in the second group of processed time series data according to the distribution of the second group of processed time series data includes:
calculating the numerical distribution, the numerical change speed distribution and/or the numerical change acceleration distribution of data points in the second group of processed time series data;
and marking data points with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution in the second group of processed time series data.
For example, fig. 4 shows an abnormal data diagram provided in the embodiment of the present application. The validity abnormal data includes numerical distribution anomalies, numerical change speed distribution anomalies and numerical change acceleration distribution anomalies: when the value of a data point falls outside the threshold range of the value, the data point has a numerical distribution anomaly; when the numerical change speed of the data point falls outside the threshold range of the numerical change speed, the data point has a numerical change speed distribution anomaly; and when the numerical change acceleration of the data point falls outside the threshold range of the numerical change acceleration, the data point has a numerical change acceleration distribution anomaly.
It is understood that anomalies in all three dimensions (numerical distribution, numerical change speed distribution and numerical change acceleration distribution) may exist in one data point at the same time, but the data point is still counted as only one abnormal point. When a data point has an anomaly in at least one of the three dimensions, the data point is marked as validity abnormal data.
In the embodiment of the application, validity anomalies in the second group of processed time series data are evaluated in the three dimensions of numerical distribution, numerical change speed distribution and/or numerical change acceleration distribution, so that the validity index of the to-be-processed time series data can be calculated subsequently and the usability of the time series data is effectively fed back.
Optionally, marking data points of the second group of processed time series data with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution includes:
marking data points in the second group of processed time series data for which the absolute value of the difference between the value and the numerical average exceeds K times the numerical standard deviation as data points with a numerical distribution anomaly;
marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change speed and the numerical change speed average exceeds K times the numerical change speed standard deviation as data points with a numerical change speed distribution anomaly;
and marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change acceleration and the numerical change acceleration average exceeds K times the numerical change acceleration standard deviation as data points with a numerical change acceleration distribution anomaly.
In a possible implementation, it is calculated whether the absolute value of the difference between the value of a data point in the second group of processed time series data and the numerical average exceeds K times the numerical standard deviation; if so, the data point is determined to have a numerical distribution anomaly and is marked accordingly. Similarly, it is calculated whether the absolute value of the difference between the numerical change speed of the data point and the numerical change speed average exceeds K times the numerical change speed standard deviation; if so, the data point is determined to have a numerical change speed distribution anomaly and is marked accordingly. It is further calculated whether the absolute value of the difference between the numerical change acceleration of the data point and the numerical change acceleration average exceeds K times the numerical change acceleration standard deviation; if so, the data point is determined to have a numerical change acceleration distribution anomaly and is marked accordingly.
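The K-times-standard-deviation test in the three dimensions might look like the sketch below. The names `is_outlier` and `mark_validity_anomalies`, the layout of the `stats` dictionary and the default K = 3 are assumptions for illustration. The acceleration is taken here as the change in speed divided by the latest time interval, which is one possible reading of "acceleration between every three adjacent data points".

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def is_outlier(x: float, mean: float, std: float, K: float) -> bool:
    """True when x lies more than K standard deviations from the mean."""
    return abs(x - mean) > K * std


def mark_validity_anomalies(
    points: List[Point],
    stats: Dict[str, Tuple[float, float]],  # dimension -> (mean, std)
    K: float = 3.0,
) -> List[bool]:
    """Flag a point when any of value, speed or acceleration is an outlier."""
    flags: List[bool] = []
    prev_speed: Optional[float] = None
    for i, (t, v) in enumerate(points):
        v_mean, v_std = stats["value"]
        abnormal = is_outlier(v, v_mean, v_std, K)
        speed: Optional[float] = None
        if i >= 1:
            prev_t, prev_v = points[i - 1]
            dt = t - prev_t
            speed = (v - prev_v) / dt  # value change divided by time
            s_mean, s_std = stats["speed"]
            abnormal = abnormal or is_outlier(speed, s_mean, s_std, K)
            if prev_speed is not None:
                accel = (speed - prev_speed) / dt  # speed change divided by time
                a_mean, a_std = stats["acceleration"]
                abnormal = abnormal or is_outlier(accel, a_mean, a_std, K)
        prev_speed = speed
        flags.append(abnormal)  # counted once even if several dimensions fail
    return flags


stats = {"value": (10.0, 1.0), "speed": (0.0, 1.0), "acceleration": (0.0, 2.0)}
flags = mark_validity_anomalies(
    [(0, 10.0), (1, 10.5), (2, 30.0), (3, 10.8)], stats
)
```

In this example the spike at t = 2 is flagged by its value, and the point at t = 3 is also flagged because the drop back to normal produces an extreme change speed. A point is recorded as a single validity anomaly no matter how many dimensions trigger.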
In the embodiment of the application, numerical distribution anomalies, numerical change speed distribution anomalies and/or numerical change acceleration distribution anomalies in the second group of processed time series data are identified by combining the average and the standard deviation, so that the validity index of the to-be-processed time series data can be calculated subsequently and the usability of the time series data is effectively fed back.
Optionally, sample time series data from the terminal device is read, where the sample time series data includes M data points; calculating the approximate median of the time intervals among the M data points to obtain a standard acquisition interval; and calculating the numerical average value, the numerical standard deviation, the numerical change speed average value, the numerical change speed standard deviation, the numerical change acceleration average value and the numerical change acceleration standard deviation of the M data points.
The sample time series data may be a small portion of the to-be-processed time series data that contains no abnormal data points.
In a possible implementation, sample time series data from the terminal device is read, where the sample time series data includes M data points, and an approximate median of the time intervals between adjacent data points among the M data points is calculated using a sketch algorithm to obtain the standard acquisition interval. The numerical average and the numerical standard deviation of the M data points are calculated by fitting a normal distribution to the data. The change speed between every two adjacent data points among the M data points, namely the change in value divided by the elapsed time, is calculated, and the numerical change speed average and the numerical change speed standard deviation are then calculated by fitting a normal distribution. The numerical change acceleration across every three adjacent data points, together with the numerical change acceleration average and the numerical change acceleration standard deviation, is calculated in the same way.
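The initialization described above can be sketched in a few lines. Note the stated simplifications: the exact median from the Python standard library stands in for the approximate median produced by a sketch algorithm, the population standard deviation is used as the normal-distribution fit, and the function name `initialise_statistics` and the returned dictionary keys are hypothetical. The sample is assumed to contain at least three points with strictly increasing timestamps.

```python
import statistics
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def initialise_statistics(sample: List[Point]) -> Dict[str, object]:
    """Derive the standard interval and per-dimension (mean, std) from a clean sample."""
    times = [t for t, _ in sample]
    values = [v for _, v in sample]
    intervals = [b - a for a, b in zip(times, times[1:])]
    # Speed: value change divided by the time between two adjacent points.
    speeds = [
        (v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(sample, sample[1:])
    ]
    # Acceleration: speed change divided by the latest interval (three points).
    accels = [
        (s1 - s0) / dt for s0, s1, dt in zip(speeds, speeds[1:], intervals[1:])
    ]

    def fit(xs: List[float]) -> Tuple[float, float]:
        return statistics.mean(xs), statistics.pstdev(xs)

    return {
        # Exact median used here in place of an approximate (sketch-based) median.
        "standard_interval": statistics.median(intervals),
        "value": fit(values),
        "speed": fit(speeds),
        "acceleration": fit(accels),
    }


profile = initialise_statistics([(0, 1.0), (10, 2.0), (20, 4.0), (30, 7.0)])
```

For this sample the standard interval is 10 and the mean value is 3.5. A streaming system with very large M would replace `statistics.median` with a quantile sketch to bound memory.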
In the embodiment of the application, the statistical characteristics are initialized based on the sample time series data: the standard acquisition interval, the numerical average, the numerical standard deviation, the numerical change speed average, the numerical change speed standard deviation, the numerical change acceleration average and the numerical change acceleration standard deviation are calculated, and some intermediate results are retained to reduce the amount of computation in subsequent streaming processing.
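The text does not say which intermediate results are retained. One common way to keep a mean and standard deviation up to date without revisiting earlier data points is Welford's online algorithm, shown here purely as an assumed technique; the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Incrementally maintained mean and population standard deviation."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the retained intermediate results."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


value_stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    value_stats.update(x)
```

After these eight updates the running mean is 5 and the standard deviation is 2. One such object could be kept for each of the three dimensions and updated only with data points that show no validity anomaly.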
Optionally, the integrity index, the consistency index, the timeliness index, the validity index and/or the overall data quality index of the to-be-processed time series data are calculated according to the respective proportions of the data points marked as integrity abnormal data, consistency abnormal data, timeliness abnormal data and/or validity abnormal data in the to-be-processed time series data.
The integrity index mainly reflects the severity of data loss that occurs during acquisition, transmission and similar processes. It mainly covers data point missing anomalies and incomplete data anomalies. A data point missing anomaly is the loss of some data points because data could not be acquired owing to factors such as equipment failure or network packet loss. An incomplete data anomaly means that one of the two elements of a data point, the timestamp or the value, is missing because of information transmission errors or similar causes, so that the data point cannot be used normally.
The consistency index mainly reflects the degree of redundancy in the time series data. Within one time series, a given time point usually has only one acquisition result, but problems such as network delay, inaccurate clocks and retransmission can produce data points with the same timestamp, namely redundant data points.
The timeliness index mainly reflects how punctually the time series data are acquired and arrive at the storage end. Owing to factors such as network transmission lag, the arrival order and arrival intervals of data points may differ from those at acquisition.
The validity index mainly reflects the correctness of numerical values, time sequence data collected in an industrial scene usually conform to certain distribution rules or variation trends, and a data validity judgment criterion is specified according to specific data types and characteristics.
In a possible implementation, after a batch of time series data has been processed, or after the amount of streaming time series data received reaches a set value, the four data quality indexes are calculated from the anomalies found in the existing data points. The integrity index converts the various abnormal points into their impact on the time dimension: the proportion of the time covered by periods without abnormal data points to the total time span of all data points is used as the measure of overall integrity. The consistency index may be defined as the proportion of non-redundant data points to all data points. The timeliness index may be defined as the proportion of data points with no delay or disorder, that is, data points that arrive on time, to all data points. The validity index may be defined as the proportion of valid data points to all data points. The integrity index, the consistency index, the timeliness index and the validity index are normalized to obtain the overall data quality index, the abnormal points are stored appropriately, and the overall data quality is output.
Illustratively, the total duration of the periods without anomalies is divided by the total duration of the to-be-processed time series data to obtain the integrity index. For a data point with a null-value anomaly, the abnormal period is the time interval between that data point and the previous data point; for a timestamp missing anomaly or a data point missing anomaly, the abnormal period is the time interval between the data point before the missing position and the data point after it. The number of data points not marked as consistency abnormal data is divided by the total number of data points to obtain the consistency index; the number of data points not marked as timeliness abnormal data is divided by the total number of data points to obtain the timeliness index; and the number of data points not marked as validity abnormal data is divided by the total number of data points to obtain the validity index.
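The four ratios can be computed as in the sketch below. The function and parameter names are hypothetical. The text does not specify how the four indexes are normalized into the overall index, so the unweighted arithmetic mean used here is only an assumption.

```python
from typing import Dict


def quality_indexes(
    total_points: int,
    total_duration: float,
    abnormal_duration: float,
    consistency_anomalies: int,
    timeliness_anomalies: int,
    validity_anomalies: int,
) -> Dict[str, float]:
    """Compute the four quality indexes and an assumed overall index."""
    indexes = {
        # Share of the time span not covered by abnormal periods.
        "integrity": (total_duration - abnormal_duration) / total_duration,
        # Share of points not marked with each kind of anomaly.
        "consistency": (total_points - consistency_anomalies) / total_points,
        "timeliness": (total_points - timeliness_anomalies) / total_points,
        "validity": (total_points - validity_anomalies) / total_points,
    }
    # Assumption: the overall index is the plain average of the four indexes.
    indexes["overall"] = sum(indexes.values()) / 4
    return indexes


report = quality_indexes(
    total_points=100,
    total_duration=1000.0,
    abnormal_duration=50.0,
    consistency_anomalies=2,
    timeliness_anomalies=3,
    validity_anomalies=5,
)
```

Here the integrity index is 0.95, the consistency index 0.98, the timeliness index 0.97 and the validity index 0.95. A deployment that cares more about one dimension could replace the plain average with a weighted one.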
In the embodiment of the application, after the quality evaluation of the time series data is completed, the system also stores a data profile, which makes it easy to grasp the overall state of the time series data and to understand the patterns of the abnormal data points. The data profile includes information such as the standard acquisition interval of the time series data, the distribution parameters of the three dimensions (value, numerical change speed and numerical change acceleration), and the number of abnormal points of each kind.
In the embodiment of the application, the integrity index, the consistency index, the timeliness index, the validity index and/or the overall data quality index of the to-be-processed time series data are calculated and the data profile is stored, so that a user can quickly grasp the overall state of the data before carrying out data analysis or mining work, and can review and analyze typical problems in the data.
On the basis of the foregoing embodiments, in order to describe the technical solution provided by the embodiments of the present application more clearly, fig. 5 illustratively shows a schematic data processing flow diagram provided by the embodiments of the present application, which includes:
S501, initially reading in a batch of time series data and initializing each statistical index.
The statistical indexes may include the standard acquisition interval, the numerical average, the numerical standard deviation, the numerical change speed average, the numerical change speed standard deviation, the numerical change acceleration average, the numerical change acceleration standard deviation, and the like. For the specific implementation, reference may be made to the above embodiments; details are not repeated here.
S502, inputting a new data point.
To-be-processed time series data is acquired from the terminal device, where the to-be-processed time series data includes N data points and N is an integer greater than 1.
S503, checking whether the data point format is correct.
The timestamp and the value of the data point are parsed. If the data point format is correct, the flow proceeds to step S504. If the data point format is incorrect, one of the following applies:
S503a, null-value anomaly. If the value of the data point is missing or wrongly formatted, it is determined that the data point has a null-value anomaly.
S503b, timestamp missing anomaly. If the timestamp of the data point is missing or wrongly formatted, it is determined that the data point has a timestamp missing anomaly.
S504, calculating the time interval between the current data point and the previous data point, and comparing it with the standard acquisition interval.
S505, determining the anomaly type of the data point. For the specific implementation, reference may be made to the above embodiments; details are not repeated here. The data point anomaly types include:
S505a, data point redundancy anomaly.
S505b, data point out-of-order anomaly.
S505c, data point missing anomaly.
S506a, updating the consistency index. The consistency index is updated based on data point redundancy anomalies.
S506b, updating the timeliness index. The timeliness index is updated based on data point out-of-order anomalies.
S506c, updating the integrity index. The integrity index is updated based on null-value anomalies, timestamp missing anomalies and/or data point missing anomalies.
S507, repairing the timestamp of the data point.
The timestamp of the data point is repaired for data point redundancy anomalies and data point out-of-order anomalies. In one possible implementation, the value of the data point may also be repaired. For the specific implementation, reference may be made to the above embodiments; details are not repeated here.
S508, calculating the numerical change speed and the numerical change acceleration.
S509, determining whether a validity anomaly exists.
Whether the data point has a validity anomaly is determined based on its value, numerical change speed and numerical change acceleration, together with the statistical indexes of each of these three quantities. For the specific implementation, reference may be made to the above embodiments; details are not repeated here.
If the data point has a validity anomaly, step S511 is performed; if the data point has no validity anomaly, steps S510 to S511 are performed. Validity anomalies include: numerical distribution anomalies, numerical change speed distribution anomalies and numerical change acceleration distribution anomalies.
And S510, updating three dimension distribution parameters.
And updating the numerical value, the numerical value change speed and the statistical index of the numerical value change acceleration of the data point based on the data point without the effectiveness abnormity.
And S511, updating the effectiveness index.
Updating the validity indicator for the data point based on the validity anomaly for the data point. For specific implementation, reference may be made to the above embodiments, which are not described herein again.
And S512, completing quality analysis of the data points.
And normalizing the integrity index, the consistency index, the timeliness index and the effectiveness index of the data point to obtain the overall data quality index, and completing the quality analysis of the data point.
In the embodiment of the application, the data quality index is calculated based on multiple abnormal data types of the time sequence data, so that the quality of the time sequence data is evaluated, the availability of the time sequence data can be effectively fed back, and the accuracy of time sequence data analysis or mining results is improved.
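Steps S508 to S511 can be sketched as a single streaming pass. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the names `RunningStats` and `validity_flags`, the threshold `k`, the warm-up count `min_samples`, the use of Welford's running mean and variance, and the choice to compute speed and acceleration against the last point without a validity anomaly are all hypothetical. Timestamps are assumed to be strictly increasing, as they would be after the timestamp repair of S507.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and standard deviation (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_outlier(self, x: float, k: float, min_samples: int) -> bool:
        # Too few samples: no reliable spread yet, so never flag.
        if self.n < min_samples:
            return False
        std = math.sqrt(self.m2 / self.n)
        return abs(x - self.mean) > k * std


def validity_flags(points, k=3.0, min_samples=5):
    """points: list of (timestamp, value); returns one bool per point."""
    value_s, speed_s, accel_s = RunningStats(), RunningStats(), RunningStats()
    flags = []
    ref = None  # (timestamp, value, speed) of the last point without an anomaly
    for t, v in points:
        speed = accel = None
        if ref is not None:
            dt = t - ref[0]
            speed = (v - ref[1]) / dt               # S508: change speed
            if ref[2] is not None:
                accel = (speed - ref[2]) / dt       # S508: change acceleration
        # S509: compare each dimension with its own statistics.
        abnormal = value_s.is_outlier(v, k, min_samples)
        if speed is not None and speed_s.is_outlier(speed, k, min_samples):
            abnormal = True
        if accel is not None and accel_s.is_outlier(accel, k, min_samples):
            abnormal = True
        flags.append(abnormal)
        if not abnormal:
            # S510: only points without a validity anomaly update the statistics.
            value_s.update(v)
            if speed is not None:
                speed_s.update(speed)
            if accel is not None:
                accel_s.update(accel)
            ref = (t, v, speed)
    return flags
```

Skipping the update for abnormal points keeps a single spike from inflating the standard deviations and masking later anomalies, which is the apparent purpose of routing abnormal points directly to S511.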
On the basis of the foregoing embodiments, as an example, Fig. 6 shows a schematic flow diagram of an Internet of Things time series data quality analysis system provided in the embodiment of the present application, where the system includes:
S601, accessing a cloud platform and reading the time series data acquired by the Internet of Things.
The cloud platform that stores the Internet of Things time series data is accessed, and the acquired time series data is read from the cloud platform.
S602, reading one month of time series data.
S603, evaluating the quality of the time series data and calculating the data quality indexes.
Based on the sensor data collected within one month, the embodiment of the present application evaluates the data quality and calculates the data quality indexes using the steps of the foregoing embodiments, and retains intermediate results to support the calculations of subsequent steps.
S604, storing abnormal data point information.
To facilitate the review and analysis of typical data quality problems in time series data, the embodiment of the present application stores information related to abnormal points in the cloud platform database, including the data points in the windows before and after each abnormal point and the data profile indexes at that time. This forms an analysis tool that supports combined queries by data quality index, abnormal data point type and abnormal data point time.
S605, aggregating the data quality indexes.
After the data quality index calculation for a single month is completed, the embodiment of the present application records the result for the current month and, together with previous results, forms a data quality change table, so that long-period changes in data quality can be analyzed and, in a production environment, equipment can be checked as early as possible to find the cause of a fault. In addition to the monthly indexes, the embodiment of the present application aggregates quarterly and yearly data quality and profile indexes from the intermediate calculation results of each month, in order to grasp the data quality situation over longer periods.
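The aggregation in S605 can be sketched as follows, assuming that the retained monthly intermediate results are per-type anomaly counts plus a total data point count. The function name `aggregate_ratios`, the dictionary layout and the key names are hypothetical and are not taken from the embodiments.

```python
from collections import defaultdict


def aggregate_ratios(monthly_counts, period="quarter"):
    """Roll monthly anomaly counts up to quarters or years.

    monthly_counts maps (year, month) to a dict holding a 'total' data point
    count and one count per anomaly type (hypothetical layout).
    Returns, per period, the proportion of data points of each anomaly type.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for (year, month), counts in monthly_counts.items():
        key = (year, (month - 1) // 3 + 1) if period == "quarter" else (year,)
        for name, count in counts.items():
            totals[key][name] += count
    ratios = {}
    for key, counts in totals.items():
        n = counts["total"]
        ratios[key] = {
            name: count / n
            for name, count in counts.items()
            if name != "total" and n > 0
        }
    return ratios
```

Summing counts before dividing, rather than averaging the monthly ratios, weights each month by the amount of data it contains, so a month with few data points does not dominate the quarterly or yearly figure.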
In the embodiment of the present application, four data quality indexes are designed according to the main characteristics of time series data in industrial Internet of Things scenarios. Each index covers several typical anomalies in time series data, and together the indexes give a quick overview of the overall data quality. Several important characteristics of the time series data are calculated with statistical methods, including the standard acquisition interval, the numerical distribution, the numerical change speed distribution and the numerical change acceleration distribution. These characteristics are used to judge whether data points are abnormal and to calculate the four data quality indexes of the overall data. A long-period data quality index curve and an analysis tool are then formed, which effectively reflect the usability of the overall data, support backtracking and review of abnormal points, and thereby assist data mining work.
It should be understood that the above examples are illustrative and are not to be construed as specifically limiting the embodiments of the present application.
Fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 7, the data processing apparatus 70 includes: an acquisition module 701, a first determination module 702, a repair module 703, a marking module 704, a reading module 705, a first calculation module 706, a second calculation module 707, and a second determination module 708.
The acquisition module 701 is configured to obtain to-be-processed time series data from a terminal device, where the to-be-processed time series data includes N data points, and N is an integer greater than 1.
A first determining module 702, configured to determine, according to the to-be-processed time series data, abnormal data points that meet a preset condition and mark the abnormal data points; the preset condition is used for screening out data points with any one or more of the following anomalies: timestamp missing anomaly, null-value anomaly, time interval between adjacent data points not meeting the acquisition interval condition, numerical distribution anomaly, numerical change speed distribution anomaly, or numerical change acceleration distribution anomaly.
Optionally, the first determining module 702 is specifically configured to:
marking data points with missing timestamps in the time sequence data to be processed and data points with numerical values of null values as integrity abnormal data, and obtaining a first group of processed time sequence data except the integrity abnormal data in the time sequence data to be processed;
calculating the time interval of any two adjacent data points in the first group of processed time series data;
marking data points in the first group of processed time series data whose time intervals do not meet the acquisition interval condition as integrity abnormal data, timeliness abnormal data or consistency abnormal data; the integrity abnormal data comprises adjacent data points whose time interval exceeds L times the standard acquisition interval, the timeliness abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which at least one time interval exceeding L times the standard acquisition interval exists in a time window, and the consistency abnormal data comprises data points whose time interval is less than Q times the standard acquisition interval and for which no time interval exceeding L times the standard acquisition interval exists in the time window, wherein L is a number greater than or equal to 2, and Q is a number greater than 0 and less than or equal to 1/2.
Optionally, the first determining module 702 is further specifically configured to:
marking adjacent data points with the time interval larger than L times of the standard acquisition interval as integrity abnormal data;
when obtaining the redundant data points with the time interval less than Q times of the standard acquisition interval, searching whether adjacent data points with the time interval exceeding L times of the standard acquisition interval exist in a time window of the redundant data points;
if adjacent data points with the time interval exceeding L times of the standard acquisition interval exist, the redundant data points are moved to the position between the adjacent data points with the time interval exceeding L times of the standard acquisition interval, and the redundant data points are marked as timeliness abnormal data;
if there are no adjacent data points within the time window whose time interval exceeds L times the standard acquisition interval, the redundant data points are marked as consistency abnormal data.
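The interval rules above can be sketched as a small classifier. This is an illustration under assumptions, not the implementation of the first determining module 702: the function name `classify_intervals` is hypothetical, the time window is modelled as a fixed number of neighbouring intervals on each side (the text does not fix its size), and the step of moving a late data point into the gap is left out.

```python
def classify_intervals(timestamps, std_interval, L=2.0, Q=0.5, window=5):
    """Label each data point as None, 'integrity', 'timeliness' or 'consistency'.

    timestamps: timestamps in arrival order.
    std_interval: the standard acquisition interval.
    window: number of neighbouring intervals searched on each side (assumed).
    """
    n = len(timestamps)
    gaps = [timestamps[i + 1] - timestamps[i] for i in range(n - 1)]
    labels = [None] * n

    # An interval longer than L standard intervals: both neighbours bound a hole.
    for i, g in enumerate(gaps):
        if g > L * std_interval:
            labels[i] = labels[i + 1] = "integrity"

    # An interval shorter than Q standard intervals: the later point is redundant.
    for i, g in enumerate(gaps):
        if g < Q * std_interval:
            lo, hi = max(0, i - window), min(len(gaps), i + window + 1)
            has_hole = any(x > L * std_interval for x in gaps[lo:hi])
            # A nearby hole suggests a point that arrived late (timeliness);
            # otherwise the point is a plain duplicate (consistency).
            labels[i + 1] = "timeliness" if has_hole else "consistency"
    return labels
```

With a standard acquisition interval of 10, the timestamps `[0, 10, 40, 41, 50]` contain a hole between 10 and 40 and a point at 41 that follows its predecessor too closely, so the point at 41 is labelled as timeliness abnormal rather than consistency abnormal.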
Optionally, the data processing apparatus 70 further includes:
a repair module 703, configured to repair the first group of processed time series data to obtain a second group of processed time series data, where the time interval between any two adjacent data points in the second group of processed time series data meets the acquisition interval condition;
a marking module 704, configured to mark validity abnormal data in the second group of processed time series data according to the distribution of the second group of processed time series data.
Optionally, the repair module 703 is specifically configured to:
performing timestamp repair on the data points marked as timeliness abnormal data and the data points marked as consistency abnormal data in the first group of processed time series data, and performing interpolation repair on the data points marked as integrity abnormal data in the first group of processed time series data.
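The interpolation part of the repair can be sketched as follows. The text only states that interpolation repair is performed; the use of linear interpolation, the function name `fill_gaps` and the placement of the inserted points on multiples of the standard acquisition interval are assumptions made for illustration. Timestamp repair of timeliness and consistency abnormal data is not shown.

```python
def fill_gaps(points, std_interval, L=2.0):
    """Insert linearly interpolated points into intervals longer than L * std_interval.

    points: list of (timestamp, value) sorted by timestamp.
    Returns a new list; the input is not modified.
    """
    if not points:
        return []
    repaired = [points[0]]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        gap = t1 - t0
        if gap > L * std_interval:
            # Number of data points that should have been collected in the hole.
            missing = round(gap / std_interval) - 1
            for j in range(1, missing + 1):
                t = t0 + j * std_interval
                if t >= t1:
                    break
                v = v0 + (v1 - v0) * (t - t0) / gap  # linear interpolation (assumed)
                repaired.append((t, v))
        repaired.append((t1, v1))
    return repaired
```

After this pass, every interval that previously exceeded L times the standard acquisition interval has been split into intervals close to the standard acquisition interval, which is what the acquisition interval condition on the second group of processed time series data requires.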
Optionally, the marking module 704 is specifically configured to:
calculating the numerical value distribution, the numerical value change speed distribution and/or the numerical value change acceleration distribution of data points in the second group of processed time sequence data;
and marking data points with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution in the second group of processed time series data.
Optionally, the marking module 704 is further specifically configured to:
marking data points in the second group of processed time series data for which the absolute value of the difference between the value and the numerical average exceeds K times the numerical standard deviation as data points with abnormal numerical distribution;
marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change speed and the average numerical change speed exceeds K times the numerical change speed standard deviation as data points with abnormal numerical change speed distribution;
and marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change acceleration and the average numerical change acceleration exceeds K times the numerical change acceleration standard deviation as data points with abnormal numerical change acceleration distribution.
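Written as inequalities, the three criteria read as follows. The symbols are notation introduced here for illustration: x_i is the value of the i-th data point at time t_i, v_i its numerical change speed, a_i its numerical change acceleration, and the μ and σ terms are the corresponding averages and standard deviations. This passage does not give formulas for v_i and a_i, so the finite differences in the first line are an assumption.

```latex
\begin{aligned}
v_i &= \frac{x_i - x_{i-1}}{t_i - t_{i-1}}, \qquad
a_i = \frac{v_i - v_{i-1}}{t_i - t_{i-1}}
  && \text{(assumed finite differences)} \\
\lvert x_i - \mu_x \rvert &> K\,\sigma_x
  && \text{(abnormal numerical distribution)} \\
\lvert v_i - \mu_v \rvert &> K\,\sigma_v
  && \text{(abnormal numerical change speed distribution)} \\
\lvert a_i - \mu_a \rvert &> K\,\sigma_a
  && \text{(abnormal numerical change acceleration distribution)}
\end{aligned}
```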
Optionally, the data processing apparatus 70 further includes:
a reading module 705, configured to read sample timing data from a terminal device, where the sample timing data includes M data points;
a first calculating module 706, configured to calculate an approximate median of time intervals between the M data points, so as to obtain a standard acquisition interval;
the second calculating module 707 is configured to calculate a numerical average value, a numerical standard deviation, a numerical change speed average value, a numerical change speed standard deviation, a numerical change acceleration average value, and a numerical change acceleration standard deviation of the M data points.
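The work of the first and second calculating modules can be sketched on a small sample. This is an illustration under assumptions: the function name `profile_sample` and the returned dictionary keys are hypothetical, the exact median from Python's `statistics` module stands in for the approximate median named in the text, population standard deviations are used, and speed and acceleration are taken as finite differences over strictly increasing timestamps.

```python
import statistics


def profile_sample(points):
    """Compute the standard acquisition interval and the three (mean, std) pairs.

    points: list of (timestamp, value) with strictly increasing timestamps.
    """
    if len(points) < 3:
        raise ValueError("need at least three sample points")
    ts = [t for t, _ in points]
    vs = [v for _, v in points]
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    # Change speed between neighbours, then change of speed between neighbours.
    speeds = [(vs[i + 1] - vs[i]) / intervals[i] for i in range(len(intervals))]
    accels = [
        (speeds[i + 1] - speeds[i]) / intervals[i + 1]
        for i in range(len(speeds) - 1)
    ]

    def mean_std(xs):
        return statistics.mean(xs), statistics.pstdev(xs)

    return {
        # Exact median here; the text calls for an approximate median.
        "standard_interval": statistics.median(intervals),
        "value": mean_std(vs),
        "speed": mean_std(speeds),
        "acceleration": mean_std(accels),
    }
```

Taking the median rather than the mean of the intervals keeps a few long gaps or near-duplicate points in the sample from shifting the standard acquisition interval.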
Optionally, the data processing apparatus 70 further includes:
a second determining module 708, configured to calculate an integrity index, a consistency index, a timeliness index, a validity index, and/or an overall data quality index of the to-be-processed time series data according to the respective proportions of the data points marked as integrity abnormal data, consistency abnormal data, timeliness abnormal data, and/or validity abnormal data in the to-be-processed time series data.
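One way to turn the proportions into indexes is sketched below. The text states only that the indexes follow from the proportions of marked data points; defining each index as one minus its proportion, taking the overall index as the plain average of the four, and giving each data point at most one label are assumptions, and the name `quality_indexes` is hypothetical.

```python
from collections import Counter

KINDS = ("integrity", "consistency", "timeliness", "validity")


def quality_indexes(labels):
    """labels: one entry per data point, either None or one of KINDS."""
    total = len(labels)
    if total == 0:
        raise ValueError("no data points")
    counts = Counter(label for label in labels if label is not None)
    # Assumed definition: index = 1 - proportion of points with that anomaly.
    indexes = {kind: 1.0 - counts[kind] / total for kind in KINDS}
    # Assumed normalization: unweighted average of the four indexes.
    indexes["overall"] = sum(indexes[kind] for kind in KINDS) / len(KINDS)
    return indexes
```

Under these assumptions every index lies between 0 and 1, and a value of 1 means that no data point carries the corresponding anomaly.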
The data processing apparatus provided in the embodiment of the present application may be configured to execute the method embodiments described above, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 8, a data processing apparatus 80 provided in the embodiment of the present application includes: at least one processor 801 and a memory 802. The data processing apparatus 80 further comprises a communication component 803. The processor 801, the memory 802, and the communication component 803 are connected by a bus 804.
In particular implementations, at least one processor 801 executes computer-executable instructions stored by memory 802, causing at least one processor 801 to perform a data processing method as performed by data processing apparatus 80 above.
For the specific implementation process of the processor 801, reference may be made to the above method embodiments; the implementation principles and technical effects are similar, and details are not described herein again.
In the embodiment shown in fig. 8, it should be understood that the processor 801 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor. The memory 802 may include a high-speed Random Access Memory (RAM), a Non-volatile Memory (NVM), at least one disk memory, a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The embodiment of the present application further provides a storage medium in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the data processing method is implemented. The storage medium may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Embodiments of the present application further provide a program product, such as a computer program, which when executed by a processor, implements the data processing method covered by the embodiments of the present application.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present application, and are not limited thereto; although the embodiments of the present application have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the embodiments of the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring time sequence data to be processed from terminal equipment, wherein the time sequence data to be processed comprises N data points, and N is an integer greater than 1;
determining, according to the time series data to be processed, abnormal data points meeting a preset condition and marking the abnormal data points; the preset condition is used for screening out data points with any one or more of the following anomalies: timestamp missing anomaly, null-value anomaly, time interval between adjacent data points not meeting the acquisition interval condition, numerical distribution anomaly, numerical change speed distribution anomaly, or numerical change acceleration distribution anomaly.
2. The method according to claim 1, wherein the determining abnormal data points meeting a preset condition and marking the abnormal data points according to the to-be-processed time series data comprises:
marking data points with missing timestamps in the time sequence data to be processed and data points with numerical values of null values as integrity abnormal data, and obtaining a first group of processed time sequence data except the integrity abnormal data in the time sequence data to be processed;
calculating the time interval of any two adjacent data points in the first set of processed time series data;
marking data points in the first set of processed time series data whose time intervals do not meet the acquisition interval condition as the integrity abnormal data, timeliness abnormal data or consistency abnormal data; the integrity abnormal data comprises adjacent data points whose time interval exceeds L times a standard acquisition interval, the timeliness abnormal data comprises data points whose time interval is smaller than Q times the standard acquisition interval and for which at least one time interval exceeding L times the standard acquisition interval exists in a time window, and the consistency abnormal data comprises data points whose time interval is smaller than Q times the standard acquisition interval and for which no time interval exceeding L times the standard acquisition interval exists in the time window, wherein L is a number larger than or equal to 2, and Q is a number larger than 0 and smaller than or equal to 1/2.
3. The method of claim 2, wherein said marking data points in the first set of processed time series data whose time intervals do not meet the acquisition interval condition as the integrity abnormal data, timeliness abnormal data or consistency abnormal data comprises:
marking adjacent data points whose time interval is greater than L times the standard acquisition interval as the integrity abnormal data;
when obtaining the redundant data points with the time interval smaller than the Q times of the standard acquisition interval, searching whether adjacent data points with the time interval exceeding the L times of the standard acquisition interval exist in a time window of the redundant data points;
if adjacent data points exist, the time interval of which exceeds the L times of the standard acquisition interval, the redundant data points are moved to the position between the adjacent data points, the time interval of which exceeds the L times of the standard acquisition interval, and the redundant data points are marked as the timeliness abnormal data;
and if no adjacent data point with the time interval exceeding the L times of the standard acquisition interval exists in the time window, marking the redundant data point as the consistency abnormal data.
4. The method of claim 3, further comprising:
repairing the first group of processed time series data to obtain a second group of processed time series data, wherein the time interval between any two adjacent data points in the second group of processed time series data meets the acquisition interval condition;
and marking the validity abnormal data in the second group of processed time sequence data according to the distribution of the second group of processed time sequence data.
5. The method of claim 4, wherein the repairing the first set of processed time series data comprises:
performing timestamp repair on the data points marked as the timeliness abnormal data and the data points marked as the consistency abnormal data in the first group of processed time series data, and performing interpolation repair on the data points marked as the integrity abnormal data in the first group of processed time series data.
6. The method of claim 4, wherein the marking of the validity anomaly data in the second set of post-processing time-series data according to the distribution of the second set of post-processing time-series data comprises:
calculating the numerical value distribution, the numerical value change speed distribution and/or the numerical value change acceleration distribution of data points in the second group of processed time sequence data;
and marking data points with abnormal value distribution, abnormal value change speed distribution and/or abnormal value change acceleration distribution in the second group of processed time series data.
7. The method of claim 6, wherein the marking data points with abnormal numerical distribution, abnormal numerical change speed distribution and/or abnormal numerical change acceleration distribution in the second group of processed time series data comprises:
marking data points in the second group of processed time series data for which the absolute value of the difference between the value and the numerical average exceeds K times the numerical standard deviation as data points with abnormal numerical distribution;
marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change speed and the average numerical change speed exceeds K times the numerical change speed standard deviation as data points with abnormal numerical change speed distribution;
and marking data points in the second group of processed time series data for which the absolute value of the difference between the numerical change acceleration and the average numerical change acceleration exceeds K times the numerical change acceleration standard deviation as data points with abnormal numerical change acceleration distribution.
8. The method of claim 7, further comprising:
reading sample time sequence data from the terminal equipment, wherein the sample time sequence data comprises M data points;
calculating the approximate median of the time intervals among the M data points to obtain a standard acquisition interval;
and calculating the numerical average value, the numerical standard deviation, the numerical change speed average value, the numerical change speed standard deviation, the numerical change acceleration average value and the numerical change acceleration standard deviation of the M data points.
9. A data processing apparatus, characterized by comprising: at least one processor and memory;
the memory stores computer-executable instructions;
execution of the computer-executable instructions stored by the memory by the at least one processor causes the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, perform the method of any one of claims 1-8.
CN202210663208.6A 2022-06-13 2022-06-13 Data processing method and device Pending CN115185932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663208.6A CN115185932A (en) 2022-06-13 2022-06-13 Data processing method and device

Publications (1)

Publication Number Publication Date
CN115185932A true CN115185932A (en) 2022-10-14

Family

ID=83512926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663208.6A Pending CN115185932A (en) 2022-06-13 2022-06-13 Data processing method and device

Country Status (1)

Country Link
CN (1) CN115185932A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108086A (en) * 2023-02-27 2023-05-12 广州汇通国信科技有限公司 Time sequence data evaluation method and device, electronic equipment and storage medium
CN116108086B (en) * 2023-02-27 2023-09-26 广州汇通国信科技有限公司 Time sequence data evaluation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination