WO2021238455A1 - Data processing method and device, and computer-readable storage medium

Info

Publication number: WO2021238455A1
Application number: PCT/CN2021/086644
Authority: WO
Inventors: 蒋勇; 彭鑫; 叶德忠
Original assignee: 中兴通讯股份有限公司
Priority date: 2020-05-29
Filing date: 2021-04-12
Publication date: 2021-12-02
Also published as: CN113742387A

Abstract

A data processing method and device, and a computer-readable storage medium. The data processing method comprises: acquiring a target data sequence (S100); acquiring a first anomalous data segment from the target data sequence (S200); acquiring a first data search space from the target data sequence (S300); acquiring, according to the first anomalous data segment and from the first data search space, a second anomalous data segment corresponding to the first anomalous data segment (S400); and labeling the second anomalous data segment (S500).

Description

Data processing method, device, and computer-readable storage medium

Cross-references to related applications

This application is filed based on a Chinese patent application with an application number of 202010473617.0 and an application date of May 29, 2020, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated by reference into this application.

Technical field

The embodiments of the present invention relate to, but are not limited to, the field of information processing technology, and in particular, to a data processing method, device, and computer-readable storage medium.

Background

With the development of big data and artificial intelligence technology, more and more intelligent and efficient machine learning technologies have been introduced into the operation and maintenance of communication networks, such as indicator anomaly perception, trend prediction, and fault root cause analysis. These technologies usually rely on high-quality training data sets to achieve good results, and reliable label data is an essential part of a high-quality training data set. Moreover, the great success of deep learning in image and speech recognition in recent years is inseparable from label data sets produced by large amounts of manual labeling.

However, manually labeling huge training data sets is very expensive and consumes a great deal of human resources and time. For example, a medium-scale network produces millions of time series. Manually labeling all the abnormal data in these series is practically impossible, and even when empirical formulas are used for auxiliary labeling, the results are inaccurate and incomplete. Therefore, how to improve the efficiency of labeling abnormal data is a technical problem to be solved urgently.

Summary of the invention

The following is an overview of the topics detailed in this article. This summary is not intended to limit the scope of protection of the claims.

In the first aspect, embodiments of the present invention provide a data processing method, device, and computer-readable storage medium, which at least solve the above technical problems to a certain extent.

In a second aspect, an embodiment of the present invention provides a data processing method, including: obtaining a target data sequence; obtaining a first abnormal data segment in the target data sequence; obtaining a first data search space in the target data sequence; acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space; and labeling the second abnormal data segment.

In a third aspect, an embodiment of the present invention also provides a device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the data processing method of the second aspect described above is implemented.

In a fourth aspect, an embodiment of the present invention also provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the above-mentioned data processing method.

Other features and advantages of the present invention will be described in the following description, and partly become obvious from the description, or understood by implementing the present invention. The purpose and other advantages of the present invention can be realized and obtained through the structures specifically pointed out in the specification, claims and drawings.

Description of the drawings

The accompanying drawings are used to provide a further understanding of the technical solution of the present invention, and constitute a part of the specification. Together with the embodiments of the present invention, they are used to explain the technical solution of the present invention, and do not constitute a limitation to the technical solution of the present invention.

FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 4 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 5 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 6 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 7 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 8 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 9 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 10 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 11 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 12 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 13 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 14 is a flowchart of a data processing method provided by another embodiment of the present invention;

FIG. 15 is a flowchart of a heuristic algorithm provided by an embodiment of the present invention;

FIG. 16 is a flowchart of a heuristic algorithm provided by another embodiment of the present invention;

FIG. 17 is a main flow diagram of a data processing method provided by another embodiment of the present invention.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the following further describes the present invention in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, but not used to limit the present invention.

It should be noted that although functional modules are divided in the device schematic diagram and a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed with a module division different from that in the device, or in an order different from that in the flowchart. The terms "first", "second", etc. in the specification, claims, or the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

Regarding the data collected during network operation, such as time series indicator data, most of it has certain periodic characteristics; many recurring abnormal data tend to appear at the same position in different cycles, and the shapes of the abnormal data show a certain similarity. In particular, in long-duration time series indicator data with a large amount of abnormal data, most of the abnormal data can be attributed to a few types of abnormalities with similar characteristics, and the number of truly unique abnormal data is relatively small. In addition, relatively similar abnormal data also appear across different time series indicator data of the same kind. For example, a network element has abnormally high central processing unit (CPU) utilization during a certain period of time; the same situation may also appear in the CPU utilization time series data of another network element that undertakes similar services.

Based on the foregoing, the embodiments of the present invention provide a data processing method, device, and computer-readable storage medium. Exploiting the periodic characteristics of the abnormal data that recurs in most data, a first abnormal data segment and a first data search space are obtained in the target data sequence, so that the first abnormal data segment can be used as an abnormal data segment template and the corresponding second abnormal data segment can be obtained and labeled in the first data search space according to that template; that is, the other abnormal data segments in the target data sequence are labeled. Therefore, for time series indicator data with a large amount of abnormal data but few abnormality types, the solution provided by the embodiments of the present invention can improve the efficiency of labeling abnormal data compared with traditional manual labeling of abnormal data segments, thereby saving human resources and time.

The embodiments of the present invention will be further described below in conjunction with the accompanying drawings.

As shown in FIG. 1, FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method provided by an embodiment of the present invention.

In the example of FIG. 1, the system architecture platform includes a memory 110 and a processor 120, where the memory 110 and the processor 120 may be connected by a bus or in other ways. In FIG. 1, the connection by a bus is taken as an example.

As a non-transitory computer-readable storage medium, the memory 110 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 110 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory 110 includes memories remotely arranged with respect to the processor 120, and these remote memories may be connected to the system architecture platform through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

Those skilled in the art can understand that the system architecture platform can be applied to various network controllers or network managers, which is not specifically limited in this embodiment. In addition, a network controller or network manager with the system architecture platform can be applied to various network systems, for example 3G communication network systems, LTE communication network systems, 5G communication network systems, and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.

Those skilled in the art can understand that the system architecture platform shown in FIG. 1 does not constitute a limitation to the embodiment of the present invention, and may include more or fewer components than those shown in the figure, a combination of certain components, or a different arrangement of components.

In the system architecture platform shown in FIG. 1, the processor 120 can call the data processing program stored in the memory 110 to execute the data processing method.

Based on the foregoing system architecture platform, various embodiments of the data processing method of the present invention are presented below.

As shown in FIG. 2, FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention. The data processing method includes but is not limited to step S100, step S200, step S300, step S400, and step S500.

Step S100: Obtain the target data sequence.

In one embodiment, the target data sequence may be time series indicator data or other sequence data. The other sequence data may be non-time-series indicator data such as business type sequence data or business quantity sequence data, which is not specifically limited in this embodiment. In addition, the target data sequence can be obtained automatically from the network by the device having the above-mentioned system architecture platform, or it can be entered into that device manually, which is not specifically limited in this embodiment.

Step S200: Acquire the first abnormal data segment in the target data sequence.

In one embodiment, the first abnormal data segment is a data segment containing abnormal data in the target data sequence. The first abnormal data segment can be determined and selected manually and then entered into the device having the above-mentioned system architecture platform, so that the device obtains the first abnormal data segment; alternatively, it can be stored in the memory of the device so that the device obtains it from the memory. Once the first abnormal data segment in the target data sequence has been obtained, it provides the necessary basis for labeling the remaining abnormal data segments in the target data sequence in the subsequent steps.

It is worth noting that in time series indicator data, abnormal data often occurs at one or more consecutive time points. A time point at which abnormal data occurs is called an abnormal time point, and the data segment corresponding to a set of abnormal time points is called an abnormal data segment. An abnormal data segment may last a long time (that is, it may contain many abnormal time points). Therefore, an abnormal data segment has at least the following characteristics: it has a starting time point and an ending time point, it includes at least 3 time points, and abnormal data segments do not overlap one another at any time point.
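The characteristics listed above can be captured in a small data structure. The following is a minimal sketch, assuming that time points are represented as integer indices into the target data sequence; the names `AbnormalSegment` and `validate_segments` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbnormalSegment:
    """An abnormal data segment identified by inclusive index bounds (assumed representation)."""
    start: int  # index of the starting time point
    end: int    # index of the ending time point (inclusive)

    def __post_init__(self):
        # A segment must include at least 3 time points.
        if self.end - self.start + 1 < 3:
            raise ValueError("an abnormal data segment needs at least 3 time points")

    def overlaps(self, other: "AbnormalSegment") -> bool:
        # Two inclusive ranges overlap when each starts no later than the other ends.
        return self.start <= other.end and other.start <= self.end


def validate_segments(segments):
    """Return True if no two segments share a time point."""
    ordered = sorted(segments, key=lambda s: s.start)
    # After sorting by start, checking neighbours is sufficient.
    return all(not a.overlaps(b) for a, b in zip(ordered, ordered[1:]))
```

Checking only neighbouring segments after sorting keeps the validation cheap even when many segments have been labeled.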

Step S300: Acquire a first data search space in the target data sequence.

In an embodiment, the data search space is a set of candidate abnormal data extracted from the target data sequence by a machine learning method. Obtaining the data search space filters out most of the normal data, which prevents similar data segments within the normal data from being misjudged as similar abnormal data and returned by the search, thereby improving the search accuracy for similar abnormal data. In addition, obtaining the data search space narrows the search range for similar abnormal data, thereby improving the search efficiency.

In an embodiment, the first abnormal data segment may be a data segment outside the first data search space, or may be a data segment in the first data search space, which is not specifically limited in this embodiment. When the first abnormal data segment is a data segment in the first data search space, abnormal data has already been preliminarily extracted when the first data search space was obtained, so obtaining the first abnormal data segment in the first data search space is more accurate and effective.

Step S400: Acquire a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment.

In an embodiment, the first abnormal data segment can be used as an abnormal data segment template, and the data segments in the first data search space can be compared with it to find a data segment that is the same as or similar to the first abnormal data segment; that data segment is the second abnormal data segment. Therefore, by taking the first abnormal data segment as the template and obtaining the corresponding second abnormal data segment in the first data search space, the abnormal data in the target data sequence can be found. Consequently, for time series indicator data with a large amount of abnormal data but few abnormality types, this embodiment can improve the efficiency of finding abnormal data compared with the traditional method of searching manually, thereby saving human resources and time.

In an embodiment, all the data in the first data search space can be treated as one data segment, and the dynamic time warping (DTW) algorithm can be used to calculate its similarity to the first abnormal data segment. When the result indicates similarity, the second abnormal data segment corresponding to the first abnormal data segment can be determined in the first data search space. Alternatively, the data in the first data search space can be divided into multiple data segments with the same length as the first abnormal data segment, and methods such as Euclidean distance or the Pearson correlation coefficient can be used to calculate the similarity between the first abnormal data segment and each of these data segments, so as to determine the second abnormal data segment corresponding to the first abnormal data segment in the first data search space.
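As an illustration of the equal-length case, the following sketch splits a search space into segments of the template's length and scores each one with Euclidean distance and the Pearson correlation coefficient. The helper names, the non-overlapping split, and the sample values are assumptions made for illustration only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def split_into_segments(data, length):
    """Cut data into consecutive, non-overlapping segments of the given length."""
    return [data[i:i + length] for i in range(0, len(data) - length + 1, length)]


template = [1.0, 5.0, 9.0, 5.0]                          # first abnormal data segment
search_space = [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 9.0, 5.0]  # illustrative values
segments = split_into_segments(search_space, len(template))
distances = [euclidean_distance(template, s) for s in segments]
correlations = [pearson_correlation(template, s) for s in segments]
```

Note that the two measures point in opposite directions: a smaller Euclidean distance means higher similarity, whereas a larger Pearson coefficient does, so any threshold has to be chosen per measure.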

Step S500: Mark the second abnormal data segment.

In one embodiment, when the second abnormal data segment corresponding to the first abnormal data segment is found in the first data search space, the second abnormal data segment can be labeled to form abnormal label data, from which a high-quality training data set can be obtained for use in machine learning technologies such as deep learning.

In one embodiment, because the data processing method uses the above-mentioned steps S100, S200, S300, S400, and S500, the first abnormal data segment and the first data search space are acquired in the target data sequence, so that the first abnormal data segment can be used as an abnormal data segment template and the corresponding second abnormal data segment can be obtained and labeled in the first data search space according to that template; that is, the other abnormal data segments in the target data sequence are labeled. Therefore, for time series indicator data with a large amount of abnormal data but few abnormality types, the data processing method of this embodiment can improve the efficiency of labeling abnormal data compared with traditional manual labeling of abnormal data segments, thereby saving human resources and time.

In addition, referring to FIG. 3, in an embodiment, step S300 includes but is not limited to the following steps:

Step S310: Obtain the first abnormal characteristic value of the target data sequence;

Step S320: Determine the first data position corresponding to the first abnormal characteristic value in the target data sequence according to the first abnormal characteristic value;

Step S330: Acquire the first data search space according to the first data location.

Those skilled in the art can understand that abnormal data is data that deviates from most of the data in the data set. Based on this, the first abnormal characteristic value in this embodiment refers to the deviation value between the abnormal data and the normal data.

In an embodiment, after the first abnormal characteristic value of the target data sequence is obtained, the first data position corresponding to the first abnormal characteristic value in the target data sequence may be determined according to the first abnormal characteristic value; that is, the position of the abnormal data in the target data sequence is determined according to the first abnormal characteristic value. For example, when a data point deviates from most of the data by an amount greater than or equal to the first abnormal characteristic value, the position of that data point can be determined to be the position of abnormal data.

In an embodiment, the first data position may include a start abnormal position, a middle abnormal position, and an end abnormal position. After the start abnormal position, the middle abnormal position, and the end abnormal position are determined, the first data search space can be obtained.
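One simple way to turn per-point deviations into start and end abnormal positions is to group consecutive points whose deviation reaches the threshold. This is only a sketch under that assumption; the function name `find_search_space` and the representation of the search space as a list of index ranges are hypothetical.

```python
def find_search_space(deviations, threshold):
    """Group positions whose deviation >= threshold into contiguous runs.

    Returns a list of (start, end) index pairs with an inclusive end.
    Positions between start and end are the middle abnormal positions.
    """
    runs = []
    start = None
    for i, d in enumerate(deviations):
        if d >= threshold:
            if start is None:
                start = i                # start abnormal position
        elif start is not None:
            runs.append((start, i - 1))  # end abnormal position
            start = None
    if start is not None:
        # The sequence ended while still inside an abnormal run.
        runs.append((start, len(deviations) - 1))
    return runs
```

For example, `find_search_space([0, 0, 3, 4, 0, 5], 3)` groups indices 2 to 3 into one run and index 5 into another.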

It is worth noting that, when the first data search space is obtained from the first abnormal characteristic value, different algorithms such as the LOF (Local Outlier Factor) algorithm, the DBSCAN algorithm, or the isolation forest (iForest) algorithm can be used, which is not specifically limited in this embodiment. Taking the isolation forest algorithm as an example: first, an isolation tree (iTree) is built from the target data sequence, with the data in the target data sequence used as the tree's sample data; then the first abnormal characteristic value is used to perform a binary division of the sample data, separating the sample data that meets the first abnormal characteristic value from the sample data that does not, to form two data sets; the above process is then repeated for these two data sets until the data can no longer be divided or the maximum height of the tree is reached. In this way, the first data position corresponding to the first abnormal characteristic value in the target data sequence can be obtained, so that the first data search space can be determined according to the first data position.

In addition, those skilled in the art can understand that the LOF algorithm, the DBSCAN algorithm, and the isolation forest algorithm are all commonly used algorithms in the field. Therefore, the specific principles of these algorithms will not be repeated here.
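For orientation, the sketch below shows the core idea of the commonly described isolation forest on one-dimensional values: points that are separated from the rest after few random splits are more likely to be anomalous. It is a simplified illustration and not the procedure of this embodiment: split values are drawn at random, no trees are materialised (the random split path is followed for one value at a time), and the usual path-length normalisation is omitted. All names and sample values are hypothetical.

```python
import random


def isolation_depth(value, sample, rng, depth=0, max_depth=8):
    """Number of random splits needed to isolate `value` within `sample`."""
    lo, hi = min(sample), max(sample)
    if len(sample) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains `value`.
    if value < split:
        side = [v for v in sample if v < split]
    else:
        side = [v for v in sample if v >= split]
    return isolation_depth(value, side, rng, depth + 1, max_depth)


def average_depths(data, n_trees=100, seed=0):
    """Average isolation depth of each value over n_trees random split paths."""
    rng = random.Random(seed)
    return [
        sum(isolation_depth(v, data, rng) for _ in range(n_trees)) / n_trees
        for v in data
    ]


# Illustrative deviation values with one obvious outlier at index 7.
deviations = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.16, 5.00, 0.13, 0.17]
depths = average_depths(deviations)
most_anomalous = min(range(len(depths)), key=depths.__getitem__)
```

The position with the smallest average depth is the strongest candidate for the first data position; in practice a cut-off on the depth (or on a normalised score) would decide which positions enter the search space.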

In addition, referring to FIG. 4, in an embodiment, step S310 includes but is not limited to the following steps:

Step S311: Obtain the first baseline prediction data of the target data sequence;

Step S312: Obtain a first abnormal characteristic value according to the deviation value between the first baseline prediction data and the data in the target data sequence.

In one embodiment, when obtaining the first abnormal characteristic value of the target data sequence, the baseline prediction data of the normal data in the target data sequence may be obtained first, and the first abnormal characteristic value is then obtained from the deviation value (that is, the absolute difference) between the data in the target data sequence and the baseline prediction data.

In one embodiment, different baseline prediction methods can be used to obtain baseline prediction data of the normal data in the target data sequence, which is not specifically limited in this embodiment. For example, time series baseline prediction methods such as the difference method, the moving average method, the weighted moving average method, the exponentially weighted moving average method, the autoregressive integrated moving average (ARIMA) method, or the triple exponential smoothing method can be used, as can regression methods such as random forest and XGBoost (eXtreme Gradient Boosting). Using multiple baseline prediction methods yields multiple first abnormal characteristic values; performing the corresponding steps with these different first abnormal characteristic values combined can improve the accuracy and generalization ability of obtaining the first data search space.
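Two of the simpler baseline prediction methods named above can be sketched as follows. The window size, the smoothing factor `alpha`, the choice to predict each point only from preceding points, and the handling of the first point are illustrative assumptions, not values given in this embodiment; both functions assume a non-empty input.

```python
def moving_average_baseline(data, window=3):
    """Predict each point as the mean of up to `window` preceding points."""
    baseline = []
    for i in range(len(data)):
        history = data[max(0, i - window):i]
        # With no history yet, fall back to the observed value itself.
        baseline.append(sum(history) / len(history) if history else data[i])
    return baseline


def ewma_baseline(data, alpha=0.3):
    """Exponentially weighted moving average of the preceding points."""
    baseline = [data[0]]  # the first point has no history; use it as its own baseline
    for i in range(1, len(data)):
        # Blend the latest observation with the previous baseline value.
        baseline.append(alpha * data[i - 1] + (1 - alpha) * baseline[-1])
    return baseline
```

Because each method reacts differently to trends and spikes, running several of them and combining the resulting deviations is one way to realise the idea of using multiple first abnormal characteristic values.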

In an embodiment, after obtaining the first baseline prediction data of the target data sequence, the first abnormal characteristic value can be obtained by the following formula:

R = |P_i - X_i|

where R is the first abnormal characteristic value, X_i is the i-th data point in the target data sequence, and P_i is the first baseline prediction data for that data point.
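As a small worked instance of the formula, the following sketch evaluates the absolute deviation at every position of a short sequence. The function name and the sample values are made up for illustration.

```python
def abnormal_characteristic_values(predictions, observations):
    """Return |P_i - X_i| for each position i."""
    if len(predictions) != len(observations):
        raise ValueError("baseline and data must have the same length")
    return [abs(p - x) for p, x in zip(predictions, observations)]


x = [10.0, 11.0, 30.0, 10.0]  # illustrative target data sequence X
p = [10.0, 10.5, 10.5, 10.5]  # illustrative baseline prediction P
r = abnormal_characteristic_values(p, x)
```

In this example the third point deviates from its baseline by 19.5, far more than the others, which marks it as a candidate abnormal position.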

Those skilled in the art can understand that the difference method, the moving average method, the weighted moving average method, the exponentially weighted moving average method, the autoregressive integrated moving average method, the triple exponential smoothing method, random forest, and XGBoost are all commonly used algorithms in the field. Therefore, the specific principles of these algorithms will not be repeated here.

In addition, referring to FIG. 5, in an embodiment, step S400 includes but is not limited to the following steps:

Step S410: Determine a third data segment in the first data search space;

Step S420: Perform similarity calculation on the first abnormal data segment and the third data segment to obtain a first similarity metric value corresponding to the third data segment;

Step S430: Determine the corresponding third data segment as the second abnormal data segment according to the first similarity metric value.

In an embodiment, when obtaining the second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment, a third data segment may first be determined in the first data search space. Once the third data segment is determined, a similarity measurement algorithm is used to calculate the similarity between the first abnormal data segment and the third data segment, yielding a first similarity metric value corresponding to the third data segment. When the first similarity metric value indicates that the first abnormal data segment is similar to the third data segment, the corresponding third data segment can be determined to be a second abnormal data segment (that is, one of the remaining abnormal data segments in the first data search space). In other words, whether the third data segment is a second abnormal data segment is determined by comparing the degree of similarity between the first abnormal data segment and the third data segment. Compared with traditional manual labeling of abnormal data segments, this embodiment can improve the efficiency of labeling abnormal data, thereby saving human resources and time.

In an embodiment, there may be one third data segment or multiple third data segments, which is not specifically limited in this embodiment. When there is one third data segment, all the data in the first data search space can be determined as the third data segment, or a continuous part of the data in the first data search space can be determined as the third data segment, which is not specifically limited in this embodiment. When there are multiple third data segments, the data in the first data search space can be divided into multiple data segments of equal length or into multiple data segments of unequal length, which is not specifically limited in this embodiment.

In an embodiment, the similarity between the first abnormal data segment and the third data segment can be calculated using different similarity measurement algorithms. For example, for multiple third data segments of equal length, Euclidean distance, the Pearson correlation coefficient, or the Spearman rank correlation coefficient can be used to calculate the similarity between the first abnormal data segment and each third data segment. As another example, for multiple third data segments of unequal length, the DTW algorithm or an improved fast DTW algorithm can be used. The specific way of calculating the similarity between the first abnormal data segment and the third data segment can be selected according to actual needs, which is not specifically limited in this embodiment. It is worth noting that the improved fast DTW algorithms can include the FastDTW algorithm, the SparseDTW algorithm, the LB_Keogh algorithm, the LB_Improved algorithm, and so on. Among them, the FastDTW algorithm reduces computational complexity by narrowing the search space and abstracting the data, while keeping the loss of accuracy bounded.

Those skilled in the art can understand that Euclidean distance, the Pearson correlation coefficient, the Spearman rank correlation coefficient, the DTW algorithm, and the various improved fast DTW algorithms are all commonly used algorithms in the field. Therefore, the specific principles of these algorithms will not be repeated here.
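To show why DTW suits segments of unequal length, the following is a minimal sketch of the classic quadratic-time DTW distance using absolute difference as the local cost. It is the plain algorithm, not FastDTW or any of the other accelerated variants, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A template `[1, 5, 9, 5]` and a stretched copy `[1, 5, 5, 9, 9, 5]` have a DTW distance of zero, because warping lets repeated points align with the same template point, whereas Euclidean distance is not even defined for the two different lengths.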

In addition, in an embodiment, step S430 includes but is not limited to the following steps:

Step S431: When the first similarity metric value is less than the preset threshold, it is determined that the third data segment corresponding to the first similarity metric value is the second abnormal data segment.

In one embodiment, the first similarity metric value indicates the degree of similarity between the first abnormal data segment and the third data segment: the smaller the first similarity metric value, the more similar the first abnormal data segment and the third data segment are. Therefore, when the first similarity metric value is less than the preset threshold, it can be determined that the first abnormal data segment and the third data segment have a high degree of similarity, so that the third data segment corresponding to the first similarity metric value can be determined to be the second abnormal data segment.

In an embodiment, the preset threshold can be selected according to the similarity measurement algorithm used. For example, different preset thresholds can be used for Euclidean distance and for the DTW algorithm, which is not specifically limited in this embodiment.

In addition, referring to FIG. 6, in an embodiment, when the number of third data segments is more than two, step S430 may include but is not limited to the following steps:

Step S432: Obtain a first similarity metric value whose value is less than a preset threshold;

Step S433: Sort the first similarity measure values whose values are less than the preset threshold value from small to large to adjust the sorting of the corresponding third data segment;

Step S434: Determine that the first N third data segments are second abnormal data segments, where N is greater than or equal to 1.

In one embodiment, when the number of the third data segment is more than two, the number of the acquired first similarity metric values corresponding to the third data segment is also more than two. In this case , You can first obtain the first similarity metric value whose value is less than the preset threshold to filter out the third data segment with a certain degree of similarity with the first abnormal data segment, and exclude the remaining third data segments with less similarity. Next, sort the first similarity metric values whose values are less than the preset threshold from small to large to adjust the sorting of the corresponding third data segments, so that the third data segments that are similar to the first abnormal data segment to a certain degree can be Reorder according to the degree of similarity from high to low, and then, according to actual application conditions, determine the first few third data segments as the second abnormal data segments.

To illustrate with a specific example, the FastDTW algorithm can be used to compare each third data segment in the first data search space with the first abnormal data segment in order to calculate the similarity metric value of each third data segment. All third data segments are then sorted according to their similarity metric values to obtain several third data segments with a higher degree of similarity to the first abnormal data segment, so that the second abnormal data segments that need to be labeled can be determined based on these third data segments.
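The comparison described above can be sketched as follows. Note that this sketch uses the classic, exact dynamic time warping recurrence rather than FastDTW (which is an approximation of it), and it assumes that candidate segments are sliding windows of the same length as the first abnormal data segment. The names `dtw_distance` and `rank_windows` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact DTW distance with absolute difference as the local cost."""
    if not a or not b:
        raise ValueError("both segments must be non-empty")
    inf = math.inf
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # cost of aligning two empty prefixes
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def rank_windows(template: Sequence[float], space: Sequence[float],
                 step: int = 1) -> List[Tuple[float, int]]:
    """Return (distance, start index) pairs, most similar window first."""
    width = len(template)
    ranked = [(dtw_distance(template, space[s:s + width]), s)
              for s in range(0, len(space) - width + 1, step)]
    ranked.sort()
    return ranked


first_abnormal_segment = [0.0, 5.0, 0.0]
search_space = [0.0, 0.0, 5.0, 0.0, 0.0, 1.0, 0.0]
best_distance, best_start = rank_windows(first_abnormal_segment, search_space)[0]
print(best_distance, best_start)  # 0.0 1
```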

In an embodiment, the optimal value of N may differ for different first abnormal data segments and different first data search spaces. If the value of N is too small, some abnormal data segments will be missed and left unlabeled; if the value of N is too large, data segments with relatively low similarity may be identified as abnormal, which reduces the accuracy rate. Therefore, the value of N needs to be appropriately selected according to the actual application situation. To select the value of N more precisely, a curve of accuracy rate against recall rate can be established and the AUC value calculated in order to obtain the best value of N.

It is worth noting that the accuracy rate refers to the proportion of labeled abnormal data segments that are correctly labeled; the recall rate refers to the proportion of manually labeled abnormal data segments that are correctly labeled; and AUC (Area Under Curve) is an evaluation indicator of a model. Put simply, a pair of samples (one positive sample and one negative sample) is randomly selected and the trained classifier is used to predict both samples; the AUC is the probability that the predicted probability for the positive sample is greater than that for the negative sample.
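One possible way to tabulate the accuracy rate and recall rate for each candidate value of N is sketched below. It assumes that a manually labeled reference set is available and that the area under the curve is approximated with the trapezoidal rule over the recall axis; the embodiment does not specify either detail, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def accuracy_recall_curve(ranked: List[int],
                          truth: Iterable[int]) -> List[Tuple[int, float, float]]:
    """Return (N, accuracy rate, recall rate) for N = 1..len(ranked).

    ranked holds candidate segment ids ordered from most to least similar;
    truth holds the ids that were manually labeled as abnormal.
    """
    truth_set = set(truth)
    if not truth_set:
        raise ValueError("at least one manually labeled segment is required")
    hits = 0
    curve = []
    for n, segment_id in enumerate(ranked, start=1):
        if segment_id in truth_set:
            hits += 1
        curve.append((n, hits / n, hits / len(truth_set)))
    return curve


def area_under_curve(curve: List[Tuple[int, float, float]]) -> float:
    """Trapezoidal area of accuracy rate over recall rate (an assumption)."""
    area = 0.0
    prev_recall, prev_accuracy = 0.0, curve[0][1]
    for _, accuracy, recall in curve:
        area += (recall - prev_recall) * (accuracy + prev_accuracy) / 2
        prev_recall, prev_accuracy = recall, accuracy
    return area


curve = accuracy_recall_curve([7, 3, 9, 1], truth={7, 9})
for n, accuracy, recall in curve:
    print(n, round(accuracy, 3), round(recall, 3))
print(round(area_under_curve(curve), 4))
```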

In addition, referring to FIG. 7, in an embodiment, step S100 may include but is not limited to the following steps:

Step S110: Obtain multiple data sequences to be tested;

Step S120, performing clustering processing on a plurality of data sequences to be tested to obtain a target data category;

Step S130: Determine a target data sequence from each target data category.

In one embodiment, the number of data sequences to be tested collected from the network is very large, so manually determining the first abnormal data segment in each data sequence to be tested involves a heavy workload. In addition, because the acquisition period of a data sequence to be tested may be short, the data sequence may not contain a large amount of similar abnormal data, which makes it more difficult to obtain the first abnormal data segment in that data sequence. However, for a single data indicator, many data sequences to be tested can be collected according to the different resource objects bound to it in the network. For example, a medium-scale network has tens of thousands of port resources, so for one data indicator alone, tens of thousands of data sequences to be tested can be collected, and these data sequences often have certain similarities to one another. For example, consider the traffic time series data of a base station access port deployed in school A and that of a base station access port deployed in school B: since daily life in school A and school B follows similar patterns, these two data sequences to be tested will be similar to a large extent. In similar data sequences to be tested, the characteristics of the abnormal data also have certain commonalities. Based on the above, the multiple acquired data sequences to be tested can be clustered to obtain target data categories, and a target data sequence can then be determined from each target data category, which provides the necessary basis for the subsequent steps.

In one embodiment, the number of target data categories obtained by clustering the multiple data sequences to be tested may be one or more, depending on how similar the data sequences to be tested are. For example, if all of the data sequences to be tested are relatively similar, they can all be classified into a single target data category; if only some of the data sequences to be tested are relatively similar to one another, the data sequences to be tested can be divided into multiple target data categories, each of which includes a part of the data sequences to be tested.

In addition, referring to FIG. 8, in an embodiment, the data processing method may further include the following steps:

Step S600, acquiring a second data search space in the remaining data sequences to be tested in each target data category;

Step S700: Use the first abnormal data segment in the target data sequence to obtain the second abnormal data segment in the second data search space in the remaining data sequence to be tested, respectively.

In one embodiment, when a target data sequence is determined in each target data category, the second abnormal data segment in the target data sequence can be obtained and labeled through the steps in the above-mentioned embodiment. For the technical principle and the technical effects brought about, reference may be made to the relevant description in the foregoing embodiment, which will not be repeated here.

In one embodiment, since the data sequences to be tested in each target data category have a certain similarity, the first abnormal data segment obtained in the target data sequence can be applied to the remaining data sequences to be tested in the same target data category in order to obtain and label their second abnormal data segments. Therefore, a second data search space can first be obtained from each of the remaining data sequences to be tested in each target data category, and the first abnormal data segment in the target data sequence can then be used to obtain the second abnormal data segment from the second data search space of each remaining data sequence to be tested. Because the first abnormal data segment in the target data sequence alone suffices to obtain the second abnormal data segments in the second data search spaces of the remaining data sequences to be tested, the step of obtaining a corresponding first abnormal data segment in each of the remaining data sequences to be tested can be omitted. The second abnormal data segment in each data sequence to be tested can thus be obtained more concisely and efficiently, which improves the efficiency of labeling abnormal data across multiple time series indicator data.

It is worth noting that the second data search space in this embodiment and the first data search space in the foregoing embodiment are technical features of the same type. The difference lies only in the object to which each belongs: the first data search space belongs to the target data sequence, whereas the second data search space belongs to the remaining data sequences to be tested in the same target data category. To avoid duplication of content, the second data search space is not described in detail here; for related explanations, reference may be made to the explanations of the first data search space in the foregoing embodiment.

In addition, it is worth noting that step S700 in this embodiment is similar to step S400 in the embodiment shown in FIG. In the foregoing embodiment, the execution object of step S400 is the first data search space of the target data sequence, and the execution object of step S700 in this embodiment is the second data search space of the remaining data sequences to be tested in the same target data class. To avoid repetition of content, step S700 is not described in detail here. For related explanations of step S700, reference may be made to related explanations of step S400 in the foregoing embodiment.

In addition, referring to FIG. 9, in an embodiment, step S600 includes but is not limited to the following steps:

Step S610: Obtain the second abnormal characteristic value of the remaining data sequence to be tested in each target data category;

Step S620: Determine the second data position corresponding to the second abnormal characteristic value in the remaining data sequence to be tested according to the second abnormal characteristic value;

Step S630: Obtain the second data search space of the remaining data sequence to be tested according to the second data position.

In one embodiment, the second abnormal feature value and the second data position in this embodiment are technical features of the same type as the first abnormal feature value and the first data position, respectively, in the foregoing embodiment. The difference lies only in the object to which they belong: the first abnormal feature value and the first data position both belong to the target data sequence, whereas the second abnormal feature value and the second data position both belong to the remaining data sequences to be tested in the same target data category. To avoid duplication of content, the second abnormal feature value and the second data position are not described in detail here; for related explanations, reference may be made to the explanations of the first abnormal feature value and the first data position in the foregoing embodiment.

In an embodiment, steps S610, S620, and S630 in this embodiment are similar to steps S310, S320, and S330 in the embodiment shown in FIG. 3, with similar technical principles and technical effects. The difference lies only in the execution object: in the foregoing embodiment, steps S310, S320, and S330 are executed on the target data sequence, whereas in this embodiment steps S610, S620, and S630 are executed on the remaining data sequences to be tested in the same target data category. To avoid duplication of content, steps S610, S620, and S630 are not described in detail here; for related explanations, reference may be made to the explanations of steps S310, S320, and S330 in the foregoing embodiment.

In addition, referring to FIG. 10, in an embodiment, step S610 includes but is not limited to the following steps:

Step S611: Obtain the second baseline prediction data of the remaining data sequence to be tested in each target data category;

Step S612: Obtain the second abnormal characteristic value of the remaining data sequence to be tested according to the deviation value of the second baseline prediction data and the data in the remaining data sequence to be tested.

In one embodiment, the second baseline prediction data in this embodiment and the first baseline prediction data in the foregoing embodiment are technical features of the same type. The difference lies only in the object to which each belongs: the first baseline prediction data belongs to the target data sequence, whereas the second baseline prediction data belongs to the remaining data sequences to be tested in the same target data category. To avoid duplication of content, the second baseline prediction data is not described in detail here; for related explanations, reference may be made to the explanations of the first baseline prediction data in the foregoing embodiment.

In one embodiment, steps S611 and S612 in this embodiment are similar to steps S311 and S312 in the embodiment shown in FIG. 4, with similar technical principles and technical effects. The difference lies only in the execution object: in the foregoing embodiment, steps S311 and S312 are executed on the target data sequence, whereas in this embodiment steps S611 and S612 are executed on the remaining data sequences to be tested in the same target data category. To avoid duplication of content, steps S611 and S612 are not described in detail here; for related explanations, reference may be made to the explanations of steps S311 and S312 in the foregoing embodiment.

In addition, referring to FIG. 11, in an embodiment, step S700 includes but is not limited to the following steps:

Step S710, respectively determine the fourth data segment in the second data search space in the remaining data sequence to be tested;

Step S720: Perform similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in the remaining data sequences to be tested to obtain a second similarity metric value corresponding to the fourth data segment;

Step S730: Determine the corresponding fourth data segment in the remaining data sequence to be tested as the second abnormal data segment in the remaining data sequence to be tested according to the second similarity metric value.

In one embodiment, the fourth data segment and the second similarity metric value in this embodiment are technical features of the same type as the third data segment and the first similarity metric value, respectively, in the foregoing embodiment. The difference lies only in the object to which they belong: the third data segment belongs to the first data search space of the target data sequence, and the first similarity metric value corresponds to the third data segment; the fourth data segment belongs to the second data search space of the remaining data sequences to be tested in the same target data category, and the second similarity metric value corresponds to the fourth data segment. To avoid duplication of content, the fourth data segment and the second similarity metric value are not described in detail here; for related explanations, reference may be made to the explanations of the third data segment and the first similarity metric value in the foregoing embodiment.

In one embodiment, steps S710, S720, and S730 in this embodiment are similar to steps S410, S420, and S430 in the embodiment shown in FIG. 5, with similar technical principles and technical effects. The difference lies only in the execution object: in the foregoing embodiment, steps S410, S420, and S430 are executed on the first data search space of the target data sequence, whereas in this embodiment steps S710, S720, and S730 are executed on the second data search space of the remaining data sequences to be tested in the same target data category. To avoid duplication of content, steps S710, S720, and S730 are not described in detail here; for related explanations, reference may be made to the explanations of steps S410, S420, and S430 in the foregoing embodiment.

In addition, in an embodiment, step S730 includes but is not limited to the following steps:

Step S731: When the second similarity metric value is less than the preset threshold, it is determined that the corresponding fourth data segment in the remaining data sequence to be tested is the second abnormal data segment in the remaining data sequence to be tested.

In one embodiment, the second similarity metric value represents the degree of similarity between the first abnormal data segment and the fourth data segment: the smaller the second similarity metric value, the higher the degree of similarity between the two. Therefore, when the second similarity metric value is less than the preset threshold, it can be determined that the first abnormal data segment and the fourth data segment have a high degree of similarity, so the fourth data segment corresponding to that second similarity metric value can be determined to be the second abnormal data segment.

In addition, referring to FIG. 12, in an embodiment, when the number of fourth data segments is more than two, step S730 may include but is not limited to the following steps:

Step S732: For each of the remaining data sequences to be tested, obtain the second similarity metric values that are less than a preset threshold;

Step S733, sorting the second similarity measure values whose values are less than the preset threshold value from small to large, so as to respectively adjust the sorting of the corresponding fourth data segments in the remaining data sequences to be tested;

Step S734: Determine the first N fourth data segments as the second abnormal data segments in the remaining data sequences to be tested, where N is greater than or equal to 1.

In one embodiment, steps S732, S733, and S734 in this embodiment are similar to steps S432, S433, and S434 in the embodiment shown in FIG. 6, with similar technical principles and technical effects. The difference lies only in the execution object: in the foregoing embodiment, steps S432, S433, and S434 are executed on the target data sequence, whereas in this embodiment steps S732, S733, and S734 are executed on the remaining data sequences to be tested in the same target data category. To avoid duplication of content, steps S732, S733, and S734 are not described in detail here; for related explanations, reference may be made to the explanations of steps S432, S433, and S434 in the foregoing embodiment.

In addition, referring to FIG. 13, in an embodiment, step S120 may include but is not limited to the following steps:

Step S121: Perform data preprocessing on multiple data sequences to be tested, respectively, to obtain multiple first preprocessed data sequences;

Step S122: Perform baseline extraction processing on the multiple first pre-processed data sequences, respectively, to obtain multiple second pre-processed data sequences;

Step S123, clustering a plurality of second pre-processed data sequences according to similarity, to obtain a target data category.

In one embodiment, when multiple data sequences to be tested need to be clustered, data preprocessing may first be performed on each of them to obtain multiple first pre-processed data sequences. Baseline extraction processing is then performed on each of the first pre-processed data sequences to obtain multiple second pre-processed data sequences, and the second pre-processed data sequences are clustered according to similarity to obtain the target data category. Once the data sequences to be tested have been clustered into target data categories, labeling abnormal data segments for every individual data sequence to be tested is transformed into labeling abnormal data segments for each target data category, which reduces the processing complexity and processing time and thereby improves the efficiency of labeling abnormal data.

In an embodiment, performing baseline extraction processing on the first pre-processed data sequence can smooth out abnormal parts and noise parts in the data sequence to be tested, thereby improving the accuracy of the similarity measurement between the data sequences to be tested.

It is worth noting that the baseline extraction processing in step S122 in this embodiment has a similar technical principle to the step of obtaining baseline prediction data using the baseline prediction method in the embodiment shown in FIG. For the relevant explanations of performing the baseline extraction processing in S122, reference may be made to the relevant explanations of using the baseline prediction method to obtain baseline prediction data in the embodiment shown in FIG.

In addition, referring to FIG. 14, in an embodiment, step S121 may include but is not limited to the following steps:

Step S1211, performing missing value filling processing on the multiple data sequences to be tested respectively, to obtain multiple filling data sequences;

Step S1212: Perform data standardization processing on the multiple filling data sequences, respectively, to obtain multiple first preprocessed data sequences.

In one embodiment, the data sequences to be tested collected from the network may have missing values to varying degrees for various reasons. These missing values not only cause the data sequences to be tested to differ in length, which makes some similarity measurement algorithms difficult to use, but also affect the accuracy of the baseline extraction processing. To solve these problems, this embodiment first fills in the missing values to obtain a filling data sequence, and then performs data standardization processing on the filling data sequence to obtain the first pre-processed data sequence.

In one embodiment, a linear interpolation filling method may be used to perform the missing value filling processing. The linear interpolation filling method can smooth the waveform of the data sequence to be tested, which facilitates the baseline extraction processing. For example, for time series indicator data, the specific position of a missing value can be determined based on continuity in time. Once the position of the missing value is determined, the specific value to be filled in can be obtained from the data before and after that position; for example, the average of the preceding and following data can be used as the value to be filled in. Those skilled in the art can understand that the linear interpolation filling method is an algorithm commonly used in the art, so its specific principle is not repeated here.
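A minimal sketch of linear interpolation filling is given below. It assumes that the sequence has already been aligned to a regular time grid, with `None` marking each missing sample, and that gaps at the very start or end are filled with the nearest observed value; neither assumption comes from the embodiment, and the function name is hypothetical.

```python
from typing import List, Optional


def fill_missing_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between observed neighbors."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    # Assumption: leading and trailing gaps copy the nearest observed value.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: interpolate between the observed points on either side.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


print(fill_missing_linear([1.0, None, 3.0, None, None, 6.0, None]))
```

For a single missing sample, the interpolated value equals the average of the preceding and following data, which matches the example given in the text.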

In one embodiment, performing data standardization processing on the filling data sequence transforms and maps the data sequence to be tested to a specific interval, which helps eliminate the dimensional differences between different data sequences to be tested so that their similarity can be compared together. In this embodiment, the Z-Score method may be used for data standardization processing, and the calculation formula is as follows:

$$x'_i = \frac{x_i - \bar{x}}{\sigma}$$

where $x'_i$ is the first pre-processed data sequence, $x_i$ is the data sequence to be tested, $\bar{x}$ is the mean value of the data sequence to be tested, and $\sigma$ is the standard deviation of the data sequence to be tested.
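The Z-Score standardization described above can be sketched as follows. The use of the population standard deviation and the handling of a constant sequence (mapped to all zeros) are assumptions, since the embodiment does not state them.

```python
import math
from typing import List, Sequence


def z_score(sequence: Sequence[float]) -> List[float]:
    """Standardize a sequence to zero mean and unit standard deviation."""
    if not sequence:
        raise ValueError("the sequence must be non-empty")
    count = len(sequence)
    mean = sum(sequence) / count
    # Assumption: population standard deviation (divide by count).
    std = math.sqrt(sum((x - mean) ** 2 for x in sequence) / count)
    if std == 0:
        # Assumption: a constant sequence carries no shape information.
        return [0.0] * count
    return [(x - mean) / std for x in sequence]


print(z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```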

In addition, in an embodiment, the clustering of multiple second pre-processed data sequences according to similarity in step S123 may specifically include, but is not limited to, the following steps:

Step S1231: Use the DBSCAN algorithm to cluster the multiple second pre-processed data sequences according to similarity, where the parameters of the DBSCAN algorithm include a distance function, a neighborhood number threshold, and a neighborhood distance threshold, and the results of the DBSCAN algorithm include the number of classifications and the abnormal proportion.

Those skilled in the art can understand that the DBSCAN algorithm is one of the commonly used clustering algorithms, and the DBSCAN algorithm does not need to determine the number of cluster centers in advance. The key parameters of DBSCAN algorithm include distance function, neighborhood number threshold and neighborhood distance threshold, while the result of DBSCAN algorithm includes classification number and abnormal proportion.

In one embodiment, the Euclidean distance function can be used as the distance function, and the neighborhood number threshold can be set to 4. The neighborhood distance threshold, by contrast, needs to be estimated dynamically from the data set, and this parameter has a significant impact on the clustering results.
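To make the parameters and results concrete, the following is a compact, illustrative DBSCAN written against a precomputed distance matrix. It follows the standard algorithm, in which a point counts as its own neighbor, and is not taken from the embodiment; in practice a library implementation would normally be used. The names `dbscan`, `summarize`, and the toy sequences are hypothetical.

```python
import math
from typing import List, Optional, Tuple

NOISE = -1


def dbscan(dist: List[List[float]], eps: float, min_pts: int = 4) -> List[int]:
    """Label each item with a cluster id (0, 1, ...) or NOISE."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joined but not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


def summarize(labels: List[int]) -> Tuple[int, float]:
    """Return (number of classifications, abnormal proportion)."""
    return len(set(labels) - {NOISE}), labels.count(NOISE) / len(labels)


sequences = [[0.0, 0.0], [0.0, 0.1], [0.0, 0.2], [0.0, 0.3],
             [5.0, 5.0], [5.0, 5.1], [5.0, 5.2], [5.0, 5.3],
             [20.0, 20.0]]
matrix = [[math.dist(a, b) for b in sequences] for a in sequences]
labels = dbscan(matrix, eps=0.5, min_pts=4)
print(labels, summarize(labels))
```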

In addition, referring to FIG. 15, in an embodiment, the neighborhood distance threshold may be obtained by a heuristic algorithm, where the heuristic algorithm includes but is not limited to the following steps:

Step S810: Calculate the similarity between a plurality of second pre-processed data sequences by using a distance function to obtain similarity matrix data;

Step S820: Calculate the k-dist distance based on the similarity matrix data to obtain the k-dist sequence;

Step S830, obtaining an initial distance threshold parameter based on the k-dist sequence;

Step S840: Adjust the initial distance threshold parameter to obtain the neighborhood distance threshold.

In an embodiment, the k-dist distance refers to the distance between a data object and its k-th closest object. When a suitable neighborhood distance threshold needs to be determined, the similarity between the multiple second pre-processed data sequences can first be calculated using a distance function, such as the Euclidean distance function, to form similarity matrix data. The k-dist distances are then calculated based on the similarity matrix data to obtain a k-dist sequence, an initial distance threshold parameter is obtained based on the k-dist sequence, and the initial distance threshold parameter is adjusted to obtain an appropriate neighborhood distance threshold. Once obtained through the above heuristic algorithm, the neighborhood distance threshold can be applied in the foregoing embodiment, in which the DBSCAN algorithm clusters the multiple second pre-processed data sequences according to similarity, so that the target data category can be obtained.

In one embodiment, before step S810 is performed, initial thresholds such as the maximum distance threshold of the neighborhood, the minimum length threshold, the slope threshold, and the slope difference threshold may be set first; steps S810, S820, S830, and S840 are then performed after these initial thresholds have been set.

In an embodiment, when step S820 is performed, after the k-dist distance of each k-dist point is calculated based on the similarity matrix data, the obtained k-dist distances can be sorted in ascending order, and the k-dist points whose k-dist distance is 0 and the k-dist points whose k-dist distance exceeds the maximum distance threshold of the neighborhood are excluded. The remaining k-dist points constitute the k-dist sequence.
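The construction of the k-dist sequence can be sketched as follows, assuming the similarity matrix data is a square matrix of pairwise distances. The value of k is not stated in the embodiment; the default of 4 used here simply mirrors the neighborhood number threshold and is an assumption, as is the function name.

```python
from typing import List


def k_dist_sequence(dist: List[List[float]], k: int = 4,
                    max_dist: float = float("inf")) -> List[float]:
    """Return the sorted k-dist distances, filtered as described in S820."""
    k_dists = []
    for i, row in enumerate(dist):
        # Distances from object i to every other object, nearest first.
        others = sorted(d for j, d in enumerate(row) if j != i)
        if len(others) >= k:
            k_dists.append(others[k - 1])  # distance to the k-th closest object
    k_dists.sort()
    # Exclude zero distances and distances above the maximum distance threshold.
    return [d for d in k_dists if 0 < d <= max_dist]


points = [0.0, 1.0, 2.0, 3.0, 10.0]
matrix = [[abs(a - b) for b in points] for a in points]
print(k_dist_sequence(matrix, k=2, max_dist=5.0))  # [1.0, 1.0, 2.0, 2.0]
```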

In addition, in an embodiment, step S830 may include but is not limited to the following steps:

Step S831: Calculate the slopes between each k-dist point in the k-dist sequence and its two adjacent points (the preceding point and the following point); when both slopes at the current k-dist point are less than the preset slope threshold and the difference between the two slopes is less than the preset slope difference threshold, determine the current k-dist point as a candidate distance threshold;

Step S832: Determine the one with the largest value among the candidate distance thresholds as the initial distance threshold parameter.

In an embodiment, when the initial distance threshold parameter is to be obtained based on the k-dist sequence, the relatively flat k-dist points in the k-dist sequence may first be determined as candidate distance thresholds. Specifically, the slopes between each k-dist point in the k-dist sequence and its two adjacent points (preceding and following) are calculated first. If the slope between the current k-dist point and the preceding adjacent point (which can be defined as the left slope) and the slope between the current k-dist point and the following adjacent point (which can be defined as the right slope) are both less than the preset slope threshold, and the difference between the left slope and the right slope is less than the preset slope difference threshold, the current k-dist point is determined as a candidate distance threshold. When multiple candidate distance thresholds are obtained, they can be sorted in descending order, and the candidate distance threshold with the maximum value is taken as the initial distance threshold parameter.
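The candidate selection can be sketched as follows. The sketch assumes that adjacent k-dist points are one unit apart on the horizontal axis, so that a slope is simply the difference between neighboring values, and that the slope difference is compared as an absolute value. The minimum length threshold mentioned among the initial thresholds is not used here because the text does not describe its role. The function name is hypothetical.

```python
from typing import List, Optional


def initial_distance_threshold(k_dist: List[float], slope_threshold: float,
                               slope_diff_threshold: float) -> Optional[float]:
    """Return the largest 'flat' k-dist value, or None if there is none."""
    candidates = []
    for i in range(1, len(k_dist) - 1):
        left_slope = k_dist[i] - k_dist[i - 1]    # unit spacing assumed
        right_slope = k_dist[i + 1] - k_dist[i]
        if (left_slope < slope_threshold
                and right_slope < slope_threshold
                and abs(left_slope - right_slope) < slope_diff_threshold):
            candidates.append(k_dist[i])
    # The candidate with the maximum value becomes the initial parameter.
    return max(candidates) if candidates else None


sequence = [1.0, 1.1, 1.2, 1.3, 2.5, 4.0]
print(initial_distance_threshold(sequence, slope_threshold=0.2,
                                 slope_diff_threshold=0.05))  # 1.2
```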

In addition, in an embodiment, step S840 may include but is not limited to the following steps:

Step S841, obtaining the step length;

Step S842: Adjust the initial distance threshold parameter according to the step length to obtain a distance adjustment threshold; when the number of classifications decreases, determine the distance adjustment threshold obtained in the previous adjustment as the neighborhood distance threshold.

In one embodiment, after the initial distance threshold parameter is determined, it can be further optimized. It is worth noting that the optimization should reduce the abnormal proportion as much as possible on the premise that the number of classifications remains unchanged. As the value of the initial distance threshold parameter increases, the abnormal proportion keeps decreasing, and the number of classifications may also decrease. The value of the initial distance threshold parameter can therefore be increased step by step to determine the best neighborhood distance threshold: starting from the initial distance threshold parameter, the number of classifications and the abnormal proportion are recalculated each time one step length is added, and the increase stops as soon as the number of classifications drops. At that point, the distance adjustment threshold obtained in the previous adjustment can be determined to be the optimal neighborhood distance threshold.

In an embodiment, the step length can be set according to empirical values or according to the candidate distance thresholds. For example, when the step length is set according to the candidate distance thresholds, it can be set to one-tenth of the difference between the largest and the smallest of the candidate distance thresholds, which is not specifically limited in this embodiment.
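The stepwise adjustment can be sketched as a short loop. To keep the sketch self-contained, the clustering itself is passed in as a callable `cluster_fn` that returns the number of classifications and the abnormal proportion for a given threshold; that callable, the `max_steps` safety limit, and the comparison against the count from the previous adjustment are assumptions made for illustration.

```python
from typing import Callable, Tuple


def tune_distance_threshold(cluster_fn: Callable[[float], Tuple[int, float]],
                            initial: float, step: float,
                            max_steps: int = 100) -> float:
    """Increase the threshold until the number of classifications drops."""
    best = initial
    prev_classes, _ = cluster_fn(initial)
    for _ in range(max_steps):  # assumption: guard against endless stepping
        trial = best + step
        n_classes, _ = cluster_fn(trial)
        if n_classes < prev_classes:
            break  # the previous adjustment is kept as the result
        best, prev_classes = trial, n_classes
    return best


def toy_clustering(eps: float) -> Tuple[int, float]:
    """Stand-in for DBSCAN: classes merge once eps reaches 1.35."""
    return (3, 0.2) if eps < 1.35 else (2, 0.1)


print(round(tune_distance_threshold(toy_clustering, initial=1.0, step=0.1), 2))  # 1.3
```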

In order to better explain the heuristic algorithm provided in the foregoing embodiment, the following describes in detail with a specific example:

In a specific example, as shown in FIG. 16, the heuristic algorithm specifically includes the following steps:

Step S901, threshold setting.

In this step, initial thresholds such as the maximum distance threshold, minimum length threshold, slope threshold, and slope difference threshold of the neighborhood are respectively set.

Step S902, the sequence similarity matrix is calculated.

In this step, the similarity between each pair of data sequences is calculated by the distance function to form the similarity matrix data.

In step S903, the k-dist distance is calculated and sorted.

In this step, the k-dist distance of each k-dist point is calculated based on the similarity matrix data and sorted from small to large.

Step S904, filtering according to the maximum distance threshold.

In this step, the k-dist points whose k-dist distance is 0 and the k-dist points whose k-dist distance exceeds the maximum distance threshold of the neighborhood are excluded.

In step S905, the k-dist sequence values are taken in order.

In step S906, it is judged whether the left slope and the right slope are both smaller than the slope threshold.

In this step, the slopes between each k-dist point and its two adjacent points (the preceding one and the following one) are calculated. If the slope between the current k-dist point and the preceding adjacent point (which can be defined as the left slope) and the slope between the current k-dist point and the following adjacent point (which can be defined as the right slope) are both less than the preset slope threshold, step S907 is executed; otherwise, step S905 is executed.

In step S907, it is determined whether the difference between the left slope and the right slope is less than the slope difference threshold.

In this step, when the difference between the left slope and the right slope is less than the slope difference threshold, step S908 is executed, otherwise, step S905 is executed.

In step S908, the current k-dist point is determined as a candidate distance threshold, and steps S905 to S907 are repeated; when all candidate distance thresholds have been obtained, step S909 is executed.

Step S909: After sorting the candidate thresholds in descending order, the largest candidate threshold is taken as the initial distance threshold parameter.

In step S910, the clustering algorithm is executed to obtain the number of classifications and the abnormal ratio.

In step S911, a step size is added on the basis of the initial distance threshold parameter, and the clustering algorithm is executed.

In step S912, it is judged whether the number of classifications has decreased, if so, step S913 is executed, otherwise, step S911 is executed.

Step S913: Determine the previous distance threshold as the best distance threshold.
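Steps S902 to S909 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, adjacent points of the sorted k-dist sequence are assumed to be one unit apart (so a slope is simply the difference between neighboring k-dist values), and the slope difference is taken as an absolute value.

```python
import math
from typing import List, Sequence


def k_dist_sequence(series: Sequence[Sequence[float]], k: int) -> List[float]:
    """S902-S903: distance from each sequence to its k-th nearest neighbor, sorted ascending."""
    if k < 1 or k >= len(series):
        raise ValueError("k must satisfy 1 <= k < number of sequences")
    k_dist = []
    for i, a in enumerate(series):
        # Euclidean distances from sequence i to every other sequence
        dists = sorted(math.dist(a, b) for j, b in enumerate(series) if j != i)
        k_dist.append(dists[k - 1])
    return sorted(k_dist)


def candidate_thresholds(
    k_dist: Sequence[float],
    max_dist: float,
    slope_thr: float,
    slope_diff_thr: float,
) -> List[float]:
    """S904-S908: keep points whose left and right slopes are both small and similar."""
    # S904: drop zero distances and distances above the neighborhood maximum
    pts = [d for d in k_dist if 0 < d <= max_dist]
    candidates = []
    for i in range(1, len(pts) - 1):
        left = pts[i] - pts[i - 1]    # assumption: unit spacing between sorted points
        right = pts[i + 1] - pts[i]
        if left < slope_thr and right < slope_thr and abs(left - right) < slope_diff_thr:
            candidates.append(pts[i])
    return candidates


def initial_threshold(candidates: Sequence[float]) -> float:
    """S909: the largest candidate is taken as the initial distance threshold parameter."""
    return max(candidates)
```

The initial threshold returned here would then be passed to the step-by-step adjustment of steps S910 to S913.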

In addition, in an embodiment, step S130 may include but is not limited to the following steps:

Step S131: In each target data category, calculate the average of the sum of the distances between each data sequence to be tested and the remaining data sequences to be tested, and determine the data sequence to be tested that corresponds to the smallest of these averages as the target data sequence.

In one embodiment, after the multiple data sequences to be tested are clustered to obtain the target data categories, a core data sequence representing the corresponding target data category may be determined for each target data category, that is, a target data sequence is determined from each target data category. When determining a target data sequence from a target data category, the average of the sum of the distances between each data sequence to be tested in the target data category and the other data sequences to be tested can be calculated first, and the data sequence to be tested with the smallest average is then selected as the core data sequence representing the target data category, that is, as the target data sequence of the target data category.

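In an embodiment, the target data sequence can be determined by the following formula:

$$x^{*} = \arg\min_{x_i} \frac{1}{n-1} \sum_{j \neq i} \mathrm{euclidean}(x_i, x_j)$$

where $x_i$ and $x_j$ represent different data sequences to be tested in the same target data category, $n$ is the number of data sequences to be tested in that category, and euclidean() represents the Euclidean distance.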
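The selection of the core data sequence can be illustrated with the following minimal Python sketch, which picks, within one target data category, the sequence whose average Euclidean distance to the other sequences is smallest. The function name `core_sequence_index` is hypothetical, and the sequences are assumed to be of equal length.

```python
import math
from typing import Sequence


def core_sequence_index(sequences: Sequence[Sequence[float]]) -> int:
    """Return the index of the sequence with the smallest average distance to the others."""
    n = len(sequences)
    if n == 0:
        raise ValueError("the category contains no sequences")
    if n == 1:
        return 0  # a single sequence is trivially its own core sequence
    best_index, best_avg = 0, math.inf
    for i, a in enumerate(sequences):
        # average Euclidean distance from sequence i to every other sequence
        avg = sum(math.dist(a, b) for j, b in enumerate(sequences) if j != i) / (n - 1)
        if avg < best_avg:
            best_index, best_avg = i, avg
    return best_index
```

For example, for the three sequences `[0, 0]`, `[1, 0]` and `[10, 0]`, the average distances are 5.5, 5.0 and 9.5, so the second sequence would be selected.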

In order to better explain the data processing methods provided in the above embodiments, the following detailed descriptions will be made with specific examples:

In a specific example, as shown in FIG. 17, FIG. 17 is a main flow diagram of the data processing method provided in this example. Based on the main flow diagram shown in FIG. 17, the data processing method specifically includes the following steps:

First, preprocess the data sequence to be tested collected in the network, mainly including the filling of missing values and data standardization, to form a time series data set of equal length;

Then, use the moving average method to extract the baseline of each data sequence to be tested to form a baseline data set;

Then, use the DBSCAN algorithm and Euclidean distance measurement function to cluster on the baseline data set, and use the heuristic algorithm to automatically adjust the threshold parameters;

Then, for each category in the clustering results, a core sequence is determined through the distance measurement;

Next, determine the abnormal data segment template for a single core sequence;

Then, automatically complete the generation of the data search space for each sequence;

Then, for each sequence, the corresponding abnormal data segment template is automatically obtained from the core sequence of the category to which the sequence belongs, and the search for similar abnormal segments is completed in the data search space of the sequence to obtain the N similar abnormal segments with the highest similarity, and these N similar abnormal segments are marked as abnormal;

Finally, the labeling results of the similar abnormal segments of each sequence are manually checked and partially corrected to form the final abnormal label data.
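The preprocessing and baseline extraction at the start of the above flow can be sketched as follows. This is an illustration only: the embodiment does not fix a filling strategy or a window type, so forward filling, z-score standardization and a trailing moving-average window are assumptions, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Forward-fill missing values (None); leading gaps take the first observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("the sequence contains no observed values")
    last = observed[0]
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score standardization; a constant sequence maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; the window is truncated at the start so the length is kept."""
    if window < 1:
        raise ValueError("window must be at least 1")
    baseline = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        baseline.append(sum(chunk) / len(chunk))
    return baseline
```

A typical use would chain the three functions, for example `moving_average(standardize(fill_missing(raw)), window)`, to obtain one entry of the baseline data set.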

In addition, another embodiment of the present invention also provides a device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor.

The processor and the memory can be connected by a bus or in other ways.

As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory includes a memory remotely arranged with respect to the processor, and these remote memories may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

It should be noted that the device in this embodiment may include the system architecture platform in the embodiment shown in FIG. 1. The device in this embodiment and the system architecture platform in the embodiment shown in FIG. 1 belong to the same inventive concept, so the two have the same implementation principle and technical effect, which will not be detailed here.

The non-transitory software programs and instructions required to implement the data processing method of the foregoing embodiments are stored in the memory. When they are executed by the processor, the data processing method of the foregoing embodiments is performed, for example, method steps S100 to S500 in FIG. 2, method steps S310 to S330 in FIG. 3, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, method steps S110 to S130 in FIG. 7, method steps S600 to S700 in FIG. 8, method steps S610 to S630 in FIG. 9, method steps S611 to S612 in FIG. 10, method steps S710 to S730 in FIG. 11, method steps S732 to S734 in FIG. 12, method steps S121 to S123 in FIG. 13, method steps S1211 to S1212 in FIG. 14, method steps S810 to S840 in FIG. 15, and method steps S901 to S913 in FIG. 16 described above.

The device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, another embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or a controller, for example, by a processor in the foregoing device embodiment, the processor is caused to execute the data processing method in the foregoing embodiments, for example, method steps S100 to S500 in FIG. 2, method steps S310 to S330 in FIG. 3, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, method steps S110 to S130 in FIG. 7, method steps S600 to S700 in FIG. 8, method steps S610 to S630 in FIG. 9, method steps S611 to S612 in FIG. 10, method steps S710 to S730 in FIG. 11, method steps S732 to S734 in FIG. 12, method steps S121 to S123 in FIG. 13, method steps S1211 to S1212 in FIG. 14, method steps S810 to S840 in FIG. 15, and method steps S901 to S913 in FIG. 16 described above.

The embodiments of the present invention include: acquiring a target data sequence; acquiring a first abnormal data segment in the target data sequence; acquiring a first data search space in the target data sequence; acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space; and labeling the second abnormal data segment. According to the solution provided by the embodiments of the present invention, by acquiring the first abnormal data segment and the first data search space in the target data sequence, the first abnormal data segment can be used as an abnormal data segment template, so that the corresponding second abnormal data segment can be acquired from the first data search space according to the template and labeled, thereby labeling the other abnormal data segments in the target data sequence. Therefore, compared with traditional manual labeling of abnormal data segments, the solution provided by the embodiments of the present invention can improve the efficiency of labeling abnormal data, thereby saving human resources and time.
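The template-based search summarized above can be illustrated with the following minimal Python sketch. It makes several assumptions that are not fixed by the embodiments: the search space is represented as the positions whose deviation from the baseline exceeds a threshold, each position is treated as the start of a candidate window, similarity is measured with the Euclidean distance (smaller means more similar), and overlapping windows are not merged. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def search_positions(
    series: Sequence[float], baseline: Sequence[float], deviation_thr: float
) -> List[int]:
    """Positions whose deviation from the baseline exceeds the threshold (the search space)."""
    return [i for i, (v, b) in enumerate(zip(series, baseline)) if abs(v - b) > deviation_thr]


def find_similar_segments(
    series: Sequence[float],
    template: Sequence[float],
    positions: Sequence[int],
    dist_thr: float,
    top_n: int,
) -> List[Tuple[int, int, float]]:
    """Return up to top_n (start, end, distance) windows closest to the template."""
    m = len(template)
    scored = []
    for start in positions:
        window = list(series[start:start + m])
        if len(window) < m:
            continue  # the window would run past the end of the sequence
        d = math.dist(window, list(template))
        if d < dist_thr:
            scored.append((d, start))
    scored.sort()  # smallest distance (most similar) first
    return [(start, start + m, d) for d, start in scored[:top_n]]
```

The returned index ranges would then be labeled as second abnormal data segments and passed on for manual checking.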

A person of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components can be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on a computer-readable medium, and the computer-readable medium may include a computer storage medium (or a non-transitory medium) and a communication medium (or a transitory medium). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other storage technologies, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, a communication medium usually contains computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium.

The above is a detailed description of some embodiments of the present invention, but the present invention is not limited to the above-mentioned embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the technical solution of the present invention. These equivalent modifications or replacements are all included in the scope defined by the claims of the present invention.

Claims

A data processing method, comprising:

Acquiring a target data sequence;

Acquiring a first abnormal data segment in the target data sequence;

Acquiring a first data search space in the target data sequence;

Acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space;

Labeling the second abnormal data segment.
The data processing method according to claim 1, wherein said obtaining the first data search space in the target data sequence comprises:

Acquiring the first abnormal characteristic value of the target data sequence;

Determine the first data position corresponding to the first abnormal characteristic value in the target data sequence according to the first abnormal characteristic value;

Obtaining the first data search space according to the first data location.
The data processing method according to claim 2, wherein said obtaining the first abnormal characteristic value of the target data sequence comprises:

Acquiring first baseline prediction data of the target data sequence;

The first abnormal characteristic value is obtained according to the deviation value of the first baseline prediction data and the data in the target data sequence.
The data processing method according to claim 1, wherein the acquiring a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment comprises :

Determine a third data segment in the first data search space;

Performing similarity calculation on the first abnormal data segment and the third data segment to obtain a first similarity metric value corresponding to the third data segment;

It is determined that the corresponding third data segment is a second abnormal data segment according to the first similarity metric value.
The data processing method according to claim 4, wherein the determining that the corresponding third data segment is a second abnormal data segment according to the first similarity metric value comprises:

When the first similarity metric value is less than a preset threshold, it is determined that the third data segment corresponding to the first similarity metric value is a second abnormal data segment.
The data processing method according to claim 4, wherein, when the number of the third data segments is more than two, the determining that the corresponding third data segment is a second abnormal data segment according to the first similarity metric value comprises:

Acquiring the first similarity metric value whose value is less than a preset threshold;

Sorting the first similarity metric values whose values are less than a preset threshold value from small to large to adjust the sorting of the corresponding third data segment;

It is determined that the first N third data segments are second abnormal data segments, where N is greater than or equal to 1.
The data processing method according to any one of claims 1 to 6, wherein said acquiring a target data sequence comprises:

Obtain multiple data sequences to be tested;

Performing clustering processing on a plurality of the data sequences to be tested to obtain a target data category;

A target data sequence is determined from each target data category.
The data processing method according to claim 7, further comprising:

Acquiring a second data search space in the remaining data sequences to be tested in each of the target data classes;

Acquiring, according to the first abnormal data segment in the target data sequence, a second abnormal data segment in the second data search space of each of the remaining data sequences to be tested.
The data processing method according to claim 8, wherein said obtaining the second data search space from the remaining data sequences to be tested in each of the target data classes respectively comprises:

Acquire the second abnormal characteristic value of the remaining data sequence to be tested in each of the target data types;

Respectively determine the second data position corresponding to the second abnormal characteristic value in the remaining data sequence to be tested according to the second abnormal characteristic value;

The second data search space of the remaining data sequence to be tested is obtained respectively according to the second data position.
The data processing method according to claim 9, wherein said obtaining the second abnormal characteristic value of the remaining data sequence to be tested in each of the target data types respectively comprises:

Acquiring the second baseline prediction data of the remaining data sequence to be tested in each of the target data types;

The second abnormal characteristic value of the remaining data sequence to be tested is obtained according to the deviation value of the second baseline prediction data and the data in the remaining data sequence to be tested, respectively.
The data processing method according to claim 8, wherein the acquiring, according to the first abnormal data segment in the target data sequence, the second abnormal data segment in the second data search space in the remaining data sequences to be tested comprises:

Respectively determining a fourth data segment in the second data search space in the remaining data sequence to be tested;

Performing similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in the remaining data sequences to be tested, respectively, to obtain a second similarity metric value corresponding to the fourth data segment;

The corresponding fourth data segment in the remaining data sequence to be tested is respectively determined according to the second similarity metric value as the second abnormal data segment in the remaining data sequence to be tested.
The data processing method according to claim 11, wherein the respectively determining, according to the second similarity metric value, that the corresponding fourth data segment in the remaining data sequences to be tested is the second abnormal data segment in the remaining data sequences to be tested comprises:

When the second similarity metric value is less than a preset threshold, it is determined that the corresponding fourth data segment in the remaining data sequence to be tested is the second abnormal data segment in the remaining data sequence to be tested.
The data processing method according to claim 11, wherein, when the number of the fourth data segments is more than two, the respectively determining, according to the second similarity metric value, that the corresponding fourth data segment in the remaining data sequences to be tested is the second abnormal data segment in the remaining data sequences to be tested comprises:

Respectively acquiring the second similarity metric values that correspond to the remaining data sequences to be tested and whose values are less than a preset threshold;

Sorting the second similarity metric values whose values are less than a preset threshold value from small to large, so as to respectively adjust the sorting of the corresponding fourth data segments in the remaining data sequences to be tested;

It is determined that the first N fourth data segments are the second abnormal data segments in the remaining data sequences to be tested, where N is greater than or equal to 1.
The data processing method according to claim 7, wherein the clustering of the plurality of data sequences to be tested to obtain the target data category comprises:

Perform data preprocessing on the plurality of data sequences to be tested, respectively, to obtain a plurality of first preprocessed data sequences;

Baseline extraction processing is performed on the plurality of first preprocessed data sequences respectively to obtain a plurality of second preprocessed data sequences;

Clustering a plurality of the second pre-processing data sequences according to the similarity to obtain the target data category.
The data processing method according to claim 14, wherein said performing data preprocessing on a plurality of said data sequences to be tested respectively to obtain a plurality of first preprocessing data sequences comprises:

Performing missing value filling processing on the plurality of data sequences to be tested, respectively, to obtain multiple filling data sequences;

Data standardization processing is performed on the multiple filling data sequences to obtain multiple first preprocessing data sequences.
The data processing method according to claim 14, wherein the clustering the plurality of second pre-processed data sequences according to similarity comprises:

The DBSCAN algorithm is used to cluster a plurality of the second preprocessed data sequences according to similarity; wherein the parameters of the DBSCAN algorithm include a distance function, a neighborhood number threshold, and a neighborhood distance threshold, and the result of the DBSCAN algorithm includes the number of categories and the proportion of abnormalities.
The data processing method according to claim 16, wherein the neighborhood distance threshold is obtained by a heuristic algorithm, wherein the heuristic algorithm comprises the following steps:

Calculating the similarity between two of the plurality of second pre-processed data sequences by using the distance function to obtain similarity matrix data;

Calculate the k-dist distance based on the similarity matrix data to obtain the k-dist sequence;

Obtaining an initial distance threshold parameter based on the k-dist sequence;

The initial distance threshold parameter is adjusted to obtain the neighborhood distance threshold.
The data processing method according to claim 17, wherein said obtaining the initial distance threshold parameter based on the k-dist sequence comprises:

Calculating the slopes between each k-dist point in the k-dist sequence and its two adjacent points before and after it, and when the slopes with respect to both adjacent points are less than a preset slope threshold and the difference between the two slopes is less than a preset slope difference threshold, determining the current k-dist point as a candidate distance threshold;

It is determined that the one with the largest value among the candidate distance thresholds is the initial distance threshold parameter.
The data processing method according to claim 17 or 18, wherein the adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold comprises:

Get the step length;

The initial distance threshold parameter is adjusted according to the step length to obtain a distance adjustment threshold, and when the number of classifications decreases, it is determined that the distance adjustment threshold obtained by adjustment in the previous step is the neighborhood distance threshold.
The data processing method according to claim 7, wherein said determining a target data sequence from each target data category comprises:

In each of the target data categories, calculating the average of the sum of the distances between each of the data sequences to be tested and the remaining data sequences to be tested, and determining the data sequence to be tested that corresponds to the smallest of these averages as the target data sequence.
A device comprising: a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the computer program, implements the data processing method according to any one of claims 1 to 20.
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the data processing method according to any one of claims 1 to 20.