WO2021238455A1 - Data processing method and device, and computer-readable storage medium - Google Patents

Data processing method and device, and computer-readable storage medium

Info

Publication number
WO2021238455A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
abnormal
tested
sequence
segment
Prior art date
Application number
PCT/CN2021/086644
Other languages
French (fr)
Chinese (zh)
Inventor
蒋勇
彭鑫
叶德忠
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2021238455A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the embodiments of the present invention relate to, but are not limited to, the field of information processing technology, and in particular, to a data processing method, device, and computer-readable storage medium.
  • embodiments of the present invention provide a data processing method, device, and computer-readable storage medium, which at least solve the above technical problems to a certain extent.
  • An embodiment of the present invention provides a data processing method, including: obtaining a target data sequence; obtaining a first abnormal data segment in the target data sequence; obtaining a first data search space in the target data sequence; acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space; and marking the second abnormal data segment.
  • An embodiment of the present invention also provides a device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the data processing method described above is implemented.
  • an embodiment of the present invention also provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the above-mentioned data processing method.
  • FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention.
  • FIG. 3 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 4 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 5 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 6 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 7 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 8 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 9 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 10 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 11 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 12 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 13 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 14 is a flowchart of a data processing method provided by another embodiment of the present invention.
  • FIG. 15 is a flowchart of a heuristic algorithm provided by an embodiment of the present invention.
  • FIG. 16 is a flowchart of a heuristic algorithm provided by another embodiment of the present invention.
  • FIG. 17 is a main flow diagram of a data processing method provided by another embodiment of the present invention.
  • Most time series indicator data have certain periodic characteristics. Many recurring abnormal data tend to appear at the same position in different cycles, and the shapes of these abnormal data show a certain similarity.
  • Most abnormal data can be attributed to a few types of abnormalities with similar characteristics, and the number of truly unique abnormal data is relatively small.
  • Relatively similar abnormal data also appear across different time series indicator data of a similar kind. For example, a network element has abnormally high central processing unit (CPU) utilization during a certain period of time; this situation may also appear in the CPU utilization time series data of another network element that undertakes similar services.
  • the embodiments of the present invention provide a data processing method, device, and computer-readable storage medium.
  • The first abnormal data segment and the first data search space are obtained in the target data sequence, so that the first abnormal data segment can be used as an abnormal data segment template. The corresponding second abnormal data segment can then be obtained and marked in the first data search space according to that template, which achieves the purpose of labeling the other abnormal data segments in the target data sequence.
  • FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method provided by an embodiment of the present invention.
  • the system architecture platform includes a memory 110 and a processor 120, where the memory 110 and the processor 120 may be connected by a bus or in other ways.
  • the connection by a bus is taken as an example.
  • the memory 110 can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory 110 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • The memory 110 may include memory located remotely from the processor 120, and such remote memory may be connected to the system architecture platform through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • The system architecture platform can be applied to various network controllers or network managers, which is not specifically limited in this embodiment.
  • A network controller or network manager with the system architecture platform can be applied to various network systems, for example 3G communication network systems, LTE communication network systems, 5G communication network systems, and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
  • The system architecture platform shown in FIG. 1 does not constitute a limitation to the embodiment of the present invention; it may include more or fewer components than those shown in the figure, a combination of certain components, or a different arrangement of components.
  • the processor 120 can call the data processing program stored in the memory 110 to execute the data processing method.
  • FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention.
  • the data processing method includes but is not limited to step S100, step S200, step S300, step S400, and step S500.
  • Step S100 Obtain the target data sequence.
  • the target data sequence may be time-series indicator data or other sequence data.
  • the other sequence data may be non-time-series indicator data such as business type sequence data or business quantity sequence data.
  • the target data sequence can be automatically obtained by the device with the above-mentioned system architecture platform in the network, or it can be obtained by entering into the device with the above-mentioned system architecture platform through manual operation, which is not specifically limited in this embodiment.
  • Step S200 Acquire the first abnormal data segment in the target data sequence.
  • the first abnormal data segment is a data segment with abnormal data in the target data sequence.
  • The first abnormal data segment in the target data sequence can be manually determined and selected, and then entered into the device with the above-mentioned system architecture platform so that the device can obtain it; alternatively, it can be stored in the memory of the device so that the device can obtain it from the memory.
  • By acquiring the first abnormal data segment, the necessary basis is provided for labeling the remaining abnormal data segments in the target data sequence in the subsequent steps.
  • Abnormal data often occurs at one or more consecutive time points. A time point at which abnormal data occurs is called an abnormal time point, and the data segment corresponding to a set of abnormal time points is called an abnormal data segment. An abnormal data segment may last a long time (that is, it may contain many abnormal time points). Therefore, an abnormal data segment needs at least the following characteristics: a starting time point; an ending time point; at least 3 time points; and no overlapping time points between abnormal data segments.
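  • The characteristics listed above can be captured in a small data structure. The following is a minimal sketch, not the patent's implementation: the names `AbnormalSegment`, `MIN_POINTS`, and `check_no_overlap` are hypothetical, and time points are assumed to be integer indices into the target data sequence with an inclusive end point.

```python
from dataclasses import dataclass
from typing import List

MIN_POINTS = 3  # the text requires at least 3 time points per segment


@dataclass(frozen=True)
class AbnormalSegment:
    start: int  # index of the starting time point (inclusive, assumed)
    end: int    # index of the ending time point (inclusive, assumed)

    def __post_init__(self) -> None:
        # Reject segments that contain fewer than MIN_POINTS time points.
        if self.end - self.start + 1 < MIN_POINTS:
            raise ValueError("an abnormal segment needs at least 3 time points")

    def overlaps(self, other: "AbnormalSegment") -> bool:
        # Two inclusive ranges overlap when each starts before the other ends.
        return self.start <= other.end and other.start <= self.end


def check_no_overlap(segments: List[AbnormalSegment]) -> bool:
    """Return True when no two segments share a time point."""
    ordered = sorted(segments, key=lambda s: s.start)
    # After sorting by start, checking neighbouring pairs is sufficient.
    return all(not a.overlaps(b) for a, b in zip(ordered, ordered[1:]))
```

  • For example, segments covering indices 0 to 4 and 5 to 9 pass the check, while segments covering 0 to 5 and 5 to 9 share time point 5 and fail it.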
  • Step S300 Acquire a first data search space in the target data sequence.
  • the data search space is a part of candidate abnormal data extracted from the target data sequence by a machine learning method.
  • By obtaining the data search space, most of the normal data can be filtered out, which prevents similar data segments that appear in the normal data from being misjudged as similar abnormal data, thereby improving the search accuracy for similar abnormal data. The search range for similar abnormal data is also narrowed, thereby improving the search efficiency.
  • the first abnormal data segment may be a data segment outside the first data search space, or may be a data segment in the first data search space, which is not specifically limited in this embodiment.
  • When the first abnormal data segment is obtained in the first data search space, its acquisition can be more accurate and effective, because a preliminary extraction of abnormal data has already been carried out when the first data search space was obtained.
  • Step S400 Acquire a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment.
  • The first abnormal data segment can be used as an abnormal data segment template, and the data segments in the first data search space can be compared with it to find the data segment that is the same as or similar to the first abnormal data segment; that data segment is the second abnormal data segment. Therefore, by taking the first abnormal data segment as the template and obtaining the corresponding second abnormal data segment in the first data search space, the abnormal data in the target data sequence can be found. For time series index data with a large amount of abnormal data but not many abnormal types, this embodiment finds abnormal data more efficiently than the traditional method of searching manually, thereby saving human resources and time.
  • In one approach, all the data in the first data search space can be regarded as one data segment, and the dynamic time warping (Dynamic Time Warping, DTW) algorithm can be used to calculate its similarity to the first abnormal data segment, so that the second abnormal data segment corresponding to the first abnormal data segment in the first data search space can be determined.
  • In another approach, the data in the first data search space can be divided into multiple data segments with the same length as the first abnormal data segment, and methods such as Euclidean distance or the Pearson correlation coefficient can be used to calculate the similarity between the first abnormal data segment and each of these data segments, so as to determine the second abnormal data segment corresponding to the first abnormal data segment in the first data search space.
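  • The equal-length comparison described above can be sketched as follows. This is an illustrative example only: the function names `euclidean`, `pearson`, and `split_equal` are hypothetical, the windows are assumed to be non-overlapping, and dropping a trailing remainder shorter than the template is an assumption not stated in the text.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length segments (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient (closer to 1 = more similar in shape)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant segment has no defined correlation; 0.0 is an assumed fallback
    return cov / math.sqrt(var_a * var_b)


def split_equal(space: Sequence[float], length: int) -> List[Sequence[float]]:
    """Cut the search space into non-overlapping windows of the template length."""
    return [space[i:i + length] for i in range(0, len(space) - length + 1, length)]


template = [0.0, 5.0, 0.0]                    # first abnormal data segment (made-up values)
search_space = [0.0, 5.0, 0.0, 1.0, 1.0, 1.0]  # made-up search space
windows = split_equal(search_space, len(template))
distances = [euclidean(template, w) for w in windows]
```

  • Note that the two measures point in opposite directions: a small Euclidean distance and a large Pearson coefficient both indicate similarity, so any threshold has to be chosen per measure.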
  • Step S500 Mark the second abnormal data segment.
  • When the second abnormal data segment corresponding to the first abnormal data segment is found in the first data search space, it can be marked. This facilitates the formation of training data sets with abnormal labels, which can be used in machine learning technologies such as deep learning.
  • With the data processing method of steps S100, S200, S300, S400, and S500, the first abnormal data segment and the first data search space are acquired in the target data sequence, so that the first abnormal data segment can be used as an abnormal data segment template and the corresponding second abnormal data segment can be obtained and marked in the first data search space according to that template, which achieves the purpose of labeling the other abnormal data segments in the target data sequence. Therefore, for time series index data with a large amount of abnormal data but not many types of abnormalities, the data processing method of this embodiment labels abnormal data more efficiently than traditional manual labeling of abnormal data segments, saving human resources and time.
  • step S300 includes but is not limited to the following steps:
  • Step S310 Obtain the first abnormal characteristic value of the target data sequence
  • Step S320 Determine the first data position corresponding to the first abnormal characteristic value in the target data sequence according to the first abnormal characteristic value
  • Step S330 Acquire the first data search space according to the first data location.
  • abnormal data is data that deviates from most of the data in the data set.
  • the first abnormal characteristic value in this embodiment refers to the deviation value between the abnormal data and the normal data.
  • The first data position corresponding to the first abnormal characteristic value in the target data sequence may be determined according to the first abnormal characteristic value; that is, the location of the abnormal data in the target data sequence is determined according to the first abnormal characteristic value. For example, when a data point deviates from most of the data by an amount greater than or equal to the first abnormal characteristic value, the location of that data point can be determined to be the location of abnormal data.
  • The first data position may include a start abnormal position, a middle abnormal position, and an end abnormal position. After these positions are determined, the first data search space can be obtained.
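  • One way to read steps S320 and S330 is to flag every point whose deviation reaches a cutoff and then group consecutive flagged points into runs with a start, middle, and end position. The sketch below follows that reading; the names `abnormal_positions`, `group_runs`, and `threshold` are hypothetical, and the deviation values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def abnormal_positions(deviations: Sequence[float], threshold: float) -> List[int]:
    """Indices whose deviation is greater than or equal to the cutoff."""
    return [i for i, d in enumerate(deviations) if d >= threshold]


def group_runs(positions: Sequence[int]) -> List[Tuple[int, int]]:
    """Merge consecutive indices into (start, end) runs, both ends inclusive."""
    runs: List[Tuple[int, int]] = []
    for p in positions:
        if runs and p == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], p)  # extend the current run
        else:
            runs.append((p, p))          # start a new run
    return runs


deviations = [0.1, 0.2, 3.0, 4.0, 3.5, 0.1, 0.2, 5.0, 0.1]
positions = abnormal_positions(deviations, threshold=1.0)
search_space_runs = group_runs(positions)
```

  • Here the run from index 2 to index 4 has a start, a middle, and an end position, while index 7 forms a single-point run; the union of such runs is a candidate search space under this reading.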
  • Different outlier detection algorithms can be used to determine the first data position, for example the Local Outlier Factor (LOF) algorithm, the DBSCAN algorithm, or the isolation forest (Isolation Forest, iForest) algorithm.
  • Take the isolation forest algorithm as an example. First, an iTree (isolation tree) is built based on the target data sequence, with the data in the target data sequence used as the tree's sample data. Then the first abnormal feature value is used to perform a binary division of the sample data, separating the sample data that meets the first abnormal feature value from the sample data that does not, which forms two data sets. The process is repeated on these two data sets until the data can no longer be divided or the maximum height of the tree is reached. In this way, the first data position corresponding to the first abnormal characteristic value in the target data sequence is obtained, so that the first data search space can be determined according to the first data position.
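  • For orientation, the following is a simplified sketch of the general isolation forest idea on one-dimensional values: points that are separated from the rest after few random splits are likely outliers. It is not the patent's procedure. It uses random split points as in the standard algorithm, omits subsampling and score normalization, and the names `mean_isolation_depth`, `n_trees`, `max_depth`, and `seed` are hypothetical.

```python
import random
from typing import List, Sequence


def _isolate(indices: List[int], values: Sequence[float], depth: int,
             max_depth: int, rng: random.Random, depth_sum: List[int]) -> None:
    """Recursively split the samples and record the depth at which each stops."""
    if not indices:
        return
    lo = min(values[i] for i in indices)
    hi = max(values[i] for i in indices)
    # Stop when the sample is isolated, cannot be divided, or the tree is full.
    if len(indices) == 1 or depth >= max_depth or lo == hi:
        for i in indices:
            depth_sum[i] += depth
        return
    split = rng.uniform(lo, hi)  # random split point between min and max
    left = [i for i in indices if values[i] < split]
    right = [i for i in indices if values[i] >= split]
    _isolate(left, values, depth + 1, max_depth, rng, depth_sum)
    _isolate(right, values, depth + 1, max_depth, rng, depth_sum)


def mean_isolation_depth(values: Sequence[float], n_trees: int = 100,
                         max_depth: int = 8, seed: int = 0) -> List[float]:
    """Average isolation depth per sample; smaller depth suggests an outlier."""
    rng = random.Random(seed)
    depth_sum = [0] * len(values)
    for _ in range(n_trees):
        _isolate(list(range(len(values))), values, 0, max_depth, rng, depth_sum)
    return [s / n_trees for s in depth_sum]


# Made-up deviation values with one obvious outlier at index 6.
values = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 5.0, 0.14, 0.16, 0.13]
depths = mean_isolation_depth(values)
most_isolated = depths.index(min(depths))
```

  • In practice a maintained implementation such as the one in scikit-learn would normally be preferred over a hand-written tree.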
  • step S310 includes but is not limited to the following steps:
  • Step S311 Obtain the first baseline prediction data of the target data sequence
  • Step S312 Obtain a first abnormal characteristic value according to the deviation value between the first baseline prediction data and the data in the target data sequence.
  • The baseline prediction data of the normal data in the target data sequence may be obtained first, and the first abnormal characteristic value is then obtained from the deviation value (that is, the absolute difference) between the data in the target data sequence and the baseline prediction data.
  • different baseline prediction methods can be used to obtain baseline prediction data of normal data in the target data sequence, which is not specifically limited in this embodiment.
  • For example, the baseline prediction method may be a time series baseline forecasting method such as a difference method, a moving average method, a weighted moving average method, an exponentially weighted moving average method, an autoregressive integrated moving average (ARIMA) method, or a triple exponential smoothing method. Regression methods such as random forest and XGBoost (eXtreme Gradient Boosting) can also be used.
  • A variety of first abnormal feature values can be obtained by using a variety of baseline prediction methods, and the first data search space can be obtained by combining these different first abnormal feature values in the corresponding steps, which can improve the accuracy and generalization ability of acquiring the first data search space.
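  • The first abnormal characteristic value can be obtained by the following formula, which expresses the absolute difference described above:
  • $R = |X_i - P_i|$
  • where $R$ is the first abnormal characteristic value, $X_i$ is the data in the target data sequence, and $P_i$ is the first baseline prediction data of the target data sequence.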
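  • Steps S311 and S312 can be illustrated with the simplest of the baseline methods named in the text, a moving average. The sketch below is an assumption-laden example rather than the patent's implementation: the names `moving_average_baseline`, `deviations`, and `window` are hypothetical, the baseline at each point is the mean of the preceding `window` points, and the first point is used as its own baseline.

```python
from typing import List, Sequence


def moving_average_baseline(series: Sequence[float], window: int = 3) -> List[float]:
    """Baseline P_i: mean of up to `window` preceding points (assumed definition)."""
    baseline = []
    for i in range(len(series)):
        history = series[max(0, i - window):i]
        # With no history yet, fall back to the point itself (zero deviation).
        baseline.append(sum(history) / len(history) if history else series[i])
    return baseline


def deviations(series: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Absolute difference |X_i - P_i| between each data point and its baseline."""
    return [abs(x - p) for x, p in zip(series, baseline)]


series = [10, 10, 10, 40, 10]          # made-up indicator values with one spike
baseline = moving_average_baseline(series, window=3)
r = deviations(series, baseline)
```

  • The spike at index 3 yields the largest deviation. It also pulls the following baseline value upward, which is one reason the text suggests combining several baseline prediction methods.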
  • step S400 includes but is not limited to the following steps:
  • Step S410 Determine a third data segment in the first data search space
  • Step S420 Perform similarity calculation on the first abnormal data segment and the third data segment to obtain a first similarity metric value corresponding to the third data segment;
  • Step S430 Determine the corresponding third data segment as the second abnormal data segment according to the first similarity metric value.
  • After the third data segment is determined in the first data search space, a similarity measurement algorithm is used to calculate the similarity between the first abnormal data segment and the third data segment, yielding the first similarity metric value corresponding to the third data segment. When the first similarity metric value indicates that the first abnormal data segment is similar to the third data segment, the corresponding third data segment can be determined to be the second abnormal data segment (that is, one of the remaining abnormal data segments in the first data search space).
  • This embodiment can therefore improve the efficiency of labeling abnormal data, saving human resources and time.
  • the number of the third data segment may be one or multiple, which is not specifically limited in this embodiment.
  • When the number of third data segments is one, all the data in the first data search space can be determined as the third data segment, or part of the continuous data in the first data search space can be determined as the third data segment, which is not specifically limited in this embodiment. When the number of third data segments is more than one, the data in the first data search space can be divided into multiple data segments of equal length or into multiple data segments of unequal length, which is likewise not specifically limited in this embodiment.
  • The similarity between the first abnormal data segment and the third data segment can be calculated with different similarity measurement algorithms. For example, for multiple third data segments of equal length, Euclidean distance, the Pearson correlation coefficient, or the Spearman rank correlation coefficient can be used. As another example, for multiple third data segments of unequal length, the DTW algorithm or an improved fast DTW algorithm can be used.
  • the specific implementation manner for calculating the similarity between the first abnormal data segment and the third data segment can be appropriately selected according to actual use needs, and this embodiment does not specifically limit it.
  • The improved fast DTW algorithm can include the FastDTW algorithm, the SparseDTW algorithm, the LB_Keogh algorithm, the LB_Improved algorithm, and so on.
  • Taking the FastDTW algorithm as an example, it reduces the computational complexity with little loss of accuracy by limiting the search space and by data abstraction.
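  • To make the role of DTW concrete, the following sketch implements the classic dynamic-programming DTW distance, which compares segments of unequal length. It is the plain quadratic-time algorithm, not FastDTW or any of the other accelerated variants named above; the function name `dtw_distance` and the use of the absolute difference as the point-wise cost are assumptions.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW distance between two sequences (smaller = more similar)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # point-wise cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]
```

  • For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated value can be absorbed by the warping, whereas a Euclidean comparison would not even be defined for these unequal lengths.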
  • step S430 includes but is not limited to the following steps:
  • Step S431 When the first similarity metric value is less than the preset threshold, it is determined that the third data segment corresponding to the first similarity metric value is the second abnormal data segment.
  • The first similarity metric value indicates the degree of similarity between the first abnormal data segment and the third data segment: the smaller the value, the more similar the two segments are. Therefore, when the first similarity metric value is less than the preset threshold, it can be determined that the first abnormal data segment and the third data segment have a high degree of similarity, so the third data segment corresponding to that first similarity metric value can be determined to be the second abnormal data segment.
  • The preset threshold can be appropriately selected according to the similarity measurement algorithm used. For example, different preset thresholds can be used for the Euclidean distance and the DTW algorithm, which is not specifically limited in this embodiment.
  • step S430 may include but is not limited to the following steps:
  • Step S432 Obtain a first similarity metric value whose value is less than a preset threshold
  • Step S433 Sort the first similarity measure values whose values are less than the preset threshold value from small to large to adjust the sorting of the corresponding third data segment;
  • Step S434 Determine that the first N third data segments are second abnormal data segments, where N is greater than or equal to 1.
  • When there are two or more third data segments, there are correspondingly two or more first similarity metric values.
  • For example, the FastDTW algorithm can be used to compare each third data segment in the first data search space with the first abnormal data segment to calculate the similarity metric value of each third data segment. All third data segments are then sorted by similarity metric value to obtain the several third data segments most similar to the first abnormal data segment, from which the second abnormal data segments that need to be labeled can be determined.
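  • Steps S432 to S434 amount to a filter, a sort, and a cut. The sketch below assumes distance-like metric values where smaller means more similar; the function name `select_top_n` and the example numbers are hypothetical.

```python
from typing import List, Sequence


def select_top_n(metrics: Sequence[float], threshold: float, n: int) -> List[int]:
    """Indices of the N most similar third data segments below the threshold."""
    # S432: keep only metric values smaller than the preset threshold.
    candidates = [(m, i) for i, m in enumerate(metrics) if m < threshold]
    # S433: sort from small to large (ties are broken by segment index).
    candidates.sort()
    # S434: the first N segments are taken as second abnormal data segments.
    return [i for _, i in candidates[:n]]


# Made-up metric values, one per third data segment.
metrics = [4.0, 0.5, 2.5, 0.1, 1.0]
top_two = select_top_n(metrics, threshold=2.0, n=2)
```

  • With these values, segments 3, 1, and 4 fall below the threshold, and the two most similar ones are segments 3 and 1.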
  • The optimal value of N may differ between application scenarios. If the value of N is too small, some abnormal data segments will be missed and not marked; if the value of N is too large, some data segments with relatively low similarity may be identified as abnormal, causing accuracy to fall. Therefore, the value of N needs to be appropriately selected according to the actual application. To select N more precisely, the curve of accuracy rate against recall rate can be established and the AUC value calculated to obtain the best value of N.
  • Here, the accuracy rate refers to the proportion of labeled abnormal data segments that are correctly labeled, the recall rate refers to the proportion of manually labeled abnormal data segments that are correctly labeled, and AUC (Area Under Curve) refers to the area under the curve.
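  • The accuracy rate and recall rate defined above can be computed for every candidate N from a ranked list of segments and a set of manually labeled segments. The sketch below does this and integrates the resulting curve with the trapezoidal rule. The text does not specify how the AUC value is turned into a choice of N, so this example only produces the curve and its area; all names and the sample data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(labeled: Sequence[int], truth: Set[int]) -> Tuple[float, float]:
    """Accuracy rate (precision) and recall rate for one set of labeled segments."""
    hits = len(set(labeled) & truth)
    precision = hits / len(labeled) if labeled else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def pr_curve(ranked: Sequence[int], truth: Set[int]) -> List[Tuple[float, float]]:
    """(precision, recall) for N = 1 .. len(ranked), taking the first N segments."""
    return [precision_recall(ranked[:n], truth) for n in range(1, len(ranked) + 1)]


def area_under_curve(curve: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under precision as a function of recall."""
    area = 0.0
    for (p1, r1), (p2, r2) in zip(curve, curve[1:]):
        area += (r2 - r1) * (p1 + p2) / 2
    return area


ranked = [3, 1, 4, 0]   # segment indices sorted from most to least similar (made up)
truth = {3, 4}          # manually labeled abnormal segments (made up)
curve = pr_curve(ranked, truth)
auc = area_under_curve(curve)
```

  • In this example, N = 3 is the smallest N that reaches full recall, at the cost of one incorrectly labeled segment.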
  • step S100 may include but is not limited to the following steps:
  • Step S110 Obtain multiple data sequences to be tested
  • Step S120 performing clustering processing on a plurality of data sequences to be tested to obtain a target data category
  • Step S130 Determine a target data sequence from each target data category.
  • The number of data sequences to be tested collected from the network is very large. If the first abnormal data segment in each data sequence to be tested is determined manually, the workload is relatively large. In addition, a data sequence to be tested may not contain a large amount of similar abnormal data because its acquisition time is short; in this case, obtaining the first abnormal data segment in that data sequence is more difficult. However, for one data indicator, many data sequences to be tested can be collected according to the different resource objects bound to it in the network. For example, a medium-scale network has tens of thousands of port resources.
  • For one data index, tens of thousands of data sequences to be tested can therefore be collected, and these data sequences often have certain similarities. For example, consider the traffic time series data of a base station access port deployed in school A and the traffic time series data of a base station access port deployed in school B. Since the daily life characteristics of school A and school B are similar, these two data sequences to be tested will be similar to a large extent, and in similar data sequences to be tested the characteristics of the abnormal data also have certain commonalities. Based on this, the multiple acquired data sequences to be tested can be clustered to obtain the target data categories, and a target data sequence can then be determined from each target data category, which provides the necessary basis for the subsequent steps.
  • The number of target data categories obtained by clustering multiple data sequences to be tested may be one or more, depending on the similarity of the data sequences to be tested. For example, if all of the data sequences to be tested are relatively similar, they can all be classified into one target data category; if only some of them are relatively similar to each other, they can be divided into multiple target data categories, each of which includes a part of the data sequences to be tested.
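  • The text does not name a clustering algorithm for step S120. As a placeholder, the sketch below uses a simple greedy "leader" clustering on Euclidean distance and treats the first member of each cluster as the target data sequence for step S130. The algorithm choice, the names `leader_cluster` and `radius`, the assumption of equal-length sequences, and the sample data are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(sequences: Sequence[Sequence[float]], radius: float) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose leader is within `radius`.

    Returns clusters as lists of sequence indices; the first index in each
    list is the leader, used here as the representative target data sequence.
    """
    clusters: List[List[int]] = []
    for i, seq in enumerate(sequences):
        for cluster in clusters:
            if euclidean(sequences[cluster[0]], seq) <= radius:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no leader is close enough: start a new category
    return clusters


# Made-up sequences: two similar pairs, e.g. ports with similar traffic shapes.
sequences = [[1, 2, 3], [1, 2, 3.1], [10, 10, 10], [10, 10, 9.9]]
categories = leader_cluster(sequences, radius=0.5)
targets = [c[0] for c in categories]
```

  • The result depends on the order of the sequences and on the chosen radius; a production system would more likely use an established time series clustering method.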
  • the data processing method may further include the following steps:
  • Step S600: acquire a second data search space in each of the remaining data sequences to be tested in each target data category;
  • Step S700: use the first abnormal data segment in the target data sequence to obtain a second abnormal data segment from the second data search space of each remaining data sequence to be tested.
  • the second abnormal data segment in the target data sequence can be obtained and labeled through the steps in the above-mentioned embodiment.
  • Because the data sequences to be tested in the same target data category are similar, the first abnormal data segment obtained in the target data sequence can also be used to obtain and label the second abnormal data segments in the remaining data sequences to be tested in that category. Therefore, the second data search space can first be obtained from each remaining data sequence to be tested in each target data category, and the first abnormal data segment in the target data sequence can then be used to obtain the second abnormal data segment from each of these second data search spaces.
  • Since only the first abnormal data segment in the target data sequence is needed, the step of obtaining a first abnormal data segment in each of the remaining data sequences to be tested can be omitted. The second abnormal data segment in each data sequence to be tested can thus be obtained more concisely and efficiently, which improves the labeling efficiency of abnormal data in multiple time series indicator data.
  • The second data search space in this embodiment and the first data search space in the above embodiment are the same type of technical feature. They differ only in the object they belong to: the first data search space belongs to the target data sequence, whereas the second data search space belongs to the remaining data sequences to be tested in the same target data category. The second data search space is therefore not described in detail here.
  • step S700 in this embodiment is similar to step S400 in the embodiment shown in FIG.
  • The execution object of step S400 is the first data search space of the target data sequence, whereas the execution object of step S700 in this embodiment is the second data search space of the remaining data sequences to be tested in the same target data category.
  • Step S700 is therefore not described in detail here; for step S700, reference may be made to the related explanations of step S400 in the foregoing embodiment.
  • step S600 includes but is not limited to the following steps:
  • Step S610 Obtain the second abnormal characteristic value of the remaining data sequence to be tested in each target data category
  • Step S620 Determine the second data position corresponding to the second abnormal characteristic value in the remaining data sequence to be tested according to the second abnormal characteristic value;
  • Step S630 Obtain the second data search space of the remaining data sequence to be tested according to the second data position.
  • The second abnormal feature value and the second data position in this embodiment are the same types of technical features as the first abnormal feature value and the first data position in the foregoing embodiment, respectively. They differ only in the object they belong to: the first abnormal feature value and the first data position belong to the target data sequence, whereas the second abnormal feature value and the second data position belong to the remaining data sequences to be tested in the same target data category.
  • The second abnormal feature value and the second data position are therefore not described in detail here; for their explanation, refer to the explanation of the first abnormal feature value and the first data position in the above embodiment.
  • Steps S610, S620, and S630 in this embodiment are similar to steps S310, S320, and S330 in the embodiment shown in FIG. 3, with similar technical principles and technical effects. They differ only in the execution object: steps S310, S320, and S330 are executed on the target data sequence, whereas steps S610, S620, and S630 in this embodiment are executed on the remaining data sequences to be tested in the same target data category.
  • Steps S610, S620, and S630 are therefore not described in detail here; for their explanation, refer to the related explanations of steps S310, S320, and S330 in the above embodiment.
  • step S610 includes but is not limited to the following steps:
  • Step S611 Obtain the second baseline prediction data of the remaining data sequence to be tested in each target data category
  • Step S612 Obtain the second abnormal characteristic value of the remaining data sequence to be tested according to the deviation value of the second baseline prediction data and the data in the remaining data sequence to be tested.
  • The second baseline prediction data in this embodiment and the first baseline prediction data in the foregoing embodiment are the same type of technical feature. They differ only in the object they belong to: the first baseline prediction data belongs to the target data sequence, whereas the second baseline prediction data belongs to the remaining data sequences to be tested in the same target data category. The second baseline prediction data is therefore not described in detail here.
  • Steps S611 and S612 in this embodiment are similar to steps S311 and S312 in the embodiment shown in FIG. 4, with similar technical principles and technical effects. They differ only in the execution object: steps S311 and S312 are executed on the target data sequence, whereas steps S611 and S612 in this embodiment are executed on the remaining data sequences to be tested in the same target data category.
  • Steps S611 and S612 are therefore not described in detail here.
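  • Steps S610 to S630, together with steps S611 and S612, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method of the embodiment: the centered moving median used as the baseline predictor, the `deviation_threshold` used to pick data positions, and the `margin` used to widen each position into a search range are all hypothetical choices that the text does not prescribe.

```python
from statistics import median


def moving_median_baseline(seq, half_window=2):
    """Baseline prediction sketch: centered moving median (an assumption)."""
    n = len(seq)
    return [median(seq[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]


def abnormal_feature_values(seq, baseline):
    """S612 sketch: deviation of each data point from its baseline prediction."""
    return [abs(x - b) for x, b in zip(seq, baseline)]


def data_positions(features, deviation_threshold):
    """S620 sketch: positions whose feature value exceeds a hypothetical threshold."""
    return [i for i, f in enumerate(features) if f > deviation_threshold]


def search_space(positions, seq_len, margin=2):
    """S630 sketch: merge +/- margin windows around positions into (start, end) ranges.

    `end` is exclusive; positions are expected in ascending order.
    """
    ranges = []
    for p in positions:
        start, end = max(0, p - margin), min(seq_len, p + margin + 1)
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


sequence = [10, 11, 10, 12, 11, 40, 11, 10, 12, 11, 10, 11]
baseline = moving_median_baseline(sequence)
features = abnormal_feature_values(sequence, baseline)
positions = data_positions(features, deviation_threshold=10)
space = search_space(positions, len(sequence))
print(positions, space)
```

  • In this toy sequence only the spike at index 5 deviates strongly from the baseline, so the search space reduces to a single short range around it instead of the whole sequence.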
  • step S700 includes but is not limited to the following steps:
  • Step S710 respectively determine the fourth data segment in the second data search space in the remaining data sequence to be tested
  • Step S720 Perform similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in the remaining data sequences to be tested to obtain a second similarity metric value corresponding to the fourth data segment;
  • Step S730 Determine the corresponding fourth data segment in the remaining data sequence to be tested as the second abnormal data segment in the remaining data sequence to be tested according to the second similarity metric value.
  • The fourth data segment and the second similarity metric value in this embodiment are the same types of technical features as the third data segment and the first similarity metric value in the foregoing embodiment, respectively. They differ only in the object they belong to: the third data segment belongs to the first data search space of the target data sequence and the first similarity metric value corresponds to the third data segment, whereas the fourth data segment belongs to the second data search space of the remaining data sequences to be tested in the same target data category and the second similarity metric value corresponds to the fourth data segment.
  • The fourth data segment and the second similarity metric value are therefore not described in detail here; for their explanation, refer to the explanation of the third data segment and the first similarity metric value in the above embodiment.
  • Steps S710, S720, and S730 in this embodiment are similar to steps S410, S420, and S430 in the embodiment shown in FIG. 5, with similar technical principles and technical effects. They differ only in the execution object: steps S410, S420, and S430 are executed on the first data search space of the target data sequence, whereas steps S710, S720, and S730 in this embodiment are executed on the second data search space of the remaining data sequences to be tested in the same target data category.
  • Steps S710, S720, and S730 are therefore not described in detail here; for their explanation, refer to the related explanations of steps S410, S420, and S430 in the above embodiment.
  • step S730 includes but is not limited to the following steps:
  • Step S731: when the second similarity metric value is less than the preset threshold, determine that the corresponding fourth data segment in the remaining data sequence to be tested is a second abnormal data segment in that sequence.
  • The second similarity metric value represents the degree of similarity between the first abnormal data segment and the fourth data segment: the smaller the value, the higher the degree of similarity between the two. Therefore, when the second similarity metric value is less than the preset threshold, the first abnormal data segment and the fourth data segment can be considered highly similar, and the fourth data segment corresponding to that value can be determined to be a second abnormal data segment.
  • The preset threshold can be selected according to the similarity measurement algorithm used. For example, different preset thresholds can be used for the Euclidean distance and for the DTW algorithm, which is not specifically limited in this embodiment.
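  • Steps S710, S720, and S731 can be illustrated with a short Python sketch. It assumes that the fourth data segments are all windows of the template's length inside the search space and that the Euclidean distance is the similarity metric; the function names, the `(start, end)` range format, and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_segments(sequence, search_space, length):
    """S710 sketch: yield every window of `length` points inside each range.

    `search_space` is a list of (start, end) index ranges, end exclusive
    (a hypothetical representation).
    """
    for start, end in search_space:
        for i in range(start, end - length + 1):
            yield i, sequence[i:i + length]


def match_segments(template, sequence, search_space, threshold):
    """S720 and S731 sketch: keep windows whose distance to the template
    (the first abnormal data segment) is below the preset threshold."""
    matches = []
    for start, segment in candidate_segments(sequence, search_space, len(template)):
        value = euclidean_distance(template, segment)
        if value < threshold:
            matches.append((start, value))
    return matches


template = [1.0, 5.0, 1.0]  # first abnormal data segment of the target sequence
remaining = [1.0, 1.0, 1.0, 4.8, 1.1, 1.0, 1.0, 1.0, 5.1, 0.9, 1.0]
space = [(1, 6), (6, 11)]   # second data search space of the remaining sequence
matches = match_segments(template, remaining, space, threshold=0.5)
print(matches)
```

  • Because a smaller distance means a higher similarity, the comparison is `value < threshold`. A DTW-based metric would plug into the same structure but would generally need a different threshold.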
  • step S730 may include but is not limited to the following steps:
  • Step S732: for each remaining data sequence to be tested, acquire the second similarity metric values that are less than the preset threshold;
  • Step S733: sort these second similarity metric values in ascending order, and order the corresponding fourth data segments in each remaining data sequence to be tested accordingly;
  • Step S734: determine the first N fourth data segments as the second abnormal data segments in the remaining data sequence to be tested, where N is greater than or equal to 1.
  • Steps S732, S733, and S734 in this embodiment are similar to steps S432, S433, and S434 in the embodiment shown in FIG. 6, with similar technical principles and technical effects. They differ only in the execution object: steps S432, S433, and S434 are executed on the target data sequence, whereas steps S732, S733, and S734 in this embodiment are executed on the remaining data sequences to be tested in the same target data category.
  • Steps S732, S733, and S734 are therefore not described in detail here; for their explanation, refer to steps S432, S433, and S434 in the above embodiment.
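  • The top-N selection of steps S732 to S734 can be sketched as follows. The `(start_index, similarity_value)` tuple format and the function name are assumptions made for this illustration.

```python
def top_n_segments(scored_segments, threshold, n):
    """Select the N most similar segments among those below the threshold.

    scored_segments: list of (start_index, similarity_value) pairs, where a
    smaller similarity value means a higher degree of similarity.
    """
    below = [item for item in scored_segments if item[1] < threshold]  # S732
    below.sort(key=lambda item: item[1])                               # S733
    return below[:n]                                                   # S734


scored = [(2, 0.22), (7, 0.14), (15, 0.80), (21, 0.31), (30, 0.05)]
selected = top_n_segments(scored, threshold=0.5, n=3)
print(selected)
```

  • If fewer than N segments fall below the threshold, the slice simply returns all of them.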
  • step S120 may include but is not limited to the following steps:
  • Step S121 Perform data preprocessing on multiple data sequences to be tested, respectively, to obtain multiple first preprocessed data sequences;
  • Step S122 Perform baseline extraction processing on the multiple first pre-processed data sequences, respectively, to obtain multiple second pre-processed data sequences;
  • Step S123 clustering a plurality of second pre-processed data sequences according to similarity, to obtain a target data category.
  • After the multiple data sequences to be tested are acquired, data preprocessing may be performed on each of them to obtain multiple first preprocessed data sequences. Baseline extraction processing is then performed on each first preprocessed data sequence to obtain multiple second preprocessed data sequences, and the second preprocessed data sequences are clustered according to similarity to obtain the target data categories.
  • In this way, abnormal data segment labeling for each individual data sequence to be tested is transformed into abnormal data segment labeling for each target data category, which reduces processing complexity and processing time and thereby improves the efficiency of labeling abnormal data.
  • performing baseline extraction processing on the first pre-processed data sequence can smooth out abnormal parts and noise parts in the data sequence to be tested, thereby improving the accuracy of the similarity measurement between the data sequences to be tested.
  • the baseline extraction processing in step S122 in this embodiment has a similar technical principle to the step of obtaining baseline prediction data using the baseline prediction method in the embodiment shown in FIG.
  • the relevant explanations of performing the baseline extraction processing in S122 reference may be made to the relevant explanations of using the baseline prediction method to obtain baseline prediction data in the embodiment shown in FIG.
  • step S121 may include but is not limited to the following steps:
  • Step S1211 performing missing value filling processing on the multiple data sequences to be tested respectively, to obtain multiple filling data sequences
  • Step S1212 Perform data standardization processing on the multiple filling data sequences, respectively, to obtain multiple first preprocessed data sequences.
  • The data sequences to be tested collected from the network may contain missing values to varying degrees for various reasons. These missing values not only make the lengths of the data sequences to be tested differ, which makes some similarity measurement algorithms difficult to use, but also affect the accuracy of the baseline extraction processing.
  • Therefore, this embodiment first fills these missing values to obtain a filling data sequence, and then performs data standardization processing on the filling data sequence to obtain the first preprocessed data sequence.
  • a linear interpolation filling method may be used to perform the missing value filling processing.
  • The linear interpolation filling method keeps the waveform of the data sequence to be tested smooth, which facilitates the baseline extraction processing. For example, for time series indicator data, the position of a missing value can be determined from the continuity of the timestamps. Once the position is determined, the value to be filled can be obtained from the data before and after that position; for example, the average of the preceding and following data can be used as the value to be filled.
  • the linear interpolation filling method belongs to an algorithm commonly used in the art, and therefore, the specific principle of the algorithm will not be repeated here.
  • Performing data standardization processing on the filling data sequence maps the data sequence to be tested to a specific range, which helps eliminate dimensional differences between different data sequences to be tested so that their similarity can be compared directly.
  • The Z-Score method may be used for data standardization processing, and the calculation formula is as follows:
  • $x'_i = \dfrac{x_i - \bar{x}}{\sigma}$
  • where $x'_i$ is the first preprocessed data sequence, $x_i$ is the data sequence to be tested, $\bar{x}$ is the mean value of the data sequence to be tested, and $\sigma$ is the standard deviation of the data sequence to be tested.
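  • Steps S1211 and S1212 can be sketched in Python as follows. This is an illustrative sketch only: missing values are represented as `None`, gaps at the start or end of a sequence copy the nearest known value, and the population standard deviation is used for the Z-Score; all three are assumptions, since the text does not fix these details.

```python
from statistics import mean, pstdev


def fill_missing_linear(seq):
    """S1211 sketch: fill None entries by linear interpolation between the
    nearest known neighbours. Leading/trailing gaps copy the nearest known
    value (an assumption)."""
    known = [i for i, v in enumerate(seq) if v is not None]
    if not known:
        raise ValueError("sequence has no known values")
    filled = list(seq)
    for i, v in enumerate(seq):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = seq[right]
        elif right is None:
            filled[i] = seq[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = seq[left] + weight * (seq[right] - seq[left])
    return filled


def z_score(seq):
    """S1212 sketch: subtract the mean and divide by the standard deviation."""
    mu = mean(seq)
    sigma = pstdev(seq)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in seq]  # constant sequence: avoid division by zero
    return [(x - mu) / sigma for x in seq]


raw = [2.0, None, 4.0, 6.0, None, None, 12.0]
filled = fill_missing_linear(raw)
standardized = z_score(filled)
print(filled)
print(standardized)
```

  • For a single missing point between two known neighbours, linear interpolation reduces to the average of the preceding and following values, as in the second element of `raw`.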
  • the clustering of multiple second pre-processed data sequences according to similarity in step S123 may specifically include, but is not limited to, the following steps:
  • Step S1231: use the DBSCAN algorithm to cluster the multiple second preprocessed data sequences according to similarity, where the parameters of the DBSCAN algorithm include the distance function, the neighborhood number threshold, and the neighborhood distance threshold, and the result of the DBSCAN algorithm includes the classification number and the abnormal proportion.
  • DBSCAN algorithm is one of the commonly used clustering algorithms, and the DBSCAN algorithm does not need to determine the number of cluster centers in advance.
  • the key parameters of DBSCAN algorithm include distance function, neighborhood number threshold and neighborhood distance threshold, while the result of DBSCAN algorithm includes classification number and abnormal proportion.
  • For the distance function, the Euclidean distance function can be used in this embodiment; the neighborhood number threshold can be set to 4; the neighborhood distance threshold needs to be estimated dynamically from the data set, and this parameter has a significant impact on the clustering results.
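  • To make the three parameters and the two results concrete, the following is a minimal, self-written DBSCAN sketch in pure Python. It is not the implementation used by the embodiment or by any particular library; the names `dbscan` and `summarize` and the toy data are hypothetical. Here `eps` plays the role of the neighborhood distance threshold and `min_pts` the neighborhood number threshold, with each point counted as its own neighbor (an assumption).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts, dist=euclidean):
    """Minimal DBSCAN. Returns one label per point: a cluster id >= 0,
    or -1 for noise (abnormal sequences)."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1            # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border point
            if labels[j] is not None:
                continue              # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: expand the cluster
    return labels


def summarize(labels):
    """Classification number and abnormal proportion of a clustering result."""
    n_classes = len({label for label in labels if label >= 0})
    abnormal_ratio = sum(1 for label in labels if label == -1) / len(labels)
    return n_classes, abnormal_ratio


points = [[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1], [0.1, 0.1, 0],
          [5, 5, 5], [5.1, 5, 5], [5, 5.1, 5], [5, 5, 5.1], [5.1, 5.1, 5],
          [20, 20, 20]]
labels = dbscan(points, eps=0.5, min_pts=4)
print(labels, summarize(labels))
```

  • The number of clusters is not passed in; it emerges from `eps` and `min_pts`, which is why the choice of the neighborhood distance threshold matters so much for the result.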
  • the neighborhood distance threshold may be obtained by a heuristic algorithm, where the heuristic algorithm includes but is not limited to the following steps:
  • Step S810 Calculate the similarity between a plurality of second pre-processed data sequences by using a distance function to obtain similarity matrix data;
  • Step S820 Calculate the k-dist distance based on the similarity matrix data to obtain the k-dist sequence
  • Step S830 obtaining an initial distance threshold parameter based on the k-dist sequence
  • Step S840 Adjust the initial distance threshold parameter to obtain the neighborhood distance threshold.
  • the k-dist distance refers to the distance between a data object and its k-th closest object.
  • The similarity between the multiple second preprocessed data sequences can be calculated using a distance function such as the Euclidean distance function to form the similarity matrix data. The k-dist distances are then calculated from the similarity matrix data to obtain a k-dist sequence, an initial distance threshold parameter is obtained from the k-dist sequence, and the initial distance threshold parameter is adjusted to obtain an appropriate neighborhood distance threshold.
  • The neighborhood distance threshold can then be used by the DBSCAN algorithm in the above embodiment to cluster the multiple second preprocessed data sequences according to similarity, so that the target data categories can be obtained.
  • Initial thresholds such as the maximum distance threshold of the neighborhood, the minimum length threshold, the slope threshold, and the slope difference threshold may be set first, and steps S810, S820, S830, and S840 are performed after these initial thresholds have been set.
  • When step S820 is performed, after the k-dist distance of each k-dist point is calculated from the similarity matrix data, the obtained k-dist distances can be sorted in ascending order, and the k-dist points whose k-dist distance is 0 and those whose k-dist distance exceeds the maximum distance threshold of the neighborhood are excluded. The remaining k-dist points constitute the k-dist sequence.
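  • Steps S810 and S820 can be sketched as follows, assuming the Euclidean distance as the distance function. The function names, the choice `k=2`, and the toy one-dimensional "sequences" are hypothetical and only serve the illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_matrix(sequences, dist=euclidean):
    """S810 sketch: pairwise distances between all sequences."""
    n = len(sequences)
    return [[dist(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


def k_dist_sequence(matrix, k, max_distance):
    """S820 sketch: distance from each object to its k-th closest other
    object, sorted ascending, with zeros and values above `max_distance`
    (the maximum distance threshold of the neighborhood) excluded.
    Requires k <= len(matrix) - 1."""
    k_dists = []
    for i, row in enumerate(matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        k_dists.append(others[k - 1])
    return [d for d in sorted(k_dists) if 0 < d <= max_distance]


sequences = [[0.0], [1.0], [2.0], [4.0], [20.0]]
matrix = similarity_matrix(sequences)
k_dist = k_dist_sequence(matrix, k=2, max_distance=10.0)
print(k_dist)
```

  • The isolated sequence `[20.0]` has a k-dist distance of 18, which exceeds the maximum distance threshold and is therefore dropped from the k-dist sequence.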
  • step S830 may include but is not limited to the following steps:
  • Step S831: calculate the slopes between each k-dist point in the k-dist sequence and its two adjacent points (the preceding one and the following one); when both slopes are less than the preset slope threshold and the difference between the two slopes is less than the preset slope difference threshold, determine the current k-dist point as a candidate distance threshold;
  • Step S832 Determine the one with the largest value among the candidate distance thresholds as the initial distance threshold parameter.
  • The relatively flat k-dist points in the k-dist sequence may first be determined as candidate distance thresholds. Specifically, for each k-dist point in the k-dist sequence, the slopes to its two adjacent points are calculated. If the slope between the current k-dist point and the preceding adjacent point (which can be defined as the left slope) and the slope between the current k-dist point and the following adjacent point (which can be defined as the right slope) are both less than the preset slope threshold, and the difference between the left slope and the right slope is less than the preset slope difference threshold, the current k-dist point is determined as a candidate distance threshold.
  • The candidate distance thresholds can then be sorted in descending order, and the candidate distance threshold with the maximum value is taken as the initial distance threshold parameter.
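  • Steps S831 and S832 can be sketched as follows. The sketch assumes unit spacing between consecutive points of the sorted k-dist sequence, so that a slope is simply the difference between neighboring values; the function name and the threshold values are hypothetical.

```python
def initial_distance_threshold(k_dist, slope_threshold, slope_diff_threshold):
    """Return (initial_threshold, candidates) from a sorted k-dist sequence.

    A point is a candidate when its left slope and right slope are both below
    `slope_threshold` and their difference is below `slope_diff_threshold`.
    Slopes assume unit spacing between consecutive points (an assumption).
    """
    candidates = []
    for i in range(1, len(k_dist) - 1):
        left_slope = k_dist[i] - k_dist[i - 1]
        right_slope = k_dist[i + 1] - k_dist[i]
        if (left_slope < slope_threshold
                and right_slope < slope_threshold
                and abs(left_slope - right_slope) < slope_diff_threshold):
            candidates.append(k_dist[i])          # S831: flat point
    if not candidates:
        return None, []
    return max(candidates), candidates            # S832: largest candidate


k_dist = [0.10, 0.12, 0.13, 0.15, 0.40, 0.42, 0.43, 1.50]
initial, candidates = initial_distance_threshold(
    k_dist, slope_threshold=0.05, slope_diff_threshold=0.03)
print(initial, candidates)
```

  • Points next to a sharp jump (for example 0.15, followed by 0.40) are rejected because one of their slopes is too steep, so only points on the flat stretches survive as candidates.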
  • step S840 may include but is not limited to the following steps:
  • Step S841: obtain the step length;
  • Step S842: adjust the initial distance threshold parameter according to the step length to obtain a distance adjustment threshold; when the classification number decreases, determine that the distance adjustment threshold obtained in the previous step adjustment is the neighborhood distance threshold.
  • After the initial distance threshold parameter is obtained, it can be further optimized. Note that the optimization should reduce the abnormal proportion as much as possible while keeping the classification number unchanged.
  • As the distance threshold increases, the abnormal proportion keeps decreasing and the classification number may also decrease. The value of the initial distance threshold parameter can therefore be increased step by step to determine the best neighborhood distance threshold: starting from the initial distance threshold parameter, the classification number and the abnormal proportion are recalculated after each increase by one step length, and the increase stops once the classification number drops. The distance adjustment threshold obtained in the previous step adjustment is then taken as the optimal neighborhood distance threshold.
  • The step length can be set according to empirical values or according to the candidate distance thresholds. For example, when set according to the candidate distance thresholds, the step length can be one-tenth of the difference between the maximum candidate distance threshold and the minimum candidate distance threshold, which is not specifically limited in this embodiment.
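  • The step-wise adjustment of steps S841 and S842 can be sketched as a simple loop. In this sketch `cluster_fn` is a hypothetical callable that runs the clustering for a given distance threshold and returns the classification number and the abnormal proportion; `toy_cluster_fn` is a made-up stand-in for a real clustering run, and the `max_steps` guard is an added safety assumption.

```python
def tune_neighborhood_distance(initial_eps, step, cluster_fn, max_steps=100):
    """Increase the distance threshold by `step` until the classification
    number drops, then return the last threshold before the drop."""
    base_classes, _ = cluster_fn(initial_eps)
    best_eps = initial_eps
    for _ in range(max_steps):
        candidate = best_eps + step
        n_classes, _ = cluster_fn(candidate)
        if n_classes < base_classes:
            break                 # classification number dropped: stop increasing
        best_eps = candidate      # same number of classes, fewer abnormal points
    return best_eps


def toy_cluster_fn(eps):
    """Hypothetical stand-in: classes merge once eps reaches 0.75."""
    n_classes = 3 if eps < 0.75 else 2
    abnormal_ratio = max(0.0, 0.30 - 0.2 * eps)
    return n_classes, abnormal_ratio


best = tune_neighborhood_distance(0.4, 0.1, toy_cluster_fn)
print(best, toy_cluster_fn(best))
```

  • With the toy function, the threshold grows from 0.4 to about 0.7; the next step would reduce the classification number from 3 to 2, so the previous value is kept.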
  • the heuristic algorithm specifically includes the following steps:
  • Step S901: threshold setting. Initial thresholds such as the maximum distance threshold of the neighborhood, the minimum length threshold, the slope threshold, and the slope difference threshold are set.
  • Step S902: sequence similarity matrix calculation. The similarity between each pair of data sequences is calculated by the distance function to form the similarity matrix data.
  • Step S903: k-dist distance calculation and sorting. The k-dist distance of each k-dist point is calculated from the similarity matrix data, and the distances are sorted in ascending order.
  • Step S904: filtering according to the maximum distance threshold. The k-dist points whose k-dist distance is 0 and those whose k-dist distance exceeds the maximum distance threshold of the neighborhood are excluded.
  • Step S905: take the values of the k-dist sequence in order.
  • Step S906: judge whether the left slope and the right slope are both less than the slope threshold. The slopes between the current k-dist point and its two adjacent points are calculated; if the slope between the current k-dist point and the preceding adjacent point (the left slope) and the slope between the current k-dist point and the following adjacent point (the right slope) are both less than the preset slope threshold, step S907 is executed; otherwise, step S905 is executed.
  • Step S907: judge whether the difference between the left slope and the right slope is less than the slope difference threshold. If so, step S908 is executed; otherwise, step S905 is executed.
  • Step S908: determine the current k-dist point as a candidate distance threshold, and repeat steps S905 to S907; when all candidate distance thresholds have been obtained, step S909 is executed.
  • Step S909: sort the candidate distance thresholds in descending order and take the largest one as the initial distance threshold parameter.
  • Step S910: execute the clustering algorithm to obtain the classification number and the abnormal proportion.
  • Step S911: add one step length to the current distance threshold, starting from the initial distance threshold parameter, and execute the clustering algorithm again.
  • Step S912: judge whether the classification number has decreased; if so, step S913 is executed; otherwise, step S911 is executed.
  • Step S913: determine the previous distance threshold as the best distance threshold.
  • step S130 may include but is not limited to the following steps:
  • Step S131: in each target data category, calculate, for each data sequence to be tested, the average of the distances between it and the other data sequences to be tested, and determine the data sequence to be tested with the smallest average distance as the target data sequence.
  • A core data sequence representing the corresponding target data category may thus be determined for each target data category; that is, a target data sequence is determined from each target data category.
  • the target data sequence can be determined by the following formula:
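  • The selection rule of step S131 amounts to picking the sequence with the smallest average distance to the other sequences in its category. A minimal Python sketch, assuming the Euclidean distance and equal-length sequences (the function name and the toy data are hypothetical):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_target_sequence(category, dist=euclidean):
    """Return the index of the sequence whose average distance to all other
    sequences in the category is smallest (ties resolve to the first index)."""
    if len(category) == 1:
        return 0

    def mean_distance(i):
        total = sum(dist(category[i], other)
                    for j, other in enumerate(category) if j != i)
        return total / (len(category) - 1)

    return min(range(len(category)), key=mean_distance)


category = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]]
target_index = select_target_sequence(category)
print(target_index, category[target_index])
```

  • The outlying sequence `[10.0, 10.0]` pulls every average up, but the sequence in the middle of the dense group still has the smallest average distance and is chosen as the core sequence.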
  • FIG. 17 is a main flow diagram of the data processing method provided in this example. Based on the main flow diagram shown in FIG. 17, the data processing method specifically includes the following steps:
  • The corresponding abnormal data segment template is automatically obtained from the core sequence of the category to which each sequence belongs, and a search for similar abnormal segments is performed in the data search space of the sequence to obtain the N abnormal segments with the highest degree of similarity, which are then labeled as abnormal;
  • another embodiment of the present invention also provides a device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor.
  • the processor and the memory can be connected by a bus or in other ways.
  • the memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory includes a memory remotely arranged with respect to the processor, and these remote memories may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • The device in this embodiment may include the system architecture platform in the embodiment shown in FIG. 1. The device in this embodiment and the system architecture platform in the embodiment shown in FIG. 1 belong to the same inventive concept, so they have the same implementation principle and technical effect, which are not detailed here.
  • Steps S100 to S500, method steps S310 to S330 in FIG. 3, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, and method in FIG. 7 Steps S110 to S130, method steps S600 to S700 in FIG. 8, method steps S610 to S630 in FIG. 9, method steps S611 to S612 in FIG. 10, method steps S710 to S730 in FIG. 11, and method in FIG. 12 Steps S732 to S734, method steps S121 to S123 in FIG. 13, method steps S1211 to S1212 in FIG. 14, method steps S810 to S840 in FIG. 15, and method steps S901 to S913 in FIG.
  • the device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • another embodiment of the present invention also provides a computer-readable storage medium, the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by a processor or a controller, for example, by The execution of a processor in the foregoing device embodiment may cause the foregoing processor to execute the data processing method in the foregoing embodiment, for example, to execute the method steps S100 to S500 in FIG. 2 and the method steps S310 to S310 in FIG. 3 described above. S330, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, method steps S110 to S130 in FIG. 7, and method steps S600 to S600 in FIG.
  • The embodiment of the present invention includes: acquiring the target data sequence; acquiring the first abnormal data segment in the target data sequence; acquiring the first data search space in the target data sequence; acquiring, from the first data search space and according to the first abnormal data segment, the second abnormal data segment corresponding to the first abnormal data segment; and labeling the second abnormal data segment.
  • The first abnormal data segment can be used as an abnormal data segment template, so that the corresponding second abnormal data segment can be acquired from the first data search space according to the template and labeled, which achieves the purpose of labeling the other abnormal data segments in the target data sequence. Therefore, compared with traditional manual labeling of abnormal data segments, the solution provided by the embodiment of the present invention improves the labeling efficiency of abnormal data and thereby saves human resources and time.
  • Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
  • Computer storage media include but are not limited to RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • A communication medium usually contains computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium.

Abstract

A data processing method and device, and a computer-readable storage medium. The data processing method comprises: acquiring a target data sequence (S100); acquiring a first anomalous data segment from the target data sequence (S200); acquiring a first data search space from the target data sequence (S300); acquiring, according to the first anomalous data segment and from the first data search space, a second anomalous data segment corresponding to the first anomalous data segment (S400); and labeling the second anomalous data segment (S500).

Description

Data processing method, device, and computer-readable storage medium
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202010473617.0, filed on May 29, 2020, the entire content of which is hereby incorporated into this application by reference.
Technical field
The embodiments of the present invention relate to, but are not limited to, the field of information processing technology, and in particular to a data processing method, a device, and a computer-readable storage medium.
Background
With the development of big data and artificial intelligence technology, smarter and more efficient machine learning techniques are increasingly being introduced into the operation and maintenance of communication networks, for example anomaly detection on indicators, trend prediction, and fault root cause analysis. To work well, these techniques usually depend on high-quality training data sets, and reliable label data is part of a high-quality training data set. Moreover, the great success of deep learning in image and speech recognition in recent years has depended on labeled data sets produced by large amounts of manual labeling.
However, manually labeling a huge training data set is very expensive and consumes a great deal of human resources and time. For example, a medium-scale network has a massive volume of time series data, numbering in the millions. Manually labeling all the abnormal data in them is an impossible task, and even when methods such as empirical formulas are used to assist labeling, the results are inaccurate and incomplete. Therefore, how to improve the efficiency of labeling abnormal data is a technical problem to be solved urgently.
Summary of the invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
In a first aspect, embodiments of the present invention provide a data processing method, a device, and a computer-readable storage medium, which solve the above technical problem at least to a certain extent.
In a second aspect, an embodiment of the present invention provides a data processing method, including: acquiring a target data sequence; acquiring a first abnormal data segment in the target data sequence; acquiring a first data search space in the target data sequence; acquiring, according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment in the first data search space; and labeling the second abnormal data segment.
In a third aspect, an embodiment of the present invention further provides a device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data processing method of the second aspect described above when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the data processing method described above.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by implementing the present invention. The objectives and other advantages of the present invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the drawings.
Description of the drawings
The accompanying drawings are used to provide a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they are used to explain the technical solution of the present invention, and do not constitute a limitation on it.
FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 4 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 5 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 6 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 7 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 8 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 9 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 10 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 11 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 12 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 13 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 14 is a flowchart of a data processing method provided by another embodiment of the present invention;
FIG. 15 is a flowchart of a heuristic algorithm provided by an embodiment of the present invention;
FIG. 16 is a flowchart of a heuristic algorithm provided by another embodiment of the present invention;
FIG. 17 is a main flow block diagram of a data processing method provided by another embodiment of the present invention.
Detailed description
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
Consider the data collected during network operation, such as time series indicator data. Most time series indicator data exhibits some periodicity, and many recurring abnormal data tend to appear at the same position in different periods, with abnormal shapes that resemble one another. In particular, in a time series indicator that spans a long duration and contains a large amount of abnormal data, most of the abnormal data can be assigned to a few classes of anomalies with similar characteristics, and the number of truly unique abnormal data is relatively small. In addition, similar abnormal data can also occur across different but closely related time series indicators. For example, if the central processing unit (CPU) utilization of one network element spikes abnormally during a certain period, the same pattern may also appear in the CPU utilization time series of another network element that carries the same type of service.
Based on the above, the embodiments of the present invention provide a data processing method, a device, and a computer-readable storage medium that exploit the periodicity of the recurring abnormal data found in most data. A first abnormal data segment and a first data search space are acquired from the target data sequence, so that the first abnormal data segment can serve as an abnormal data segment template. The corresponding second abnormal data segment can then be acquired from the first data search space according to that template and labeled, which achieves the purpose of labeling the other abnormal data segments in the target data sequence. Therefore, for time series indicator data that contains a large amount of abnormal data but few types of anomaly, the solution provided by the embodiments of the present invention can improve the efficiency of labeling abnormal data compared with traditional manual labeling of abnormal data segments, thereby saving human resources and time.
The embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in FIG. 1, FIG. 1 is a schematic diagram of a system architecture platform for executing a data processing method provided by an embodiment of the present invention.
In the example of FIG. 1, the system architecture platform includes a memory 110 and a processor 120. The memory 110 and the processor 120 may be connected by a bus or in other ways; FIG. 1 takes connection by a bus as an example.
As a non-transitory computer-readable storage medium, the memory 110 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 110 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 110 includes memories located remotely from the processor 120, and these remote memories may be connected to the system architecture platform through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art can understand that the system architecture platform can be applied to various network controllers or network managers, which is not specifically limited in this embodiment. In addition, a network controller or network manager having the system architecture platform can be applied to various network systems, for example 3G communication network systems, LTE communication network systems, 5G communication network systems, and subsequently evolved mobile communication network systems, which is not specifically limited in this embodiment.
Those skilled in the art can understand that the system architecture platform shown in FIG. 1 does not constitute a limitation on the embodiments of the present invention, and may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
In the system architecture platform shown in FIG. 1, the processor 120 can call the data processing program stored in the memory 110 to execute the data processing method.
Based on the above system architecture platform, various embodiments of the data processing method of the present invention are presented below.
As shown in FIG. 2, FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention. The data processing method includes but is not limited to steps S100, S200, S300, S400, and S500.
Step S100: acquire a target data sequence.
In an embodiment, the target data sequence may be time series indicator data or other sequence data, where the other sequence data may be non-time-series indicator data such as service type sequence data or service quantity sequence data, which is not specifically limited in this embodiment. In addition, the target data sequence may be acquired automatically from the network by a device having the above system architecture platform, or may be entered manually into such a device, which is not specifically limited in this embodiment.
Step S200: acquire a first abnormal data segment in the target data sequence.
In an embodiment, the first abnormal data segment is a data segment of the target data sequence that contains abnormal data. The first abnormal data segment may first be determined and selected manually, and then either entered into a device having the above system architecture platform so that the device acquires it, or saved to the memory of the device so that the device can read it from the memory. Acquiring the first abnormal data segment provides the necessary basis for labeling the remaining abnormal data segments in the target data sequence in the subsequent steps.
It is worth noting that, in time series indicator data, abnormal data often occurs at one or more consecutive time points. A time point at which abnormal data occurs is called an abnormal time point, and the data segment corresponding to a set of abnormal time points is called an abnormal data segment. An abnormal data segment may last a long time (that is, contain many abnormal time points). An abnormal data segment therefore needs at least the following characteristics: a start time point, an end time point, at least 3 time points, and no time points shared with any other abnormal data segment.
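The segment properties listed above can be illustrated with a minimal sketch. It assumes the abnormal time points are already known as a list of booleans; the function name `segments_from_flags` and the `min_points` parameter are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def segments_from_flags(flags: List[bool], min_points: int = 3) -> List[Tuple[int, int]]:
    """Group consecutive abnormal time points into (start, end) segments.

    Indices are inclusive. Runs shorter than `min_points` are dropped, and
    because each run is closed before the next one opens, segments never overlap.
    """
    segments = []
    start = None  # index where the current run of abnormal points began
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_points:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(flags) - start >= min_points:
        segments.append((start, len(flags) - 1))
    return segments
```

In this sketch a run of only two abnormal points is discarded, which reflects one reading of the "at least 3 time points" requirement.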
Step S300: acquire a first data search space in the target data sequence.
In an embodiment, the data search space is a portion of candidate abnormal data extracted from the target data sequence by a machine learning method. Acquiring the data search space filters out most of the normal data, which prevents similar data segments that occur in normal data from being mistaken for similar abnormal data and returned by the search, and therefore improves the accuracy of the search for similar abnormal data. It also narrows the range in which similar abnormal data is searched, which improves search efficiency.
In an embodiment, the first abnormal data segment may be a data segment outside the first data search space or a data segment inside it, which is not specifically limited in this embodiment. When the first abnormal data segment is a data segment in the first data search space, abnormal data has already undergone a preliminary extraction during acquisition of the first data search space, so acquiring the first abnormal data segment from the first data search space makes that acquisition more accurate and effective.
Step S400: according to the first abnormal data segment, acquire a second abnormal data segment corresponding to the first abnormal data segment in the first data search space.
In an embodiment, the first abnormal data segment can be used as an abnormal data segment template. The data segments in the first data search space are compared with the first abnormal data segment to find those that are the same as or similar to it; each such data segment is a second abnormal data segment. Using the first abnormal data segment as a template to acquire the corresponding second abnormal data segment in the first data search space thus achieves the purpose of finding abnormal data in the target data sequence. Therefore, for time series indicator data that contains a large amount of abnormal data but few types of anomaly, this embodiment can improve the efficiency of finding abnormal data compared with the traditional manual approach, thereby saving human resources and time.
In an embodiment, all the data in the first data search space may be treated as one data segment, and the similarity may be calculated with the dynamic time warping (DTW) algorithm; when the result indicates similarity, the second abnormal data segment corresponding to the first abnormal data segment in the first data search space can be determined. Alternatively, the data in the first data search space may be divided into multiple data segments of the same length as the first abnormal data segment, and a method such as Euclidean distance or the Pearson correlation coefficient may be used to calculate the similarity between the first abnormal data segment and each of these data segments, thereby determining the second abnormal data segment corresponding to the first abnormal data segment in the first data search space.
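The equal-length comparison described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the names `euclidean` and `find_similar_segments`, the `step` parameter, and the `max_distance` threshold are hypothetical, and the text does not say how a similarity threshold would be chosen.

```python
import math
from typing import List, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_similar_segments(
    template: Sequence[float],
    search_space: Sequence[float],
    max_distance: float,
    step: Optional[int] = None,
) -> List[int]:
    """Return start indices of windows whose distance to the template is small.

    By default the search space is cut into non-overlapping windows of the
    template's length (step = len(template)); step=1 gives a sliding window.
    """
    m = len(template)
    if m == 0:
        return []
    stride = step if step else m
    hits = []
    for start in range(0, len(search_space) - m + 1, stride):
        window = search_space[start:start + m]
        if euclidean(template, window) <= max_distance:
            hits.append(start)
    return hits
```

A smaller `step` finds matches that do not fall on window boundaries, at the cost of more distance computations.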
Step S500: label the second abnormal data segment.
In an embodiment, once the second abnormal data segment corresponding to the first abnormal data segment has been found in the first data search space, it can be labeled. This facilitates forming abnormal label data and obtaining a high-quality training data set, which can be used for machine learning techniques such as deep learning.
In an embodiment, because the data processing method uses steps S100, S200, S300, S400, and S500, a first abnormal data segment and a first data search space are acquired from the target data sequence, so that the first abnormal data segment can serve as an abnormal data segment template. The corresponding second abnormal data segment can then be acquired from the first data search space according to that template and labeled, which achieves the purpose of labeling the other abnormal data segments in the target data sequence. Therefore, for time series indicator data that contains a large amount of abnormal data but few types of anomaly, the data processing method of this embodiment can improve the efficiency of labeling abnormal data compared with traditional manual labeling of abnormal data segments, thereby saving human resources and time.
In addition, referring to FIG. 3, in an embodiment, step S300 includes but is not limited to the following steps:
Step S310: acquire a first abnormal characteristic value of the target data sequence;
Step S320: determine, according to the first abnormal characteristic value, a first data position in the target data sequence corresponding to the first abnormal characteristic value;
Step S330: acquire the first data search space according to the first data position.
Those skilled in the art can understand that abnormal data is data that deviates from most of the data in a data set. Accordingly, the first abnormal characteristic value in this embodiment refers to the deviation between abnormal data and normal data.
In an embodiment, after the first abnormal characteristic value of the target data sequence is acquired, the first data position in the target data sequence corresponding to it can be determined, that is, the position of the abnormal data in the target data sequence is determined according to the first abnormal characteristic value. For example, when a data point deviates from most of the data by a distance greater than or equal to the first abnormal characteristic value, the position of that data point can be determined to be the position of abnormal data.
In an embodiment, the first data position may include a start abnormal position, a middle abnormal position, and an end abnormal position. Once the start, middle, and end abnormal positions are determined, the first data search space can be obtained.
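The thresholding rule described above (a point is treated as abnormal when its deviation is at least the first abnormal characteristic value) can be sketched as below. This is only an illustration under simplifying assumptions: the per-point deviations are taken as given, and the function name `search_space_from_deviations` and its return shape are hypothetical. The embodiments may instead use algorithms such as LOF, DBSCAN, or isolation forest for this step.

```python
from typing import List, Sequence, Tuple


def search_space_from_deviations(
    sequence: Sequence[float],
    deviations: Sequence[float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Keep (position, value) pairs whose deviation reaches the threshold.

    `deviations[i]` is assumed to be the deviation of `sequence[i]` from
    normal behaviour; `threshold` plays the role of the abnormal
    characteristic value in this sketch.
    """
    if len(sequence) != len(deviations):
        raise ValueError("sequence and deviations must have the same length")
    return [
        (i, sequence[i])
        for i, deviation in enumerate(deviations)
        if deviation >= threshold
    ]
```

The first and last positions in the returned list then correspond to the start and end abnormal positions of the candidate data.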
It is worth noting that, once the first abnormal characteristic value is acquired, different algorithms such as the LOF (Local Outlier Factor) algorithm, the DBSCAN algorithm, or the isolation forest (iForest) algorithm can be used to acquire the first data search space, which is not specifically limited in this embodiment. Taking the isolation forest algorithm as an example: first, an iTree (tree) is built from the target data sequence, using the data in the target data sequence as the sample data of the tree. Next, the sample data is split in two using the first abnormal characteristic value, separating the sample data that meets the first abnormal characteristic value from the sample data that does not, so that two data sets are formed. The same process is then repeated on each of the two data sets until the data can no longer be divided or the maximum height of the tree is reached. In this way the first data position in the target data sequence corresponding to the first abnormal characteristic value is obtained, and the first data search space can be determined according to the first data position.
In addition, those skilled in the art can understand that the LOF algorithm, the DBSCAN algorithm, and the isolation forest algorithm are all commonly used in the field, so their specific principles are not repeated here.
In addition, referring to FIG. 4, in an embodiment, step S310 includes but is not limited to the following steps:
Step S311: acquire first baseline prediction data of the target data sequence;
Step S312: obtain the first abnormal characteristic value according to the deviation between the first baseline prediction data and the data in the target data sequence.
In an embodiment, when performing the step of acquiring the first abnormal characteristic value of the target data sequence, the baseline prediction data of the normal data in the target data sequence may be acquired first, and the first abnormal characteristic value is then obtained from the deviation (that is, the absolute difference) between the data in the target data sequence and the baseline prediction data.
In an embodiment, different baseline prediction methods can be used to acquire the baseline prediction data of the normal data in the target data sequence, which is not specifically limited in this embodiment. For example, the baseline prediction method may be a time series baseline prediction method such as differencing, moving average, weighted moving average, exponentially weighted moving average, autoregressive integrated moving average (ARIMA), or triple exponential smoothing, or a regression method such as random forest or XGBoost (eXtreme Gradient Boosting). Using multiple baseline prediction methods yields multiple first abnormal characteristic values, and combining these different first abnormal characteristic values when performing the corresponding steps to acquire the first data search space helps improve the accuracy and generalization ability of the first data search space.
In an embodiment, after the first baseline prediction data of the target data sequence is acquired, the first abnormal characteristic value can be obtained by the following formula:
$R = |P_i - X_i|$
where $R$ is the first abnormal characteristic value, $X_i$ is the data in the target data sequence, and $P_i$ is the first baseline prediction data of the target data sequence.
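As a concrete illustration of the formula, the sketch below uses a trailing moving average, one of the baseline prediction methods the text lists, to produce the baseline and then takes the absolute difference at each point. The function names, the `window` parameter, and the choice to fall back to the observed value when no history exists are assumptions made for this example only.

```python
from typing import List, Sequence


def moving_average_baseline(x: Sequence[float], window: int) -> List[float]:
    """Predict each point as the mean of up to `window` preceding points.

    The first point has no history, so its own value is used as the
    prediction (an assumption of this sketch, giving a deviation of 0).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    baseline = []
    for i in range(len(x)):
        history = x[max(0, i - window):i]
        baseline.append(sum(history) / len(history) if history else x[i])
    return baseline


def abnormal_feature_values(x: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Compute R = |P_i - X_i| for every position i."""
    return [abs(p - v) for p, v in zip(baseline, x)]
```

For the series `[10, 10, 10, 50, 10]` with `window=3`, the spike at index 3 gets a deviation of 40, while the points before it get 0. The point after the spike also gets a nonzero deviation, because the spike has entered its history window, which is one reason the text suggests combining several baseline methods.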
Those skilled in the art can understand that differencing, moving average, weighted moving average, exponentially weighted moving average, ARIMA, triple exponential smoothing, random forest, and XGBoost are all commonly used in the field, so their specific principles are not repeated here.
另外,参照图5,在一实施例中,步骤S400包括但不限于有以下步骤:In addition, referring to FIG. 5, in an embodiment, step S400 includes but is not limited to the following steps:
步骤S410,在第一数据搜索空间中确定第三数据段;Step S410: Determine a third data segment in the first data search space;
步骤S420,对第一异常数据段与第三数据段进行相似度计算,得到与第三数据段对应的第一相似度量值;Step S420: Perform similarity calculation on the first abnormal data segment and the third data segment to obtain a first similarity metric value corresponding to the third data segment;
步骤S430,根据第一相似度量值确定对应的第三数据段为第二异常数据段。Step S430: Determine the corresponding third data segment as the second abnormal data segment according to the first similarity metric value.
In an embodiment, when the step of obtaining, in the first data search space and according to the first abnormal data segment, the second abnormal data segment corresponding to the first abnormal data segment is performed, a third data segment may first be determined in the first data search space. Once the third data segment is determined, a similarity measurement algorithm is used to calculate the similarity between the first abnormal data segment and the third data segment, yielding the first similarity metric value corresponding to the third data segment. When the first similarity metric value indicates that the first abnormal data segment is similar to the third data segment, the corresponding third data segment can be determined to be a second abnormal data segment (that is, one of the remaining abnormal data segments in the first data search space). In other words, whether the third data segment is a second abnormal data segment is decided by comparing how similar it is to the first abnormal data segment. Compared with traditional manual labeling of abnormal data segments, this embodiment can improve the efficiency of labeling abnormal data, thereby saving labor and time.
In an embodiment, there may be one third data segment or multiple third data segments, which is not specifically limited in this embodiment. When there is one third data segment, either all of the data in the first data search space or a continuous part of that data may be determined as the third data segment, which is not specifically limited in this embodiment. When there are multiple third data segments, the data in the first data search space may be divided into multiple data segments of equal length or into multiple data segments of unequal length, which is not specifically limited in this embodiment.
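As an illustration only, dividing a data search space into multiple equal-length third data segments could be sketched in Python as follows. The function name `equal_length_segments` and the `step` parameter (which allows overlapping windows when it is smaller than `length`) are assumptions made for this sketch and are not part of the embodiment.

```python
def equal_length_segments(search_space, length, step=1):
    """Return (start_index, segment) pairs of fixed length, one every `step` points.

    `step == length` gives a non-overlapping partition; a smaller `step`
    gives overlapping candidate segments (an assumption of this sketch).
    """
    if length < 1 or step < 1:
        raise ValueError("length and step must be positive")
    return [
        (start, search_space[start:start + length])
        for start in range(0, len(search_space) - length + 1, step)
    ]
```

Keeping the start index with each segment makes it possible to map a matching segment back to its position in the data sequence when it is labeled.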
In an embodiment, the similarity calculation between the first abnormal data segment and the third data segment may be implemented with different similarity measurement algorithms. For example, for multiple third data segments of equal length, the similarity between the first abnormal data segment and the third data segment may be calculated using the Euclidean distance, the Pearson correlation coefficient, or the Spearman rank correlation coefficient. As another example, for multiple third data segments of unequal length, the DTW (dynamic time warping) algorithm or an improved fast DTW algorithm may be used. The specific way of calculating the similarity between the first abnormal data segment and the third data segment may be selected according to actual needs, and is not specifically limited in this embodiment. It is worth noting that the improved fast DTW algorithms may include the FastDTW algorithm, the SparseDTW algorithm, the LB_Keogh algorithm, the LB_Improved algorithm, and so on. Among these, the FastDTW algorithm constrains and narrows the search space and uses data abstraction, so that the computational complexity is reduced with little loss of accuracy.
Those skilled in the art will understand that the Euclidean distance, the Pearson correlation coefficient, the Spearman rank correlation coefficient, the DTW algorithm, and the various improved fast DTW algorithms are all commonly used in this field; therefore, the specific principles of these algorithms are not repeated here.
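For readers who want a concrete reference point, the following Python sketch shows two of the measures named above: the Euclidean distance for equal-length segments and the classic (unoptimized) DTW distance for segments of unequal length. It is a minimal illustration under the assumption that segments are plain lists of numbers and that the absolute difference is used as the pointwise cost; it is not the FastDTW variant and is not taken from the embodiment.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs segments of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b[j-1] is reused
                                 cost[i][j - 1],      # b advances, a[i-1] is reused
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Both functions return smaller values for more similar segments, which matches the convention used for the first similarity metric value in the steps that follow.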
In addition, in an embodiment, step S430 includes, but is not limited to, the following step:
Step S431: when the first similarity metric value is less than a preset threshold, determine that the third data segment corresponding to the first similarity metric value is a second abnormal data segment.
In an embodiment, the first similarity metric value represents the degree of similarity between the first abnormal data segment and the third data segment: the smaller the first similarity metric value, the more similar the two segments are. Therefore, when the first similarity metric value is less than the preset threshold, the first abnormal data segment and the third data segment can be considered highly similar, and the third data segment corresponding to that first similarity metric value can be determined to be a second abnormal data segment.
In an embodiment, the preset threshold may be selected according to the similarity measurement algorithm used; for example, different preset thresholds may be used for the Euclidean distance and for the DTW algorithm, which is not specifically limited in this embodiment.
In addition, referring to FIG. 6, in an embodiment, when there are two or more third data segments, step S430 may include, but is not limited to, the following steps:
Step S432: obtain the first similarity metric values that are less than the preset threshold;
Step S433: sort the first similarity metric values that are less than the preset threshold in ascending order, so as to reorder the corresponding third data segments;
Step S434: determine the first N third data segments as second abnormal data segments, where N is greater than or equal to 1.
In an embodiment, when there are two or more third data segments, two or more first similarity metric values corresponding to the third data segments are obtained. In this case, the first similarity metric values that are less than the preset threshold may be obtained first, so as to select the third data segments that are sufficiently similar to the first abnormal data segment and exclude the remaining, less similar third data segments. Next, the first similarity metric values that are less than the preset threshold are sorted in ascending order to reorder the corresponding third data segments, so that the selected third data segments are ranked from most similar to least similar. Then, depending on the actual application, the first several third data segments are determined to be second abnormal data segments.
As a specific example, the FastDTW algorithm may be used to compare each third data segment in the first data search space with the first abnormal data segment, so as to calculate a similarity metric value for each third data segment. All third data segments are then sorted by similarity metric value to obtain the several third data segments that are most similar to the first abnormal data segment, and the second abnormal data segments to be labeled can be determined from these third data segments.
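The threshold, sort, and top-N selection described in steps S431 to S434 can be sketched as follows. The function name `select_second_abnormal` and the representation of each candidate as a `(segment_id, similarity_value)` pair are assumptions of this sketch; the similarity values are taken to be distances, so smaller means more similar.

```python
def select_second_abnormal(scored_segments, threshold, top_n):
    """Pick the top_n most similar candidate segments below the threshold.

    scored_segments: list of (segment_id, similarity_value) pairs,
    where a smaller similarity_value means a closer match.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # S432: keep only candidates whose metric value is below the threshold
    below = [item for item in scored_segments if item[1] < threshold]
    # S433: ascending order, so the most similar candidates come first
    below.sort(key=lambda item: item[1])
    # S434: the first N candidates are taken as second abnormal data segments
    return below[:top_n]
```

With `top_n` set to the number of candidates, the function reduces to the plain threshold test of step S431.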
In an embodiment, the optimal value of N may differ for different first abnormal data segments and different first data search spaces. For example, if N is too small, some abnormal data segments will be missed and left unlabeled; if N is too large, segments with relatively low similarity may be identified as abnormal, which lowers the precision. Therefore, the value of N needs to be selected according to the actual application. To select N more accurately, a precision-recall curve can be constructed and the AUC value calculated to obtain the optimal value of N.
It is worth noting that precision is the proportion of the segments labeled as abnormal that are labeled correctly; recall is the proportion of the manually labeled abnormal data segment samples that are correctly labeled; and AUC (Area Under Curve) is a model evaluation metric. Put simply, if a pair of samples (one positive and one negative) is drawn at random and the trained classifier is used to predict both, AUC is the probability that the positive sample receives a higher predicted probability than the negative sample.
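To make the trade-off in choosing N concrete, the sketch below computes precision and recall for every candidate N against a set of manually labeled abnormal segments. The names `precision_recall` and `sweep_n`, and the use of segment identifiers, are assumptions of this sketch; it stops at the precision-recall points and does not compute the AUC itself.

```python
def precision_recall(predicted, labeled):
    """Precision and recall of predicted abnormal segments against manual labels."""
    predicted, labeled = set(predicted), set(labeled)
    hits = len(predicted & labeled)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(labeled) if labeled else 0.0
    return precision, recall

def sweep_n(ranked_ids, labeled):
    """Return (N, precision, recall) for each N from 1 to len(ranked_ids).

    ranked_ids: segment ids ordered from most to least similar.
    """
    return [
        (n, *precision_recall(ranked_ids[:n], labeled))
        for n in range(1, len(ranked_ids) + 1)
    ]
```

For a ranking `["s1", "s2", "s3", "s4"]` with `s1` and `s3` manually labeled as abnormal, N = 1 gives full precision but half recall, while N = 4 gives full recall but half precision, which is the behavior described above for values of N that are too small or too large.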
In addition, referring to FIG. 7, in an embodiment, step S100 may include, but is not limited to, the following steps:
Step S110: obtain multiple data sequences to be tested;
Step S120: cluster the multiple data sequences to be tested to obtain target data classes;
Step S130: determine one target data sequence from each target data class.
In an embodiment, the number of data sequences to be tested collected from a network is very large, so manually determining the first abnormal data segment in every data sequence to be tested involves a heavy workload. In addition, a data sequence to be tested may have been collected over too short a period to contain many similar abnormal data, in which case obtaining the first abnormal data segment from such a sequence is difficult. However, for a single data indicator, many data sequences to be tested can be collected in the network according to the different resource objects bound to the indicator. For example, a medium-scale network may contain tens of thousands of port resources; taking the port traffic indicator as an example, tens of thousands of data sequences to be tested can be collected, and these sequences often resemble one another to some degree. For instance, the access port traffic time series of a base station deployed at school A and that of a base station deployed at school B will be largely similar, because the students of the two schools have similar daily routines. In similar data sequences to be tested, the features of the abnormal data also have something in common. On this basis, the multiple acquired data sequences to be tested can first be clustered to obtain target data classes, and one target data sequence can then be determined from each target data class, which provides the necessary basis for the subsequent steps.
In an embodiment, clustering the multiple data sequences to be tested may yield one target data class or multiple target data classes, depending on the similarity of the data sequences to be tested. For example, if all of the data sequences to be tested are relatively similar to one another, they can all be grouped into a single target data class; if only some of the data sequences to be tested are similar to one another, the data sequences can be divided into multiple target data classes, each of which includes some of the data sequences to be tested.
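Steps S120 and S130 could be sketched as follows. The source does not name a clustering algorithm or say how the target data sequence is chosen within a class, so both choices here are assumptions: a simple greedy clustering with a distance `radius`, and selection of the medoid (the member closest overall to the other members) as the target data sequence. The sequences are assumed to be equal-length lists of numbers.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_sequences(sequences, radius):
    """Greedy clustering: a sequence joins the first class whose first
    member lies within `radius`; otherwise it starts a new class.
    Returns a list of classes, each a list of sequence indices."""
    classes = []
    for idx, seq in enumerate(sequences):
        for members in classes:
            if distance(sequences[members[0]], seq) <= radius:
                members.append(idx)
                break
        else:
            classes.append([idx])
    return classes

def pick_target(sequences, members):
    """Choose the medoid of a class as its target data sequence."""
    return min(
        members,
        key=lambda i: sum(distance(sequences[i], sequences[j]) for j in members),
    )
```

Depending on `radius`, the same input yields one class or several, which mirrors the statement above that the number of target data classes depends on how similar the data sequences are.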
In addition, referring to FIG. 8, in an embodiment, the data processing method may further include the following steps:
Step S600: obtain a second data search space in each of the remaining data sequences to be tested in each target data class;
Step S700: use the first abnormal data segment in the target data sequence to obtain second abnormal data segments in the second data search spaces of the remaining data sequences to be tested.
In an embodiment, once a target data sequence has been determined in each target data class, the second abnormal data segments in that target data sequence can be obtained and labeled through the steps of the above embodiments. For the specific technical principles and technical effects, reference may be made to the relevant description in the above embodiments, which is not repeated here.
In an embodiment, because the data sequences to be tested in each target data class are similar to one another to some degree, the first abnormal data segment obtained from the target data sequence can also be used to obtain and label second abnormal data segments in the remaining data sequences to be tested in the same target data class. Accordingly, a second data search space can first be obtained in each of the remaining data sequences to be tested in each target data class, and the first abnormal data segment in the target data sequence can then be used to obtain second abnormal data segments in those second data search spaces. Because the first abnormal data segment of the target data sequence alone is sufficient for this, the step of obtaining a separate first abnormal data segment for each of the remaining data sequences to be tested is avoided. The second abnormal data segments in each data sequence to be tested can therefore be obtained more simply and efficiently, which improves the efficiency of labeling abnormal data across multiple time-series indicator data.
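The reuse of one first abnormal data segment across the remaining sequences of a class can be sketched as follows. As assumptions of this sketch, each second data search space is given as a list of numbers keyed by a sequence name, candidate segments are sliding windows of the same length as the first abnormal data segment, and the Euclidean distance with a fixed `threshold` is the similarity measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_class(first_abnormal, search_spaces, threshold):
    """Label matching segments in every remaining sequence of a class.

    search_spaces: {sequence_name: list of values}.
    Returns {sequence_name: [(start, end), ...]} with end exclusive.
    """
    width = len(first_abnormal)
    labels = {}
    for name, space in search_spaces.items():
        matches = []
        for start in range(len(space) - width + 1):
            window = space[start:start + width]
            if euclidean(first_abnormal, window) < threshold:
                matches.append((start, start + width))
        labels[name] = matches
    return labels
```

Only one manually identified segment is needed per class; sequences with no sufficiently similar window simply receive an empty list.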
It is worth noting that the second data search space in this embodiment and the first data search space in the above embodiments are technical features of the same type. They differ only in what they belong to: the first data search space belongs to the target data sequence, whereas the second data search space belongs to the remaining data sequences to be tested in the same target data class. To avoid repetition, the second data search space is not described in detail here; for the relevant explanation, reference may be made to the explanation of the first data search space in the above embodiments.
It is also worth noting that step S700 in this embodiment is similar to step S400 in the embodiment shown in FIG. 2, with similar technical principles and technical effects. The two differ only in what they operate on: step S400 operates on the first data search space of the target data sequence, whereas step S700 operates on the second data search spaces of the remaining data sequences to be tested in the same target data class. To avoid repetition, step S700 is not described in detail here; for the relevant explanation, reference may be made to the explanation of step S400 in the above embodiments.
In addition, referring to FIG. 9, in an embodiment, step S600 includes, but is not limited to, the following steps:
Step S610: obtain, in each target data class, second abnormal characteristic values of the remaining data sequences to be tested;
Step S620: determine, according to the second abnormal characteristic values, the second data positions corresponding to the second abnormal characteristic values in the remaining data sequences to be tested;
Step S630: obtain, according to the second data positions, the second data search spaces of the remaining data sequences to be tested.
In an embodiment, the second abnormal characteristic value and the second data position in this embodiment are technical features of the same type as the first abnormal characteristic value and the first data position in the above embodiments, respectively. They differ only in what they belong to: the first abnormal characteristic value and the first data position belong to the target data sequence, whereas the second abnormal characteristic value and the second data position belong to the remaining data sequences to be tested in the same target data class. To avoid repetition, the second abnormal characteristic value and the second data position are not described in detail here; for the relevant explanation, reference may be made to the explanation of the first abnormal characteristic value and the first data position in the above embodiments.
In an embodiment, steps S610, S620, and S630 in this embodiment are similar to steps S310, S320, and S330 in the embodiment shown in FIG. 3, with similar technical principles and technical effects. They differ only in what they operate on: steps S310, S320, and S330 operate on the target data sequence, whereas steps S610, S620, and S630 operate on the remaining data sequences to be tested in the same target data class. To avoid repetition, steps S610, S620, and S630 are not described in detail here; for the relevant explanation, reference may be made to the explanation of steps S310, S320, and S330 in the above embodiments.
In addition, referring to FIG. 10, in an embodiment, step S610 includes, but is not limited to, the following steps:
Step S611: obtain, in each target data class, second baseline prediction data of the remaining data sequences to be tested;
Step S612: obtain the second abnormal characteristic values of the remaining data sequences to be tested according to the deviation between the second baseline prediction data and the data in the remaining data sequences to be tested.
In an embodiment, the second baseline prediction data in this embodiment and the first baseline prediction data in the above embodiments are technical features of the same type. They differ only in what they belong to: the first baseline prediction data belongs to the target data sequence, whereas the second baseline prediction data belongs to the remaining data sequences to be tested in the same target data class. To avoid repetition, the second baseline prediction data is not described in detail here; for the relevant explanation, reference may be made to the explanation of the first baseline prediction data in the above embodiments.
In an embodiment, steps S611 and S612 in this embodiment are similar to steps S311 and S312 in the embodiment shown in FIG. 4, with similar technical principles and technical effects. They differ only in what they operate on: steps S311 and S312 operate on the target data sequence, whereas steps S611 and S612 operate on the remaining data sequences to be tested in the same target data class. To avoid repetition, steps S611 and S612 are not described in detail here; for the relevant explanation, reference may be made to the explanation of steps S311 and S312 in the above embodiments.
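A rough sketch of how steps S611, S612, S620, and S630 might fit together is given below. This passage does not specify the baseline prediction method or how a search space is derived from a data position, so the following are all assumptions made for illustration: a rolling median serves as the baseline, the absolute deviation from the baseline serves as the abnormal characteristic value, positions whose deviation exceeds `limit` are the data positions, and each search space is a window of `margin` points on either side of a position.

```python
from statistics import median

def rolling_median(values, half_window):
    """Baseline estimate: median of the points within half_window of each index."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(median(values[lo:hi]))
    return out

def search_spaces(values, half_window, limit, margin):
    """Return (start, end) index ranges, end exclusive, around abnormal positions."""
    baseline = rolling_median(values, half_window)                 # S611
    deviation = [abs(v - b) for v, b in zip(values, baseline)]     # S612
    positions = [i for i, d in enumerate(deviation) if d > limit]  # S620
    return [                                                       # S630
        (max(0, p - margin), min(len(values), p + margin + 1))
        for p in positions
    ]
```

Restricting the later similarity search to these ranges, rather than to the whole sequence, is what keeps the comparison against the first abnormal data segment inexpensive.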
In addition, referring to FIG. 11, in an embodiment, step S700 includes, but is not limited to, the following steps:
Step S710: determine a fourth data segment in the second data search space of each of the remaining data sequences to be tested;
Step S720: perform a similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in each of the remaining data sequences to be tested, to obtain a second similarity metric value corresponding to the fourth data segment;
Step S730: determine, according to the second similarity metric value, that the corresponding fourth data segment in each of the remaining data sequences to be tested is a second abnormal data segment of that data sequence.
In an embodiment, the fourth data segment and the second similarity metric value in this embodiment are technical features of the same type as the third data segment and the first similarity metric value in the above embodiments, respectively. They differ only in what they belong to: the third data segment belongs to the first data search space of the target data sequence, and the first similarity metric value corresponds to the third data segment; the fourth data segment belongs to the second data search space of the remaining data sequences to be tested in the same target data class, and the second similarity metric value corresponds to the fourth data segment. To avoid repetition, the fourth data segment and the second similarity metric value are not described in detail here; for the relevant explanation, reference may be made to the explanation of the third data segment and the first similarity metric value in the above embodiments.
In an embodiment, steps S710, S720, and S730 in this embodiment are similar to steps S410, S420, and S430 in the embodiment shown in FIG. 5, with similar technical principles and technical effects. They differ only in what they operate on: steps S410, S420, and S430 operate on the first data search space of the target data sequence, whereas steps S710, S720, and S730 operate on the second data search spaces of the remaining data sequences to be tested in the same target data class. To avoid repetition, steps S710, S720, and S730 are not described in detail here; for the relevant explanation, reference may be made to the explanation of steps S410, S420, and S430 in the above embodiments.
In addition, in an embodiment, step S730 includes, but is not limited to, the following step:
Step S731: when the second similarity metric value is less than a preset threshold, determine that the corresponding fourth data segment in the remaining data sequence to be tested is a second abnormal data segment of that data sequence.
In an embodiment, the second similarity metric value represents the degree of similarity between the first abnormal data segment and the fourth data segment: the smaller the second similarity metric value, the more similar the two segments are. Therefore, when the second similarity metric value is less than the preset threshold, the first abnormal data segment and the fourth data segment can be considered highly similar, and the fourth data segment corresponding to that second similarity metric value can be determined to be a second abnormal data segment.
In an embodiment, the preset threshold may be selected according to the similarity measurement algorithm used; for example, different preset thresholds may be used for the Euclidean distance and for the DTW algorithm, which is not specifically limited in this embodiment.
In addition, referring to FIG. 12, in an embodiment, when there are two or more fourth data segments, step S730 may include, but is not limited to, the following steps:
Step S732: obtain, for each of the remaining data sequences to be tested, the second similarity metric values that are less than the preset threshold;
Step S733: sort the second similarity metric values that are less than the preset threshold in ascending order, so as to reorder the corresponding fourth data segments in each of the remaining data sequences to be tested;
Step S734: determine, in each of the remaining data sequences to be tested, the first N fourth data segments as second abnormal data segments, where N is greater than or equal to 1.
In an embodiment, steps S732, S733, and S734 in this embodiment are similar to steps S432, S433, and S434 in the embodiment shown in FIG. 6, with similar technical principles and technical effects. They differ only in what they operate on: steps S432, S433, and S434 operate on the target data sequence, whereas steps S732, S733, and S734 operate on the remaining data sequences to be tested in the same target data class. To avoid repetition, steps S732, S733, and S734 are not described in detail here; for the relevant explanation, reference may be made to the explanation of steps S432, S433, and S434 in the above embodiments.
In addition, referring to FIG. 13, in an embodiment, step S120 may include, but is not limited to, the following steps:
Step S121: perform data preprocessing on each of the multiple data sequences to be tested to obtain multiple first preprocessed data sequences;
Step S122: perform baseline extraction on each of the multiple first preprocessed data sequences to obtain multiple second preprocessed data sequences;
Step S123: cluster the multiple second preprocessed data sequences by similarity to obtain the target data classes.
In an embodiment, when multiple data sequences to be tested need to be clustered, data preprocessing may first be performed on each of them to obtain multiple first preprocessed data sequences; baseline extraction is then performed on each first preprocessed data sequence to obtain multiple second preprocessed data sequences; and the second preprocessed data sequences are then clustered by similarity to obtain the target data classes. Once the data sequences to be tested have been clustered into target data classes, labeling abnormal data segments for each individual data sequence can be replaced by labeling abnormal data segments for each target data class, which reduces the complexity and duration of the processing and thereby improves the efficiency of labeling abnormal data.
In an embodiment, performing baseline extraction on the first preprocessed data sequences smooths out the abnormal parts and the noise in the data sequences to be tested, which improves the accuracy of the similarity measurement between the data sequences to be tested.
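The effect of baseline extraction on the similarity measurement can be seen in a small sketch. The baseline method is not specified in this passage, so a rolling median is assumed here purely for illustration, together with the Euclidean distance; the names `extract_baseline` and `distance` are likewise assumptions.

```python
import math
from statistics import median

def extract_baseline(values, half_window=1):
    """Assumed baseline: rolling median over a window centered on each point."""
    return [
        median(values[max(0, i - half_window):i + half_window + 1])
        for i in range(len(values))
    ]

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two sequences with the same underlying pattern; one carries a single spike.
clean = [1, 2, 3, 2, 1, 2, 3, 2, 1]
spiky = [1, 2, 3, 9, 1, 2, 3, 2, 1]

raw_gap = distance(clean, spiky)
baseline_gap = distance(extract_baseline(clean), extract_baseline(spiky))
```

In this example the distance between the extracted baselines is much smaller than the distance between the raw sequences, so the two sequences are more likely to fall into the same target data class despite the isolated anomaly.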
值得注意的是,本实施例中步骤S122的进行基线提取处理,与上述图4所示实施例中利用基线预测方法获取基线预测数据的步骤,具有相类似的技术原理,针对本实施例中步骤S122的进行基线提取处理的相关解释说明,可以参照上述图4所示实施例中利用基线预测方法获取基线预测数据的相关解释说明,因此此处不再赘述。It is worth noting that the baseline extraction processing in step S122 in this embodiment has a similar technical principle to the step of obtaining baseline prediction data using the baseline prediction method in the embodiment shown in FIG. For the relevant explanations of performing the baseline extraction processing in S122, reference may be made to the relevant explanations of using the baseline prediction method to obtain baseline prediction data in the embodiment shown in FIG.
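As a minimal sketch of baseline extraction, the following assumes a centered moving average (the specific example later in this description names the moving average method). The function name `moving_average_baseline`, the default window of 5, and the shrinking window at the sequence edges are illustrative assumptions, not details taken from the embodiment.

```python
from typing import List


def moving_average_baseline(series: List[float], window: int = 5) -> List[float]:
    """Return a smoothed baseline of the same length as `series`.

    Assumption: a centered window is used, and it shrinks near the edges
    so that every position still receives a value.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    baseline = []
    for i in range(len(series)):
        lo = max(0, i - half)                 # first index inside the window
        hi = min(len(series), i + half + 1)   # one past the last index
        baseline.append(sum(series[lo:hi]) / (hi - lo))
    return baseline
```

A single spike such as the value 10 in `[1, 1, 1, 10, 1, 1, 1]` is spread over its neighbors and reduced in height, which is the smoothing effect the embodiment relies on before similarity is measured.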
In addition, referring to FIG. 14, in an embodiment, step S121 may include, but is not limited to, the following steps:
Step S1211: perform missing value filling on each of the multiple data sequences to be tested to obtain multiple filled data sequences;
Step S1212: perform data standardization on each of the multiple filled data sequences to obtain multiple first preprocessed data sequences.
In an embodiment, data sequences to be tested that are collected from the network may, for various reasons, contain missing values to varying degrees. These missing values not only cause the data sequences to be tested to differ in length, which makes some similarity measurement algorithms difficult to apply, but also affect the accuracy of baseline extraction. To address these problems, this embodiment first fills the missing values to obtain filled data sequences, and then standardizes the filled data sequences to obtain the first preprocessed data sequences.
In an embodiment, linear interpolation may be used to fill missing values. Linear interpolation smooths the waveform of the data sequence to be tested, which facilitates baseline extraction. For example, for time-series indicator data, the position of a missing value can be determined from the continuity of the timestamps. Once the position is determined, the value to be filled can be obtained from the data before and after that position; for example, the average of the preceding and following data points may be used. Those skilled in the art will understand that linear interpolation is a commonly used algorithm in the art, so its principle is not repeated here.
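The missing value filling described above can be sketched as follows. This is only an illustration under stated assumptions: missing samples are represented as `None`, the function name `fill_missing_linear` is hypothetical, and gaps at the very start or end of a sequence are filled with the nearest known value, a case the embodiment does not address.

```python
from typing import List, Optional


def fill_missing_linear(series: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbors."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series contains no known values")
    filled = list(series)
    # Assumption: leading and trailing gaps copy the nearest known value.
    for i in range(0, known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: interpolate between the known points on either side.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled
```

For a gap of a single sample, the interpolated value equals the average of the preceding and following data points, which matches the example given in the embodiment.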
In an embodiment, standardizing the filled data sequences converts and maps the data sequences to be tested onto a specific range, which removes dimensional (unit) differences between different data sequences to be tested so that their similarity can be compared directly. This embodiment may use the Z-Score method for data standardization, calculated as follows:
$$x'_i = \frac{x_i - \bar{x}}{\sigma}$$
where $x'_i$ is the first preprocessed data sequence, $x_i$ is the data sequence to be tested, $\bar{x}$ is the mean of the data sequence to be tested, and $\sigma$ is the standard deviation of the data sequence to be tested.
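A minimal sketch of Z-Score standardization is given below. The function name `z_score` is hypothetical, the population standard deviation (dividing by the sequence length) is an assumption, and returning zeros for a constant sequence is an assumed convention, since the embodiment does not cover a zero standard deviation.

```python
import math
from typing import List


def z_score(series: List[float]) -> List[float]:
    """Standardize a sequence to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(variance)
    if std == 0:
        # Assumption: a constant sequence maps to all zeros.
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]
```

After this step, two sequences with the same shape but different units or magnitudes yield the same standardized values, so a distance between them reflects shape rather than scale.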
In addition, in an embodiment, clustering the multiple second preprocessed data sequences by similarity in step S123 may include, but is not limited to, the following step:
Step S1231: cluster the multiple second preprocessed data sequences by similarity using the DBSCAN algorithm, where the parameters of the DBSCAN algorithm include a distance function, a neighborhood count threshold, and a neighborhood distance threshold, and the results of the DBSCAN algorithm include the number of classes and the abnormal proportion.
Those skilled in the art will understand that DBSCAN is a commonly used clustering algorithm that does not require the number of cluster centers to be determined in advance. Its key parameters are the distance function, the neighborhood count threshold, and the neighborhood distance threshold, and its results include the number of classes and the abnormal proportion.
In an embodiment, the Euclidean distance may be used as the distance function, and the neighborhood count threshold may be set to 4. The neighborhood distance threshold needs to be estimated dynamically from the data set, and it has a significant effect on the clustering result.
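To make the roles of the three parameters concrete, the following is a compact, self-contained sketch of DBSCAN with a Euclidean distance function. It is a textbook-style illustration, not the embodiment's implementation. The names `dbscan` and `summarize` are hypothetical; counting a point as a member of its own neighborhood and interpreting the abnormal proportion as the fraction of points labeled as noise are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOISE = -1  # label for points that belong to no cluster


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int = 4) -> List[int]:
    """Return one cluster label per point; NOISE marks outliers."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n

    def region(i: int) -> List[int]:
        # Assumption: the neighborhood of a point includes the point itself.
        return [j for j in range(n) if euclidean(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


def summarize(labels: List[int]) -> Tuple[int, float]:
    """Return (number of classes, proportion of points labeled as noise)."""
    n_clusters = len({label for label in labels if label != NOISE})
    return n_clusters, labels.count(NOISE) / len(labels)
```

Here each element of `points` stands for one second preprocessed data sequence treated as a vector, `min_pts` corresponds to the neighborhood count threshold, and `eps` corresponds to the neighborhood distance threshold.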
In addition, referring to FIG. 15, in an embodiment, the neighborhood distance threshold may be obtained by a heuristic algorithm, where the heuristic algorithm includes, but is not limited to, the following steps:
Step S810: calculate the pairwise similarity between the multiple second preprocessed data sequences using the distance function to obtain similarity matrix data;
Step S820: calculate k-dist distances based on the similarity matrix data to obtain a k-dist sequence;
Step S830: obtain an initial distance threshold parameter based on the k-dist sequence;
Step S840: adjust the initial distance threshold parameter to obtain the neighborhood distance threshold.
In an embodiment, the k-dist distance is the distance between a data object and its k-th nearest object. To determine a suitable neighborhood distance threshold, the pairwise similarity between the multiple second preprocessed data sequences may first be calculated with a distance function, such as the Euclidean distance function, to form the similarity matrix data. The k-dist distances are then calculated from the similarity matrix data to obtain the k-dist sequence, an initial distance threshold parameter is obtained from the k-dist sequence, and the initial distance threshold parameter is adjusted to obtain a suitable neighborhood distance threshold. Once a suitable neighborhood distance threshold has been obtained by this heuristic algorithm, it can be applied in the above embodiment, in which the DBSCAN algorithm clusters the multiple second preprocessed data sequences by similarity, to obtain the target data classes.
In an embodiment, before step S810 is performed, initial thresholds such as a maximum neighborhood distance threshold, a minimum length threshold, a slope threshold, and a slope difference threshold may be set; after these initial thresholds are set, steps S810, S820, S830, and S840 are performed.
In an embodiment, when step S820 is performed, after the k-dist distance of each k-dist point is calculated from the similarity matrix data, the k-dist distances may be sorted in ascending order, and the k-dist points whose k-dist distance is 0 or exceeds the maximum neighborhood distance threshold are excluded. The remaining k-dist points constitute the k-dist sequence.
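Steps S810 and S820 can be sketched as follows. The function names are hypothetical, and choosing k = 4 by default (the same value as the neighborhood count threshold) is an assumption; the embodiment does not state which k is used.

```python
import math
from typing import List, Sequence


def pairwise_distances(seqs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Step S810: Euclidean distance between every pair of sequences."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    n = len(seqs)
    return [[dist(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


def k_dist_sequence(dist_matrix: List[List[float]], k: int = 4,
                    max_dist: float = float("inf")) -> List[float]:
    """Step S820: sorted k-dist values, excluding zeros and values above max_dist."""
    k_dists = []
    for i, row in enumerate(dist_matrix):
        # Distances from object i to every other object, nearest first.
        others = sorted(d for j, d in enumerate(row) if j != i)
        if len(others) >= k:
            k_dists.append(others[k - 1])   # distance to the k-th nearest object
    return sorted(d for d in k_dists if 0 < d <= max_dist)
```

With five one-dimensional sequences `[0], [1], [2], [3], [10]`, `k = 2`, and `max_dist = 5`, the isolated sequence `[10]` has a k-dist of 8 and is filtered out, leaving only the k-dist values of the densely packed sequences.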
In addition, in an embodiment, step S830 may include, but is not limited to, the following steps:
Step S831: for each k-dist point in the k-dist sequence, calculate its slopes with respect to the two adjacent points before and after it; when both slopes are less than a preset slope threshold and the difference between the two slopes is less than a preset slope difference threshold, determine the current k-dist point as a candidate distance threshold;
Step S832: determine the largest of the candidate distance thresholds as the initial distance threshold parameter.
In an embodiment, when the initial distance threshold parameter is to be obtained from the k-dist sequence, the relatively flat k-dist points in the k-dist sequence may first be determined as candidate distance thresholds. Specifically, for each k-dist point in the k-dist sequence, the slopes with respect to its two adjacent points are calculated. If the slope between the current k-dist point and the preceding adjacent point (the left slope) and the slope between the current k-dist point and the following adjacent point (the right slope) are both less than the preset slope threshold, and the difference between the left slope and the right slope is less than the preset slope difference threshold, the current k-dist point is determined as a candidate distance threshold. When multiple candidate distance thresholds are obtained, they may be sorted in descending order, and the largest candidate distance threshold is taken as the initial distance threshold parameter.
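The candidate selection of steps S831 and S832 can be sketched as below. The function name `initial_distance_threshold` is hypothetical. The sketch assumes that adjacent k-dist points are one unit apart on the horizontal axis, so each slope is simply the difference between neighboring k-dist values, and that the slope difference is compared by absolute value; the embodiment does not specify either detail.

```python
from typing import List, Optional


def initial_distance_threshold(k_dist: List[float],
                               slope_threshold: float,
                               slope_diff_threshold: float) -> Optional[float]:
    """Pick the largest k-dist value lying on a flat part of the sorted curve.

    `k_dist` must already be sorted in ascending order. Returns None when
    no point satisfies both slope conditions.
    """
    candidates = []
    for i in range(1, len(k_dist) - 1):
        left = k_dist[i] - k_dist[i - 1]    # left slope (unit spacing assumed)
        right = k_dist[i + 1] - k_dist[i]   # right slope (unit spacing assumed)
        flat = left < slope_threshold and right < slope_threshold
        smooth = abs(right - left) < slope_diff_threshold
        if flat and smooth:
            candidates.append(k_dist[i])
    return max(candidates) if candidates else None
```

For the sequence `[1.0, 1.1, 1.2, 1.3, 3.0, 6.0]` with a slope threshold of 0.5 and a slope difference threshold of 0.05, the points 1.1 and 1.2 qualify, while 1.3 is rejected because the curve rises sharply immediately after it, so 1.2 is returned.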
In addition, in an embodiment, step S840 may include, but is not limited to, the following steps:
Step S841: obtain a step length;
Step S842: adjust the initial distance threshold parameter according to the step length to obtain an adjusted distance threshold; when the number of classes decreases, determine the adjusted distance threshold obtained in the previous adjustment as the neighborhood distance threshold.
In an embodiment, after the initial distance threshold parameter is determined, it may be further optimized. When optimizing the initial distance threshold parameter, the abnormal proportion should be reduced as much as possible while the number of classes is kept unchanged. As the value of the initial distance threshold parameter increases, the abnormal proportion keeps decreasing, but the number of classes may also decrease. The value can therefore be increased step by step to find the best neighborhood distance threshold: starting from the initial distance threshold parameter, the number of classes and the abnormal proportion are recalculated each time one step length is added, and the increase stops when the number of classes decreases. The adjusted distance threshold obtained in the previous adjustment is then taken as the best neighborhood distance threshold.
In an embodiment, the step length may be set according to an empirical value or according to the candidate distance thresholds. For example, when it is set according to the candidate distance thresholds, the step length may be set to one tenth of the difference between the largest and the smallest candidate distance thresholds. This embodiment does not specifically limit this.
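The stepwise adjustment of step S842 can be sketched as a simple loop. To keep the example independent of any particular clustering implementation, the clustering run is passed in as a callback `run_clustering(eps)` that returns the number of classes and the abnormal proportion; this callback, the function name `tune_distance_threshold`, and the `max_steps` safety limit are all assumptions introduced for illustration.

```python
from typing import Callable, Tuple


def tune_distance_threshold(initial_eps: float,
                            step: float,
                            run_clustering: Callable[[float], Tuple[int, float]],
                            max_steps: int = 100) -> float:
    """Increase eps step by step until the number of classes drops.

    Returns the last eps for which the number of classes had not dropped
    below the number obtained with `initial_eps`.
    """
    base_clusters, _ = run_clustering(initial_eps)
    best = initial_eps
    for _ in range(max_steps):          # assumption: guard against endless loops
        candidate = best + step
        n_clusters, _ = run_clustering(candidate)
        if n_clusters < base_clusters:  # number of classes decreased: stop
            break
        best = candidate                # keep the larger threshold
    return best
```

Following the embodiment, `step` could be set to one tenth of the difference between the largest and smallest candidate distance thresholds, or to an empirical value.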
To better explain the heuristic algorithm provided in the foregoing embodiment, a specific example is described in detail below.
In a specific example, as shown in FIG. 16, the heuristic algorithm includes the following steps:
Step S901: set thresholds.
In this step, initial thresholds such as the maximum neighborhood distance threshold, the minimum length threshold, the slope threshold, and the slope difference threshold are set.
Step S902: calculate the sequence similarity matrix.
In this step, the similarity between each pair of data sequences is calculated using the distance function to form the similarity matrix data.
Step S903: calculate and sort the k-dist distances.
In this step, the k-dist distance of each k-dist point is calculated from the similarity matrix data, and the distances are sorted in ascending order.
Step S904: filter by the maximum distance threshold.
In this step, the k-dist points whose k-dist distance is 0 or exceeds the maximum neighborhood distance threshold are excluded.
Step S905: take the values of the k-dist sequence in order.
Step S906: determine whether the left slope and the right slope are both less than the slope threshold.
In this step, the slopes between each k-dist point and its two adjacent points are calculated. If the slope between the current k-dist point and the preceding adjacent point (the left slope) and the slope between the current k-dist point and the following adjacent point (the right slope) are both less than the preset slope threshold, step S907 is performed; otherwise, the process returns to step S905.
Step S907: determine whether the difference between the left slope and the right slope is less than the slope difference threshold.
In this step, if the difference between the left slope and the right slope is less than the slope difference threshold, step S908 is performed; otherwise, the process returns to step S905.
Step S908: determine the current k-dist point as a candidate distance threshold, and repeat steps S905 to S907; after all candidate distance thresholds are obtained, perform step S909.
Step S909: sort the candidate thresholds in descending order and take the largest candidate threshold as the initial distance threshold parameter.
Step S910: run the clustering algorithm to obtain the number of classes and the abnormal proportion.
Step S911: add one step length to the initial distance threshold parameter and run the clustering algorithm.
Step S912: determine whether the number of classes has decreased; if so, perform step S913; otherwise, return to step S911.
Step S913: determine the previous distance threshold as the best distance threshold.
In addition, in an embodiment, step S130 may include, but is not limited to, the following step:
Step S131: in each target data class, calculate, for each data sequence to be tested, the average sum of its distances to the remaining data sequences to be tested, and determine the data sequence to be tested with the smallest average sum of distances as the target data sequence.
In an embodiment, after the multiple data sequences to be tested are clustered into target data classes, a core data sequence representing each target data class may be determined; that is, one target data sequence is determined from each target data class. To determine the target data sequence of a target data class, the average sum of distances between each data sequence to be tested in that class and the other data sequences to be tested in that class is calculated first, and the data sequence to be tested with the smallest average sum of distances is then selected as the core data sequence representing that class, that is, as the target data sequence of that target data class.
In an embodiment, the target data sequence can be determined by the following formula:
$$X_{core} = \arg\min_{X_i} \frac{1}{n-1} \sum_{j \neq i} \mathrm{euclidean}(X_i, X_j)$$
where $X_i$ and $X_j$ denote different data sequences to be tested in the same target data class, $n$ is the number of data sequences to be tested in that class, and euclidean() denotes the Euclidean distance.
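The selection of the core sequence in step S131 amounts to choosing the member of a class with the smallest average Euclidean distance to the other members. A minimal sketch follows; the function name `core_sequence_index`, returning an index rather than the sequence itself, and resolving ties in favor of the earliest sequence are illustrative assumptions.

```python
import math
from typing import Sequence


def core_sequence_index(cluster: Sequence[Sequence[float]]) -> int:
    """Return the index of the sequence closest, on average, to all others."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(cluster)
    if n == 0:
        raise ValueError("cluster must not be empty")
    if n == 1:
        return 0
    best_index, best_avg = 0, float("inf")
    for i in range(n):
        avg = sum(dist(cluster[i], cluster[j]) for j in range(n) if j != i) / (n - 1)
        if avg < best_avg:              # assumption: ties keep the earliest index
            best_index, best_avg = i, avg
    return best_index
```

Because the chosen sequence is an actual member of the class rather than a computed average, an abnormal data segment template marked on it corresponds to real observed data.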
To better explain the data processing method provided in the above embodiments, a specific example is described in detail below.
In a specific example, FIG. 17 is the main flow diagram of the data processing method provided in this example. Based on the main flow diagram shown in FIG. 17, the data processing method includes the following steps:
First, the data sequences to be tested that are collected from the network are preprocessed, mainly by filling missing values and standardizing the data, to form a time-series data set of equal-length sequences;
Then, the moving average method is used to extract a baseline from each data sequence to be tested, forming a baseline data set;
Next, clustering is performed on the baseline data set using the DBSCAN algorithm with the Euclidean distance as the metric, and the heuristic algorithm is used to tune the threshold parameter automatically;
Then, for each class in the clustering result, one core sequence is determined by a distance metric;
Next, an abnormal data segment template is determined for each core sequence;
Then, the data search space is generated automatically for each sequence;
Next, for each sequence, the abnormal data segment template of the core sequence of the class to which the sequence belongs is obtained automatically, and a search for similar abnormal segments is performed in the data search space of that sequence to obtain the N most similar abnormal segments, which are then labeled as abnormal;
Finally, the labeling results of the similar abnormal segments of each sequence are manually checked and locally corrected to form the final labeled abnormal data.
In addition, another embodiment of the present invention provides a device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor and the memory may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that the device in this embodiment may include the system architecture platform of the embodiment shown in FIG. 1. The device in this embodiment and the system architecture platform of the embodiment shown in FIG. 1 belong to the same inventive concept, so they have the same implementation principle and technical effects, which are not detailed again here.
The non-transitory software programs and instructions required to implement the data processing method of the foregoing embodiments are stored in the memory and, when executed by the processor, perform the data processing method of the foregoing embodiments, for example, method steps S100 to S500 in FIG. 2, method steps S310 to S330 in FIG. 3, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, method steps S110 to S130 in FIG. 7, method steps S600 to S700 in FIG. 8, method steps S610 to S630 in FIG. 9, method steps S611 to S612 in FIG. 10, method steps S710 to S730 in FIG. 11, method steps S732 to S734 in FIG. 12, method steps S121 to S123 in FIG. 13, method steps S1211 to S1212 in FIG. 14, method steps S810 to S840 in FIG. 15, and method steps S901 to S913 in FIG. 16, all described above.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, another embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or controller, for example, by a processor in the foregoing device embodiment, the processor is caused to perform the data processing method of the foregoing embodiments, for example, method steps S100 to S500 in FIG. 2, method steps S310 to S330 in FIG. 3, method steps S311 to S312 in FIG. 4, method steps S410 to S430 in FIG. 5, method steps S432 to S434 in FIG. 6, method steps S110 to S130 in FIG. 7, method steps S600 to S700 in FIG. 8, method steps S610 to S630 in FIG. 9, method steps S611 to S612 in FIG. 10, method steps S710 to S730 in FIG. 11, method steps S732 to S734 in FIG. 12, method steps S121 to S123 in FIG. 13, method steps S1211 to S1212 in FIG. 14, method steps S810 to S840 in FIG. 15, and method steps S901 to S913 in FIG. 16, all described above.
The embodiments of the present invention include: acquiring a target data sequence; acquiring a first abnormal data segment in the target data sequence; acquiring a first data search space in the target data sequence; acquiring, in the first data search space and according to the first abnormal data segment, a second abnormal data segment corresponding to the first abnormal data segment; and labeling the second abnormal data segment. In the solution provided by the embodiments of the present invention, the first abnormal data segment and the first data search space are acquired in the target data sequence, so that the first abnormal data segment can serve as an abnormal data segment template. The corresponding second abnormal data segment can then be acquired and labeled in the first data search space according to this template, which achieves the purpose of labeling the other abnormal data segments in the target data sequence. Compared with conventional manual labeling of abnormal data segments, the solution provided by the embodiments of the present invention therefore improves the efficiency of labeling abnormal data and saves human resources and time.
A person of ordinary skill in the art will understand that all or some of the steps of the methods and the systems disclosed above can be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Some embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the technical solution of the present invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present invention.

Claims (22)

  1. 一种数据处理方法,包括,A data processing method, including,
    获取目标数据序列;Obtain the target data sequence;
    获取所述目标数据序列中的第一异常数据段;Acquiring the first abnormal data segment in the target data sequence;
    在所述目标数据序列中获取第一数据搜索空间;Acquiring a first data search space in the target data sequence;
    根据所述第一异常数据段在所述第一数据搜索空间中获取与所述第一异常数据段对应的第二异常数据段;Acquiring a second abnormal data segment corresponding to the first abnormal data segment in the first data search space according to the first abnormal data segment;
    对所述第二异常数据段进行标注。Mark the second abnormal data segment.
  2. The data processing method according to claim 1, wherein acquiring the first data search space in the target data sequence comprises:
    acquiring a first abnormal characteristic value of the target data sequence;
    determining, according to the first abnormal characteristic value, a first data position in the target data sequence corresponding to the first abnormal characteristic value; and
    acquiring the first data search space according to the first data position.
  3. The data processing method according to claim 2, wherein acquiring the first abnormal characteristic value of the target data sequence comprises:
    acquiring first baseline prediction data of the target data sequence; and
    obtaining the first abnormal characteristic value according to a deviation value between the first baseline prediction data and the data in the target data sequence.
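As an illustration of claims 2 and 3 only, the following minimal sketch derives a deviation-based abnormal characteristic value and turns the positions with large deviations into search windows. The rolling-median baseline, the absolute-difference deviation, the `threshold`, and the `margin` around each position are assumptions chosen for the example; the claims do not specify them.

```python
from statistics import median


def rolling_median_baseline(series, window=5):
    """Hypothetical baseline predictor: centered rolling median."""
    half = window // 2
    return [median(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]


def anomaly_features(series, baseline):
    """Deviation of each observed value from its baseline prediction."""
    return [abs(x - b) for x, b in zip(series, baseline)]


def search_space(series, features, threshold, margin=2):
    """Return half-open (start, end) windows around positions whose
    deviation exceeds `threshold`; overlapping windows are merged."""
    windows = []
    for pos, value in enumerate(features):
        if value > threshold:
            start = max(0, pos - margin)
            end = min(len(series), pos + margin + 1)
            if windows and start <= windows[-1][1]:
                windows[-1] = (windows[-1][0], max(end, windows[-1][1]))
            else:
                windows.append((start, end))
    return windows
```

Restricting the later similarity search to these windows, rather than scanning the whole sequence, is what keeps the search space small.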
  4. The data processing method according to claim 1, wherein acquiring, in the first data search space and according to the first abnormal data segment, the second abnormal data segment corresponding to the first abnormal data segment comprises:
    determining a third data segment in the first data search space;
    performing a similarity calculation on the first abnormal data segment and the third data segment to obtain a first similarity metric value corresponding to the third data segment; and
    determining, according to the first similarity metric value, that the corresponding third data segment is the second abnormal data segment.
  5. The data processing method according to claim 4, wherein determining, according to the first similarity metric value, that the corresponding third data segment is the second abnormal data segment comprises:
    when the first similarity metric value is less than a preset threshold, determining that the third data segment corresponding to the first similarity metric value is the second abnormal data segment.
  6. The data processing method according to claim 4, wherein, when there are two or more third data segments, determining, according to the first similarity metric value, that the corresponding third data segment is the second abnormal data segment comprises:
    acquiring the first similarity metric values that are less than a preset threshold;
    sorting the first similarity metric values that are less than the preset threshold in ascending order, so as to adjust the order of the corresponding third data segments; and
    determining the first N third data segments as second abnormal data segments, where N is greater than or equal to 1.
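The matching steps of claims 4 to 6 can be sketched as a sliding-window search, shown here under stated assumptions: candidate segments are all windows of the template's length inside the search space, and the similarity metric is the Euclidean distance, so a smaller value means a closer match. The function names are hypothetical and the claims do not fix the metric.

```python
import math


def euclidean(a, b):
    """Distance-style similarity metric: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_similar_segments(series, template, space, threshold, top_n=1):
    """Score every template-length window inside the half-open index
    range `space`, keep those scoring below `threshold`, sort them in
    ascending order, and return the first `top_n` as
    (start, end, score) tuples."""
    start, end = space
    m = len(template)
    scored = []
    for i in range(start, end - m + 1):
        score = euclidean(template, series[i:i + m])
        if score < threshold:
            scored.append((score, i))
    scored.sort()
    return [(i, i + m, score) for score, i in scored[:top_n]]
```

For example, with the labeled segment `[5, 6, 5]` as the template, a later window `[5, 7, 5]` scores 1.0 and would be labeled as a second abnormal data segment for any threshold above 1.0.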
  7. The data processing method according to any one of claims 1 to 6, wherein acquiring the target data sequence comprises:
    acquiring a plurality of data sequences to be tested;
    clustering the plurality of data sequences to be tested to obtain target data classes; and
    determining one target data sequence from each of the target data classes.
  8. The data processing method according to claim 7, further comprising:
    acquiring a second data search space in each of the remaining data sequences to be tested in each target data class; and
    using the first abnormal data segment in the target data sequence to acquire the second abnormal data segment in the second data search space of each of the remaining data sequences to be tested.
  9. The data processing method according to claim 8, wherein acquiring a second data search space in each of the remaining data sequences to be tested in each target data class comprises:
    acquiring, in each target data class, second abnormal characteristic values of the remaining data sequences to be tested;
    determining, according to the second abnormal characteristic values, second data positions in the remaining data sequences to be tested corresponding to the second abnormal characteristic values; and
    acquiring the second data search spaces of the remaining data sequences to be tested according to the second data positions.
  10. The data processing method according to claim 9, wherein acquiring, in each target data class, the second abnormal characteristic values of the remaining data sequences to be tested comprises:
    acquiring, in each target data class, second baseline prediction data of the remaining data sequences to be tested; and
    obtaining the second abnormal characteristic values of the remaining data sequences to be tested according to deviation values between the second baseline prediction data and the data in the remaining data sequences to be tested.
  11. The data processing method according to claim 8, wherein using the first abnormal data segment in the target data sequence to acquire the second abnormal data segment in the second data search space of each of the remaining data sequences to be tested comprises:
    determining a fourth data segment in the second data search space of each of the remaining data sequences to be tested;
    performing a similarity calculation on the first abnormal data segment in the target data sequence and the fourth data segment in each of the remaining data sequences to be tested to obtain a second similarity metric value corresponding to the fourth data segment; and
    determining, according to the second similarity metric value, that the corresponding fourth data segment in each of the remaining data sequences to be tested is the second abnormal data segment in that data sequence.
  12. The data processing method according to claim 11, wherein determining, according to the second similarity metric value, that the corresponding fourth data segment in each of the remaining data sequences to be tested is the second abnormal data segment in that data sequence comprises:
    when the second similarity metric value is less than a preset threshold, determining that the corresponding fourth data segment in the remaining data sequence to be tested is the second abnormal data segment in that data sequence.
  13. The data processing method according to claim 11, wherein, when there are two or more fourth data segments, determining, according to the second similarity metric value, that the corresponding fourth data segment in each of the remaining data sequences to be tested is the second abnormal data segment in that data sequence comprises:
    acquiring, for each of the remaining data sequences to be tested, the second similarity metric values that are less than a preset threshold;
    sorting the second similarity metric values that are less than the preset threshold in ascending order, so as to adjust the order of the corresponding fourth data segments in each of the remaining data sequences to be tested; and
    determining, in each of the remaining data sequences to be tested, the first N fourth data segments as second abnormal data segments, where N is greater than or equal to 1.
  14. The data processing method according to claim 7, wherein clustering the plurality of data sequences to be tested to obtain the target data classes comprises:
    performing data preprocessing on each of the plurality of data sequences to be tested to obtain a plurality of first preprocessed data sequences;
    performing baseline extraction on each of the plurality of first preprocessed data sequences to obtain a plurality of second preprocessed data sequences; and
    clustering the plurality of second preprocessed data sequences by similarity to obtain the target data classes.
  15. The data processing method according to claim 14, wherein performing data preprocessing on each of the plurality of data sequences to be tested to obtain the plurality of first preprocessed data sequences comprises:
    performing missing-value filling on each of the plurality of data sequences to be tested to obtain a plurality of filled data sequences; and
    performing data standardization on each of the plurality of filled data sequences to obtain the plurality of first preprocessed data sequences.
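The two preprocessing steps of claim 15 might look like the sketch below. Forward filling and z-score standardization are assumptions made for illustration; the claim only requires some form of missing-value filling followed by some form of standardization, and missing values are represented here as `None`.

```python
from statistics import mean, pstdev


def fill_missing(series):
    """Forward-fill None entries; leading Nones take the first valid
    value (0.0 if the whole series is missing)."""
    last = next((x for x in series if x is not None), 0.0)
    filled = []
    for x in series:
        last = x if x is not None else last
        filled.append(last)
    return filled


def standardize(series):
    """Z-score standardization; a constant series maps to all zeros."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]
```

Standardizing before clustering keeps sequences with the same shape but different scales close to each other under a distance function.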
  16. The data processing method according to claim 14, wherein clustering the plurality of second preprocessed data sequences by similarity comprises:
    clustering the plurality of second preprocessed data sequences by similarity using the DBSCAN algorithm, wherein the parameters of the DBSCAN algorithm include a distance function, a neighborhood count threshold, and a neighborhood distance threshold, and the results of the DBSCAN algorithm include the number of classes and the anomaly proportion.
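To make the parameters and results named in claim 16 concrete, here is a minimal pure-Python DBSCAN sketch. It is a textbook-style implementation written for illustration, not the applicant's code; `dist`, `eps`, and `min_pts` stand for the distance function, the neighborhood distance threshold, and the neighborhood count threshold, and a point counts as its own neighbor, which is an assumption of this sketch.

```python
import math


def dbscan(items, dist, eps, min_pts):
    """Return one cluster label per item; -1 marks noise."""
    n = len(items)
    neighbors = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


def summarize(labels):
    """Number of classes and proportion of items labeled as noise."""
    n_classes = len({label for label in labels if label != -1})
    return n_classes, labels.count(-1) / len(labels)
```

Calling `summarize(dbscan(sequences, math.dist, eps, min_pts))` yields the two results the claim mentions: the number of classes and the anomaly proportion.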
  17. The data processing method according to claim 16, wherein the neighborhood distance threshold is obtained by a heuristic algorithm, the heuristic algorithm comprising the following steps:
    calculating pairwise similarities between the plurality of second preprocessed data sequences using the distance function to obtain similarity matrix data;
    calculating k-dist distances based on the similarity matrix data to obtain a k-dist sequence;
    obtaining an initial distance threshold parameter based on the k-dist sequence; and
    adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold.
  18. The data processing method according to claim 17, wherein obtaining the initial distance threshold parameter based on the k-dist sequence comprises:
    calculating, for each k-dist point in the k-dist sequence, the slopes between that point and its preceding and following adjacent points, and, when both slopes are less than a preset slope threshold and the difference between the two slopes is less than a preset slope difference threshold, determining the current k-dist point as a candidate distance threshold; and
    determining the candidate distance threshold with the largest value as the initial distance threshold parameter.
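The heuristic of claims 17 and 18 can be sketched as follows. Several details are assumptions made for the example: the k-dist value of a sequence is its distance to its k-th nearest other sequence, the k-dist sequence is sorted in ascending order, points are treated as unit-spaced so a slope is a simple difference, and the slope difference is taken as an absolute value.

```python
def k_dist_sequence(dist_matrix, k):
    """Distance from each item to its k-th nearest other item,
    sorted in ascending order."""
    seq = []
    for i, row in enumerate(dist_matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        seq.append(others[k - 1])
    return sorted(seq)


def initial_eps(k_dist, slope_thr, slope_diff_thr):
    """Largest k-dist value lying on a flat, smoothly changing part of
    the curve, or None if no point qualifies."""
    candidates = []
    for i in range(1, len(k_dist) - 1):
        s_prev = k_dist[i] - k_dist[i - 1]   # slope to the preceding point
        s_next = k_dist[i + 1] - k_dist[i]   # slope to the following point
        if (s_prev < slope_thr and s_next < slope_thr
                and abs(s_next - s_prev) < slope_diff_thr):
            candidates.append(k_dist[i])
    return max(candidates) if candidates else None
```

Picking the largest candidate places the starting threshold just before the k-dist curve bends upward, which is where dense regions give way to sparse ones.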
  19. The data processing method according to claim 17 or 18, wherein adjusting the initial distance threshold parameter to obtain the neighborhood distance threshold comprises:
    acquiring a step length; and
    adjusting the initial distance threshold parameter according to the step length to obtain an adjusted distance threshold, and, when the number of classes decreases, determining the adjusted distance threshold obtained in the previous adjustment step as the neighborhood distance threshold.
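One way to read the adjustment loop of claim 19 is sketched below. The direction of adjustment (increasing the threshold by `step`), the `max_steps` safety limit, and the `count_classes` callback that reruns the clustering for a given threshold are all assumptions introduced for the example.

```python
def tune_eps(initial_eps, step, count_classes, max_steps=100):
    """Increase the distance threshold step by step and return the last
    value reached before the number of classes drops."""
    prev_eps = initial_eps
    prev_count = count_classes(prev_eps)
    for _ in range(max_steps):
        eps = prev_eps + step
        count = count_classes(eps)
        if count < prev_count:
            return prev_eps  # classes started to merge: keep the previous step
        prev_eps, prev_count = eps, count
    return prev_eps
```

In practice `count_classes` would wrap a DBSCAN run and return its number of classes; the loop stops at the largest threshold that does not yet merge two classes.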
  20. The data processing method according to claim 7, wherein determining one target data sequence from each of the target data classes comprises:
    in each target data class, calculating, for each data sequence to be tested, the average sum of the distances between that data sequence and the remaining data sequences to be tested, and determining the data sequence to be tested corresponding to the smallest average sum of distances as the target data sequence.
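The selection rule of claim 20 amounts to choosing a medoid-like representative of each class. A minimal sketch follows, assuming equal-length sequences and the Euclidean distance from `math.dist`; the function name and the default distance are illustrative choices, not taken from the application.

```python
import math


def pick_representative(cluster, dist=math.dist):
    """Index of the sequence whose average distance to all other
    sequences in the class is smallest."""
    def average_distance(i):
        others = [dist(cluster[i], cluster[j])
                  for j in range(len(cluster)) if j != i]
        return sum(others) / len(others) if others else 0.0

    return min(range(len(cluster)), key=average_distance)
```

Only the chosen sequence needs a manually identified first abnormal data segment; the other sequences in the class can then be searched with that segment as the template.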
  21. A device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data processing method according to any one of claims 1 to 20.
  22. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the data processing method according to any one of claims 1 to 20.
PCT/CN2021/086644 2020-05-29 2021-04-12 Data processing method and device, and computer-readable storage medium WO2021238455A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010473617.0 2020-05-29
CN202010473617.0A CN113742387A (en) 2020-05-29 2020-05-29 Data processing method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021238455A1 true WO2021238455A1 (en) 2021-12-02

Family

ID=78724518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086644 WO2021238455A1 (en) 2020-05-29 2021-04-12 Data processing method and device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN113742387A (en)
WO (1) WO2021238455A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013218725A (en) * 2013-06-19 2013-10-24 Hitachi Ltd Abnormality detecting method and abnormality detecting system
CN104636999A (en) * 2015-01-04 2015-05-20 江苏联宏自动化系统工程有限公司 Detection method for building abnormal energy consumption data
CN109882834A (en) * 2019-03-27 2019-06-14 新奥数能科技有限公司 The operation data monitoring method and device of boiler plant
CN110558971A (en) * 2019-08-02 2019-12-13 苏州星空大海医疗科技有限公司 Method for generating countermeasure network electrocardiogram abnormity detection based on single target and multiple targets
CN111061711A (en) * 2019-11-28 2020-04-24 同济大学 Large data flow unloading method and device based on data processing behavior

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114872290B (en) * 2022-05-20 2024-02-06 深圳市信润富联数字科技有限公司 Self-adaptive production abnormality monitoring method for injection molding part
CN114872290A (en) * 2022-05-20 2022-08-09 深圳市信润富联数字科技有限公司 Self-adaptive production abnormity monitoring method for injection molding part
CN115792479A (en) * 2023-02-08 2023-03-14 东营市建筑设计研究院 Intelligent power consumption monitoring method and system for intelligent socket
CN115792479B (en) * 2023-02-08 2023-05-09 东营市建筑设计研究院 Intelligent power consumption monitoring method and system for intelligent socket
CN115858894A (en) * 2023-02-14 2023-03-28 温州众成科技有限公司 Visual big data analysis method
CN116029842A (en) * 2023-03-28 2023-04-28 北京环球医疗救援有限责任公司 Cleaning and denoising method and system for medical insurance big data
CN116029842B (en) * 2023-03-28 2023-06-20 北京环球医疗救援有限责任公司 Cleaning and denoising method and system for medical insurance big data
CN116383190A (en) * 2023-05-15 2023-07-04 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive big data
CN116383190B (en) * 2023-05-15 2023-08-25 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive financial transaction big data
CN116331044A (en) * 2023-05-31 2023-06-27 山东芯演欣电子科技发展有限公司 Charging data storage system for direct-current charging pile
CN116331044B (en) * 2023-05-31 2023-08-04 山东芯演欣电子科技发展有限公司 Charging data storage system for direct-current charging pile
CN116994675A (en) * 2023-09-28 2023-11-03 佳木斯大学 Brocade based on near infrared data Lantern calyx epidermis detection method
CN116994675B (en) * 2023-09-28 2023-12-01 佳木斯大学 Brocade based on near infrared data Lantern calyx epidermis detection method
CN117150233A (en) * 2023-10-30 2023-12-01 广东电网有限责任公司湛江供电局 Power grid abnormal data management method, system, equipment and medium
CN117150233B (en) * 2023-10-30 2024-02-13 广东电网有限责任公司湛江供电局 Power grid abnormal data management method, system, equipment and medium
CN117196446A (en) * 2023-11-06 2023-12-08 北京中海通科技有限公司 Product risk real-time monitoring platform based on big data
CN117196446B (en) * 2023-11-06 2024-01-19 北京中海通科技有限公司 Product risk real-time monitoring platform based on big data
CN117455127B (en) * 2023-12-26 2024-03-15 临沂市园林环卫保障服务中心 Plant carbon sink dynamic data monitoring system based on wisdom gardens
CN117476136A (en) * 2023-12-28 2024-01-30 山东松盛新材料有限公司 High-purity carboxylate synthesis process parameter optimization method and system
CN117476136B (en) * 2023-12-28 2024-03-15 山东松盛新材料有限公司 High-purity carboxylate synthesis process parameter optimization method and system

Also Published As

Publication number Publication date
CN113742387A (en) 2021-12-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21811916

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 13.04.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21811916

Country of ref document: EP

Kind code of ref document: A1