CN114493230A

CN114493230A - Abnormal service data processing method and device and storage medium

Info

Publication number: CN114493230A
Application number: CN202210068428.4A
Authority: CN
Inventors: 宋韶旭; 王浩宇; 张小健; 王建民
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-13

Abstract

The application provides a method, a device and a storage medium for processing abnormal service data. The method comprises: acquiring a plurality of service data of a service, the cycle length T of each service data and the acquisition time of each service data; classifying service data having the same time sequence position into one extended set, based on the time sequence position of the acquisition time of each service data within the data cycle to which the service data belongs; for each service data y_i, forming the extended neighborhood set N_i of y_i from (a) each service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, (b) each service data in the extended set to which y_i belongs, and (c) the service data corresponding to the W_s acquisition times before and the W_s acquisition times after t_i; and determining abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data.

Description

Abnormal service data processing method and device and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing abnormal service data, and a storage medium.

Background

A service (such as a website application) inevitably encounters service problems while it is in use. Troubleshooting mainly consists of determining abnormal data from the service data generated while the service runs, and then analyzing the determined abnormal data to determine and solve the service problem.

At present, abnormal data is generally determined from the service data with the traditional K-Sigma outlier detection method. Specifically, as shown in fig. 1, the service data acquisition device 11 acquires service data (such as the access volume of a website) and sends a plurality of service data acquired within a period of time (such as one week), together with the acquisition time corresponding to each service data, to the abnormal data identification device 12. The abnormality detection unit 121 in the abnormal data identification device 12 determines whether service data y_j is abnormal data by using the K-Sigma outlier detection method as follows: service data y_j that satisfies the formula |y_j - μ_j| > K·σ_j is determined to be abnormal data, where μ_j and σ_j are, respectively, the mean and the standard deviation of the service data in the neighborhood set N_j.

Here, N_j is the neighborhood set of service data y_j, |N_j| is the cardinality of the set N_j (that is, the number of elements in N_j), and K is a preset value (e.g., K = 3). The neighborhood set N_j of y_j is the data set formed by y_j together with the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_j of y_j, where W_s is a preset window value and j is a natural number greater than zero.
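The prior-art check described above can be sketched in a few lines of Python. This is only an illustration: the function name `k_sigma_window`, the truncation of the window at the two ends of the series, and the use of the population standard deviation (divisor |N_j|) are assumptions, since the text does not fix them.

```python
import math


def k_sigma_window(y, w_s, k=3.0):
    """Return the 0-based indices j for which |y_j - mu_j| > k * sigma_j.

    N_j is y_j plus up to w_s values on each side (assumed to be truncated
    at the ends of the series).
    """
    n = len(y)
    flagged = []
    for j in range(n):
        neighborhood = y[max(0, j - w_s): min(n, j + w_s + 1)]
        mu = sum(neighborhood) / len(neighborhood)
        # Population standard deviation (divisor |N_j|); this is an assumption.
        sigma = math.sqrt(
            sum((v - mu) ** 2 for v in neighborhood) / len(neighborhood)
        )
        if abs(y[j] - mu) > k * sigma:
            flagged.append(j)
    return flagged
```

For instance, `k_sigma_window([10] * 10 + [100] + [10] * 9, w_s=5)` returns `[10]`, the index of the single spike. Because the neighborhood is purely local, a spike that recurs once per cycle looks the same to this check as a one-off spike.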

Abnormal data determined with the prior-art method is inaccurate, and inaccurate abnormal data makes it impossible to determine, or to efficiently solve, the service problem on the basis of that data.

Disclosure of Invention

The application provides a processing method and device of abnormal business data and a storage medium, which are used for solving the problem that obtained abnormal data are not accurate when the abnormal data are determined by adopting the method in the prior art.

In a first aspect, the present application provides a method for processing abnormal service data, including:

acquiring a plurality of service data of a service, the cycle length T of each service data and the acquisition time of each service data; the cycle length is the number of the service data contained in the data cycle corresponding to the service data;

classifying service data having the same time sequence position into one extended set, based on the time sequence position of the acquisition time of each service data within the data cycle to which the service data belongs;

for each service data y_i, forming the extended neighborhood set N_i of y_i from (a) each service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, (b) each service data in the extended set to which y_i belongs, and (c) the service data corresponding to the W_s acquisition times before and the W_s acquisition times after t_i;

determining abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data;

wherein i = 1, 2, 3, …, n; n ≥ 2 and n is a natural number; W_d and W_s are both preset values, and W_d < W_s; T ≥ 2 and T is a natural number.
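To make the composition of the extended neighborhood set concrete, the following Python sketch lists which positions of the time-ordered series end up in N_i. It assumes evenly spaced samples, a series that starts at the beginning of a cycle (so that the time sequence position of the sample at 0-based index p is p mod T), and neighbors that fall outside the series being skipped; the function name is hypothetical.

```python
def extended_neighborhood_indices(i, n, T, w_d, w_s):
    """Return the sorted 0-based indices that make up N_i.

    i: index of y_i, n: number of samples, T: cycle length,
    w_d / w_s: the preset values W_d and W_s (with W_d < W_s).
    """
    indices = set()
    # y_i and the samples at the W_s acquisition times before and after t_i.
    indices.update(range(max(0, i - w_s), min(n, i + w_s + 1)))
    # The extended set of y_i and the extended sets of the samples at the
    # W_d acquisition times before and after t_i.
    for j in range(max(0, i - w_d), min(n, i + w_d + 1)):
        indices.update(range(j % T, n, T))  # all samples at the same cycle position
    return sorted(indices)
```

With n = 16, T = 8, W_d = 1 and W_s = 2, the neighborhood of the sample at index 5 is `[3, 4, 5, 6, 7, 12, 13, 14]`: the local window 3 to 7 plus the samples one cycle later at the same positions as indices 4, 5 and 6.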

Optionally, the classifying the service data with the same time sequence position into an extended set based on the time sequence position of the acquisition time of each service data in the data cycle of the service data includes:

sequentially classifying each of the plurality of service data, arranged in time order, into the sequentially arranged extended sets Z_h, to obtain T extended sets;

wherein the acquisition times of all service data in one extended set have the same time sequence position within the data cycles to which the service data belong; h = 1, 2, 3, …, T.

marking each service data with a position identifier, wherein the position identifier identifies the time sequence position of the acquisition time of the service data within the data cycle to which the service data belongs;

and grouping the service data having the same position identifier into one extended set.
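The grouping step can be sketched as follows in Python, under the assumption that the samples are evenly spaced and that the series starts at the beginning of a cycle, so that the position identifier of the sample at 0-based index i is simply i mod T. The function name is hypothetical.

```python
def group_into_extended_sets(n, T):
    """Split the indices 0..n-1 of a time-ordered series into T extended sets.

    The list at position h holds every index whose position inside its
    cycle is h (position identifier = index mod T, an assumption).
    """
    extended_sets = [[] for _ in range(T)]
    for i in range(n):
        extended_sets[i % T].append(i)
    return extended_sets
```

For n = 10 and T = 4 this yields `[[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]`; an incomplete final cycle simply leaves the later sets one element shorter.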

Optionally, determining abnormal service data from the multiple service data based on the extended neighborhood set corresponding to each service data includes:

determining abnormal service data from the plurality of service data, based on the extended neighborhood set corresponding to each service data, in the following manner:

determining that service data y_i satisfying the formula |y_i - μ_i| > K·σ_i is abnormal data, where μ_i and σ_i are, respectively, the mean and the standard deviation of the service data in the extended neighborhood set N_i;

N_i is the extended neighborhood set of service data y_i, |N_i| is the cardinality of the set N_i (that is, the number of elements in N_i), and K is a preset value.
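The following Python sketch applies the K-Sigma criterion to the extended neighborhood set and runs it on a small synthetic series. Everything specific here is an assumption made for illustration: the function name, the position rule (index mod T), the population standard deviation, the truncation at the series boundaries, and the synthetic values themselves.

```python
import math


def detect_with_extended_neighborhood(y, T, w_d, w_s, k=3.0):
    """Return the 0-based indices i for which |y_i - mu_i| > k * sigma_i over N_i."""
    n = len(y)
    flagged = []
    for i in range(n):
        # Local window: y_i and up to w_s samples on each side.
        indices = set(range(max(0, i - w_s), min(n, i + w_s + 1)))
        # Extended sets of y_i and of its w_d neighbors on each side.
        for j in range(max(0, i - w_d), min(n, i + w_d + 1)):
            indices.update(range(j % T, n, T))
        values = [y[p] for p in indices]
        mu = sum(values) / len(values)
        # Population standard deviation; this is an assumption.
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        if abs(y[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


# Synthetic data: six cycles of length 12 with a recurring spike at position 6
# of every cycle, plus a single non-periodic deviation at index 36.
cycle = [10] * 6 + [100] + [10] * 5
y = cycle * 6
y[36] = 60
print(detect_with_extended_neighborhood(y, T=12, w_d=1, w_s=5))
```

On this synthetic series the call reports only index 36. The recurring spikes are not reported, because each of them shares its extended neighborhood with the spikes of the other cycles, which raises the neighborhood mean and standard deviation.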

Optionally, the obtaining the cycle length T of each service data includes:

based on the plurality of service data and the acquisition time of each service data, determining the period length T of each service data by adopting the following mode:

arranging the n service data according to the acquisition time sequence of each service data to form a service data sequence X;

calculating the Pearson correlation coefficient c_k between a sequence A formed by the 1st to the (n-k)-th service data in the service data sequence X and a sequence B formed by the (k+1)-th to the n-th service data in the service data sequence X, to obtain n Pearson correlation coefficients c_k;

arranging the Pearson correlation coefficients in the numerical order of k in c_k to form an autocorrelation sequence C, the numerical order being either descending or ascending;

based on a preset peak threshold C_th, determining a plurality of peak values from the autocorrelation sequence C as follows: forming a peak set from a Pearson correlation coefficient c_k and the W_p Pearson correlation coefficients before and the W_p Pearson correlation coefficients after c_k, and determining the Pearson correlation coefficient that has the largest value in the peak set and is not less than C_th to be a peak value;

sorting the position serial numbers q of the peak values in the autocorrelation sequence C in ascending order to form a serial number set, and subtracting each two adjacent position serial numbers in the serial number set to obtain the absolute value of their difference;

forming a difference sequence from the absolute differences, and calculating the median of the difference sequence, wherein the median represents the cycle length T of each service data;

wherein W_p is a preset value; i = 1, 2, 3, …, n; k = 0, 1, 2, 3, …, (n-1); q = 1, 2, 3, …, n; n ≥ 2 and n is a natural number.

In a second aspect, the present application provides an exception data handling apparatus, comprising: a cycle determination unit and a data identification unit;

the period determining unit is used for acquiring a plurality of service data of a service, the period length T of each service data and the acquisition time of each service data; the cycle length is the number of the service data contained in the data cycle corresponding to the service data;

the data identification unit is configured to: classify service data having the same time sequence position into one extended set, based on the time sequence position of the acquisition time of each service data within the data cycle to which the service data belongs; for each service data y_i, form the extended neighborhood set N_i of y_i from (a) each service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, (b) each service data in the extended set to which y_i belongs, and (c) the service data corresponding to the W_s acquisition times before and the W_s acquisition times after t_i; and determine abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data.

Optionally, the period determining unit includes a processing module, an autocorrelation calculating module, a peak searching module, and a period determining module, and the data identifying unit includes a data grouping module, a neighborhood determining module, and an anomaly detecting module;

the processing module is configured to acquire a plurality of service data of a service, the cycle length T of each service data and the acquisition time of each service data, and is further configured to acquire a plurality of service data of a service and the acquisition time of each service data;

the autocorrelation calculating module is configured to arrange the n service data in the order of their acquisition times to form a service data sequence X, and then to calculate the Pearson correlation coefficient c_k between a sequence A formed by the 1st to the (n-k)-th service data in the service data sequence X and a sequence B formed by the (k+1)-th to the n-th service data in the service data sequence X, to obtain n Pearson correlation coefficients c_k; and to arrange the Pearson correlation coefficients in the numerical order of k in c_k to form an autocorrelation sequence C, the numerical order being either descending or ascending;

the peak searching module is configured to determine, based on a preset peak threshold C_th, a plurality of peak values from the autocorrelation sequence C as follows: forming a peak set from a Pearson correlation coefficient c_k and the W_p Pearson correlation coefficients before and the W_p Pearson correlation coefficients after c_k, and determining the Pearson correlation coefficient that has the largest value in the peak set and is not less than C_th to be a peak value;

the period determining module is configured to sort the position serial numbers q of the peak values in the autocorrelation sequence C in ascending order to form a serial number set, and to subtract each two adjacent position serial numbers in the serial number set to obtain the absolute value of their difference; and, after forming a difference sequence from the absolute differences, to calculate the median of the difference sequence, wherein the median represents the cycle length T of each service data;

the data grouping module is used for grouping the service data with the same time sequence position into an extended set based on the time sequence position of the acquisition time of the service data in the data cycle of the service data;

the neighborhood determination module is configured to, for each service data y_i, form the extended neighborhood set N_i of y_i from (a) each service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, (b) each service data in the extended set to which y_i belongs, and (c) the service data corresponding to the W_s acquisition times before and the W_s acquisition times after t_i;

the anomaly detection module is configured to determine abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data.

In a third aspect, the present application provides an abnormal data processing apparatus, including:

a processor and a memory;

the memory stores executable instructions executable by the processor;

wherein execution of the executable instructions stored by the memory by the processor causes the processor to perform the method as described above.

In a fourth aspect, the present application provides a storage medium having stored therein computer-executable instructions for implementing the method as described above when executed by a processor.

In a fifth aspect, the present application provides a program product comprising a computer program which, when executed by a processor, implements the method as described above.

According to the abnormal business data processing method, device and storage medium, the neighborhood set of each business data determined by the prior art method is periodically expanded to obtain the expanded neighborhood set of each business data, and accurate abnormal business data are determined from a plurality of business data based on the expanded neighborhood set of each business data. According to the method, in the process of determining the abnormal business data, the periodic mutation data in the periodic business data are filtered, so that the accuracy of the determined abnormal business data is ensured, and the problem that the abnormal data determined by the method in the prior art is not accurate is solved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

FIG. 1 is a prior art anomaly data determination system architecture diagram;

FIG. 2 is a schematic diagram of website visitation volume of a website within a week according to an embodiment of the present application;

FIG. 3 is a diagram of an architecture of a system for processing abnormal business data according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a processing method for abnormal service data according to an embodiment of the present application;

FIG. 5 is a structural diagram of an abnormal data processing apparatus according to an embodiment of the present application;

FIG. 6 is a structural diagram of an abnormal data processing apparatus according to an embodiment of the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1 is an architecture diagram of a prior-art abnormal data determination system. As shown in fig. 1, the service data acquisition device 11 acquires service data and sends a plurality of service data acquired within a period of time, together with the acquisition time corresponding to each service data, to the abnormal data identification device 12. The abnormality detection unit 121 in the abnormal data identification device 12 determines abnormal data from the plurality of service data by using the K-Sigma outlier detection method and sends the determined abnormal data to the processing device 10, which processes the service, or the service problem of the service system, based on the abnormal data.

For example, when a website access problem occurs in an application process of a website, abnormal data is usually determined from traffic data (e.g., the amount of website access) of a time period (e.g., one week) including the time when the website access problem occurs, and then the traffic problem is determined based on the abnormal data and solved. Fig. 2 is a schematic view of website visitation amount of a certain website in one week according to an embodiment of the present application. As shown in fig. 2, the visit data of the website is periodic data with a period of one day, and sudden changes of the data (such as the periodic sudden change data shown in fig. 2) occur at certain fixed times of each day. The periodic mutation data is generated when the website is maintained every day, is normal business data, and does not represent that business problems occur in corresponding business (namely, the website application).

However, as shown in fig. 1, the prior-art method determines whether service data y_j is abnormal data by applying the K-Sigma outlier detection method to the neighborhood set N_j of y_j, and the neighborhood set N_j does not take the periodicity of the service data into account.

When the prior-art method is used to determine abnormal data in periodic service data such as that shown in fig. 2, the periodic mutation data shown in fig. 2 is generally determined to be abnormal data. That is, the abnormal data obtained from the service data of fig. 2 consists of a plurality of periodic mutation data and one item of real abnormal data. Of this determined abnormal data, only the real abnormal data is associated with, or can characterize, a service problem; the periodic mutation data is redundant data that is not associated with any service problem. Therefore, the abnormal data obtained by applying the prior-art method to periodic service data is inaccurate, and inaccurate abnormal data makes it impossible to determine, or to efficiently solve, the service problem. In addition, if a technician must analyze the determined abnormal data in order to determine and solve the service problem of the service or the service system, the inaccurate abnormal data greatly hinders that work and greatly reduces the efficiency and accuracy with which the technician analyzes and solves the service problem.

In view of the above, the present application provides a method for processing abnormal service data, which includes periodically expanding a neighborhood set of each service data to obtain an expanded neighborhood set of each service data, and determining abnormal service data based on the expanded neighborhood set of each service data, where the determined abnormal service data does not include periodic mutation data (i.e., the periodic mutation data is filtered out), so as to ensure that the obtained abnormal service data is accurate data associated with a service problem. When the abnormal business data is used for analyzing and determining the business problems, the high efficiency and the accuracy of the business problem analyzing and solving work are greatly improved, and the data volume for analyzing and solving the business problems is reduced.

The following describes a method for processing abnormal service data provided by the present application with reference to some embodiments.

Fig. 3 is an architecture diagram of a system for processing abnormal service data according to an embodiment of the present application. As shown in fig. 3, the system includes a processing device 10, a service data acquisition device 11, and an abnormal data processing device 13 connected to the processing device, wherein the abnormal data processing device 13 includes a period determining unit 131 and a data identification unit 132.

Illustratively, the service data collecting device 11 collects service data of a service, and sends a plurality of service data of the service, a cycle length T of each service data, and a collecting time of each service data to the cycle determining unit 131 in the abnormal data processing device 13. The period length T is the number of service data included in the data period corresponding to the service data. The cycle determining unit 131 sends the obtained plurality of pieces of service data, the cycle length T of each piece of service data, and the acquisition time of each piece of service data to the data identifying unit 132.

The data identification unit 132 classifies service data having the same time sequence position into one extended set, based on the time sequence position of the acquisition time of each service data within the data cycle to which the service data belongs. For each service data y_i, the data identification unit 132 also forms the extended neighborhood set N_i of y_i from (a) each service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, (b) each service data in the extended set to which y_i belongs, and (c) the service data corresponding to the W_s acquisition times before and the W_s acquisition times after t_i. Then, the data identification unit 132 determines abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data, and transmits the abnormal service data to the processing device 10. The processing device 10 processes the service, or the service problem of the service system, based on the received abnormal service data.

Here, i = 1, 2, 3, …, n; n ≥ 2 and n is a natural number; W_d and W_s are both preset values, and W_d < W_s; T ≥ 2 and T is a natural number.

Alternatively, after the service data acquisition device 11 acquires the service data of a service, the service data acquisition device 11 may send only the plurality of service data of the service and the acquisition time of each service data to the period determination unit 131 in the abnormal data processing device 13. The cycle determining unit 131 determines the cycle length T of each service data based on the obtained plurality of service data and the acquisition time of each service data. The cycle determining unit 131 sends the obtained plurality of pieces of service data and the acquisition time of each piece of service data, and the determined cycle length T of each piece of service data to the data identifying unit 132.

According to the abnormal business data processing method provided by the embodiment of the application, all the business data with the same time sequence position are classified into one expansion set, the expansion neighborhood set of all the business data is determined and obtained based on the expansion set to which all the business data belong, and then accurate abnormal business data is determined from the business data based on the expansion neighborhood set corresponding to all the business data. By the adoption of the abnormal business data processing method, when abnormal business data are determined from a plurality of business data, the periodic mutation data are filtered by expanding the neighborhood set, the periodic mutation data are prevented from being determined as abnormal data, and the accuracy of the determined abnormal business data is guaranteed. The processing method for the abnormal business data, provided by the embodiment of the application, solves the problem that the obtained abnormal data is not accurate when the abnormal data is determined by adopting the method in the prior art.

The following describes a method for processing abnormal service data provided in this embodiment with reference to fig. 4. Fig. 4 is a schematic diagram of a method for processing abnormal service data according to an embodiment of the present application. The execution subject of the embodiment of the present application is the abnormal data processing apparatus 13 in the embodiment shown in fig. 3. As shown in fig. 4, the method includes:

s401, acquiring a plurality of service data of a service, the cycle length T of each service data and the acquisition time of each service data; the cycle length is the number of the service data contained in the data cycle corresponding to the service data.

Specifically, the cycle determining unit 131 in the abnormal data processing apparatus 13 acquires a plurality of pieces of service data of a service, the cycle length T of each piece of service data, and the acquisition time of each piece of service data from the service data acquiring apparatus 11. The period length is the number of the service data contained in the data period corresponding to the service data; t is more than or equal to 2 and is a natural number.

Alternatively, the period determining unit 131 may also obtain a plurality of service data of a service and the acquisition time of each service data from the service data acquisition device 11, and determine the period length T of each service data based on the obtained plurality of service data and the acquisition time of each service data.

Next, the cycle determining unit 131 sends the obtained plurality of pieces of service data, the cycle length T of each piece of service data, and the acquisition time of each piece of service data to the data identifying unit 132.

The cycle determining unit 131 determines the cycle length T of each service data, based on the obtained plurality of service data and the acquisition time of each service data, as follows.

Assuming that the plurality of service data are n service data, the cycle determining unit 131 determines the cycle length T of each service data according to steps S4011 to S4016, based on the n service data and the acquisition time of each service data.

S4011, arranging the n service data according to the collection time sequence of each service data to form a service data sequence X.

Illustratively, the period determination unit 131 arranges n service data in the collection time sequence of each service data to form a service data sequence X shown in table 1.

Table 1 Service data sequence X

| Serial number i of the service data in X | 1 | 2 | 3 | 4 | … | n |
| --- | --- | --- | --- | --- | --- | --- |
| Service data y_i | y₁ | y₂ | y₃ | y₄ | … | y_n |

S4012, calculating the Pearson correlation coefficient c_k between a sequence A composed of the 1st to the (n-k)-th service data in the service data sequence X and a sequence B composed of the (k+1)-th to the n-th service data in the service data sequence X, to obtain n Pearson correlation coefficients c_k, where i = 1, 2, 3, …, n; k = 0, 1, 2, 3, …, (n-1); n ≥ 2 and n is a natural number.

Illustratively, the period determining unit 131 calculates the Pearson correlation coefficient c_k of the sequence A and the sequence B according to equations (1) and (2):

where m is the number of service data in each of the sequence A and the sequence B; a_f is a service data in the sequence A and b_f is a service data in the sequence B; f = 1, 2, 3, …, m; m ≥ 1 and m is a natural number.
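As an illustration of step S4012, the Python sketch below computes c_k for every lag k using the standard definition of the Pearson correlation coefficient (the covariance of A and B divided by the product of their standard deviations). The function names are hypothetical, and returning 0.0 when the coefficient is undefined (for example when a sequence has a single element or is constant) is an assumption, since the text does not say how that case is handled.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    m = len(a)
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    denom = math.sqrt(var_a * var_b)
    # The coefficient is undefined when either sequence has zero variance;
    # returning 0.0 in that case is an assumption made for this sketch.
    return cov / denom if denom > 0 else 0.0


def autocorrelation_sequence(x):
    """Return [c_0, c_1, ..., c_(n-1)] for the time-ordered series x."""
    n = len(x)
    # A = x[0 : n-k] (1st to (n-k)-th sample), B = x[k : n] ((k+1)-th to n-th).
    return [pearson(x[: n - k], x[k:]) for k in range(n)]
```

For a series that repeats exactly every 4 samples, such as `[1, 2, 3, 4] * 3`, the coefficients at lags 0, 4 and 8 are (up to rounding) equal to 1, which is what the later peak search relies on.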

S4013, arranging the Pearson correlation coefficients in the numerical order of k in c_k to form the autocorrelation sequence C.

Illustratively, the period determining unit 131 arranges the Pearson correlation coefficients in the numerical order of k in c_k to form the autocorrelation sequence C, an example of which is shown in Table 2. The numerical order is either descending or ascending.

Table 2 Autocorrelation sequence C

where q = 1, 2, 3, …, n; n ≥ 2 and n is a natural number.

S4014, based on preset peak threshold value C_thDetermining a plurality of spikes from the autocorrelation sequence C as follows: based on Pearson's correlation coefficient c_kAnd c_kEach W before and after_pDetermining peak set composed of Pearson correlation coefficients, determining the peak set with maximum value not less than C_thThe pearson correlation coefficient of (a) is a peaked value. Wherein, W_pIs a preset value.

Illustratively, the period determining unit 131 determines, for each of the pearson correlation coefficients in the autocorrelation sequence C, whether each of the pearson correlation coefficients is sharp or not as followsPeak value: based on Pearson's correlation coefficient c_kAnd c_kEach W before and after_pDetermining peak set composed of Pearson correlation coefficients, determining the peak set with maximum value not less than C_thThe pearson correlation coefficient of (a) is a peaked value.

For example, suppose W_pThe period determining unit 131 sequentially determines whether the pearson correlation coefficient shown in table 2 is a peaked value as follows:

c is to₀And c₀Before and after each 1 Pearson correlation coefficient (i.e., c)₀，c₁) Constituting a set of peaks c₀，c₁]Determining a set of peaks [ c ]₀，c₁]The median value is maximum and not less than C_thThe Peak correlation coefficient;

c is to₁And c₁Before and after each 1 Pearson correlation coefficient (i.e., c)₀，c₁，c₂) Constituting a set of peaks c₀，c₁，c₂]Determining a set of peaks [ c ]₀，c₁，c₂]The median value is maximum and not less than C_thThe Peak correlation coefficient;

c is to₂And c₂Before and after each 1 Pearson correlation coefficient (i.e., c)₁，c₂，c₃) Constituting a set of peaks c₁，c₂，c₃]Determining a set of peaks [ c ]₁，c₂，c₃]The median value is maximum and not less than C_thThe Peak correlation coefficient;

…；

and so on, until it has been determined whether each Pearson correlation coefficient in the autocorrelation sequence C is a peak value.
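The peak search of step S4014 can be sketched in Python as follows. This is only an illustrative reading of the step, not the applicant's implementation: it assumes that c_k counts as a peak value when it is the largest coefficient in its own window and is not less than C_th, and that windows are clipped at both ends of the sequence. The function name `find_peaks` and the sample coefficients are hypothetical and are not the values of Table 2.

```python
def find_peaks(c, w_p, c_th):
    """Return the indices k at which c[k] is treated as a peak value.

    Assumed reading of step S4014: c[k] is a peak value when it is the
    largest coefficient in the window c[k - w_p .. k + w_p] (clipped at
    the ends of the sequence) and is not less than the threshold c_th.
    """
    peaks = []
    for k, value in enumerate(c):
        lo = max(0, k - w_p)            # first index of the peak set
        hi = min(len(c), k + w_p + 1)   # one past the last index
        window = c[lo:hi]
        if value == max(window) and value >= c_th:
            peaks.append(k)
    return peaks


# Hypothetical coefficients c_0 .. c_11 (illustrative values only).
C = [1.0, 0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.3, 0.85, 0.2, 0.8, 0.1]
print(find_peaks(C, w_p=1, c_th=0.7))  # [0, 4, 8, 10]
```

With these sample values the peaks fall at k = 0, 4, 8 and 10. Note that under this reading two equal neighbouring maxima would both be reported, a case the text does not address.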

S4015, sorting the position sequence numbers q of the peak values in the autocorrelation sequence C in ascending order to form a sequence number set, and subtracting each two adjacent position sequence numbers in the sequence number set to obtain the absolute value of the difference between the two adjacent position sequence numbers.

Illustratively, suppose that the peak values determined in the autocorrelation sequence C shown in Table 2 according to step S4014 are c₀, c₄, c₈, c₁₀. The period determining unit 131 sorts the position sequence numbers 1, 5, 9 and 11 of the peak values c₀, c₄, c₈, c₁₀ in the autocorrelation sequence C in ascending order to form a sequence number set [1, 5, 9, 11], and subtracts each two adjacent position sequence numbers in the sequence number set [1, 5, 9, 11] to obtain the absolute values of the differences between adjacent position sequence numbers, namely 4, 4 and 2.

Alternatively, the value of the position sequence number q may be the same as the value of k of the Pearson correlation coefficient c_k constituting the autocorrelation sequence C.

S4016, forming a difference sequence by the absolute values of the differences, and performing median calculation on the difference sequence to obtain a median of the difference sequence, wherein the median represents the cycle length T of each service data.

Illustratively, the period determining unit 131 forms the absolute differences 4, 4 and 2 between adjacent position sequence numbers determined in step S4015 into a difference sequence [4, 4, 2], and performs median calculation on the difference sequence [4, 4, 2] to obtain its median, which is 4. The median represents the period length T of each service data, that is, T = 4.
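Steps S4015 and S4016 amount to taking the median gap between sorted peak positions. A minimal Python sketch follows, using the position sequence numbers 1, 5, 9 and 11 from the example above. The function name `period_length` is hypothetical, and the handling of fewer than two peaks is an assumption that the text does not cover.

```python
from statistics import median


def period_length(peak_positions):
    """Estimate the period length T from peak position sequence numbers.

    Sorts the positions in ascending order (S4015), takes the absolute
    difference of each adjacent pair, and returns the median of those
    differences (S4016).
    """
    ordered = sorted(peak_positions)
    diffs = [abs(b - a) for a, b in zip(ordered, ordered[1:])]
    if not diffs:
        # Assumption: the text does not say what happens with < 2 peaks.
        raise ValueError("at least two peak positions are needed")
    return median(diffs)


print(period_length([1, 5, 9, 11]))  # 4
```

For an even number of differences, `statistics.median` averages the two middle values, so the result may not be an integer; the text does not state how that case is resolved.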

S402, classifying the service data with the same time sequence position into an expansion set based on the time sequence position of the acquisition time of the service data in the data cycle of the service data.

For example, after the data identification unit 132 receives the plurality of service data, the period length T of each service data, and the collection time of each service data sent by the period determination unit 131, the data identification unit 132 classifies each service data with the same time sequence position into an extended set based on the time sequence position of the collection time of each service data in the data period of the service data.

Alternatively, the data identification unit 132 may place each of the plurality of service data, in order, into the sequentially arranged extended sets Z_h, so as to obtain T extended sets. Within each extended set, the acquisition times of the service data occupy the same time sequence position in the data cycles to which the service data belong; h = 1, 2, 3, …, T.

For example, the data identification unit 132 obtains a plurality of service data and arranges them in order of the acquisition time of each service data, so as to obtain the service data sequence X shown in Table 1. Assuming that the period length T of each service data is 4 and n is 12, the data identification unit 132 places each service data in the service data sequence X, in order, into the sequentially arranged extended sets Z_h, so as to obtain the T extended sets shown in Table 3.

TABLE 3 Extended sets Z_h

Extended set Z₁	[y₁, y₅, y₉]
Extended set Z₂	[y₂, y₆, y₁₀]
Extended set Z₃	[y₃, y₇, y₁₁]
Extended set Z₄	[y₄, y₈, y₁₂]
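The grouping that yields Table 3 can be sketched as a round-robin assignment over the time-ordered sequence. The following Python fragment is an illustration only: the function name `build_extended_sets` and the string labels standing in for service data are hypothetical, and it assumes the sequence starts at the first position of a data cycle.

```python
def build_extended_sets(sequence, period):
    """Group time-ordered service data by position within the data cycle.

    Returns a dict mapping h (1 .. period) to the extended set Z_h.
    Assumes the first element of `sequence` sits at position 1 of a cycle.
    """
    sets = {h: [] for h in range(1, period + 1)}
    for index, item in enumerate(sequence):
        h = index % period + 1  # time sequence position within the cycle
        sets[h].append(item)
    return sets


# Labels y1 .. y12 stand in for the service data of sequence X (n = 12).
X = [f"y{i}" for i in range(1, 13)]
Z = build_extended_sets(X, period=4)
for h, members in Z.items():
    print(f"Z{h}: {members}")
```

Run with T = 4 and n = 12, this prints the same membership as Table 3, for example Z1 containing y1, y5 and y9.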

Optionally, the data identification unit 132 may also mark a position identifier for each service data, where the position identifier identifies the time sequence position that the acquisition time of the service data occupies in the data cycle of the service data. Then, the data identification unit 132 classifies the service data with the same position identifier into one extended set.

For example, the data identification unit 132 obtains a plurality of service data in the service data sequence X as shown in table 1. Assuming that the service data in the service data sequence X shown in table 1 are arranged in time sequence, and the cycle length T of each service data is 4 and n is 12, the data identification unit 132 marks the position identifier shown in table 4 for each service data, and classifies the service data with the same position identifier into an extended set, so as to obtain an extended set shown in table 3.

TABLE 4 data period and location identification for each service data

The data cycle lengths of the service data are the same as the cycle lengths of the data cycle 1, the data cycle 2 and the data cycle 3 shown in table 4.

S403, for each service data y_i, forming the extended neighborhood set N_i of the service data y_i from: the service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong; the service data in the extended set to which y_i belongs; and the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_i.

Specifically, for each service data y_i, the data identification unit 132 forms the extended neighborhood set N_i of the service data y_i from: the service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong; the service data in the extended set to which y_i belongs; and the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_i. Wherein i = 1, 2, 3, …, n; n ≥ 2 and n is a natural number; W_d and W_s are both preset values, and W_d < W_s.

Exemplarily, assume that the plurality of service data obtained by the data identification unit 132 are as shown in Table 4, that T = 4, W_d = 0 and W_s = 1, and that the extended set corresponding to each service data is as shown in Table 3. Taking the extended neighborhood set N₂ of the service data y₂ as an example, the data identification unit 132 combines the service data in the extended sets to which the service data corresponding to the 0 acquisition times before and after the acquisition time t₂ of y₂ belong (none, since W_d = 0), the service data in the extended set Z₂ to which y₂ belongs (i.e., y₂, y₆, y₁₀), and the service data corresponding to the 1 acquisition time before and the 1 acquisition time after t₂ (i.e., y₁ and y₃) to form the extended neighborhood set N₂ of the service data y₂: [y₂, y₆, y₁₀, y₁, y₃].

In contrast, if the prior-art method is adopted under the same conditions (the same service data, T = 4 and W_s = 1), the neighborhood set N_2c determined for the service data y₂ is [y₂, y₁, y₃].
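The construction of N_i in step S403 can be sketched as follows, working on 1-based indices i rather than on the data values themselves. This is a hedged illustration under stated assumptions: the function name `extended_neighborhood` is hypothetical, the extended set of y_j is taken to be every index at a multiple of T from j, and indices outside 1..n are simply skipped, a boundary rule the text does not spell out.

```python
def extended_neighborhood(i, n, period, w_d, w_s):
    """Return the 1-based indices of the service data that form N_i.

    Periodic part: the whole extended set of y_i and of the w_d data
    points on each side of it. Local part: the w_s data points on each
    side of y_i. Out-of-range indices are skipped (an assumption).
    """
    members = []

    def add(j):
        if 1 <= j <= n and j not in members:
            members.append(j)

    # Periodic part: y_i first, then its w_d neighbours on each side.
    centres = [i] + [i + d for d in range(-w_d, w_d + 1) if d != 0]
    for j in centres:
        if 1 <= j <= n:
            for m in range(1, n + 1):
                if (m - j) % period == 0:  # same position in the cycle
                    add(m)

    # Local part: w_s acquisition times before and after t_i.
    for d in range(1, w_s + 1):
        add(i - d)
        add(i + d)
    return members


print(extended_neighborhood(2, n=12, period=4, w_d=0, w_s=1))
# [2, 6, 10, 1, 3]
```

With i = 2, n = 12, T = 4, W_d = 0 and W_s = 1 the result corresponds to [y₂, y₆, y₁₀, y₁, y₃], matching the example above; dropping the periodic part leaves only [y₂, y₁, y₃], the prior-art neighborhood.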

S404, determining abnormal business data from the plurality of business data based on the expansion neighborhood set corresponding to each business data.

Illustratively, the data identification unit 132 determines abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data.

For example, the data identification unit 132 determines abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data as follows:

the data identification unit 132 determines the service data y_i satisfying formula (3) to be abnormal data:

|y_i − μ_i| > K·σ_i    (3)

wherein N_i is the extended neighborhood set of the service data y_i, |N_i| is the cardinality of the set N_i, that is, the number of elements constituting the set N_i, and K is a preset value.
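The test of formula (3) can be illustrated with a short Python sketch. It rests on an assumption that should be read with care: the definitions of μ_i and σ_i did not survive in this text, so they are taken here to be the mean and the population standard deviation of the values in N_i, which the reference to |N_i| suggests but does not confirm. The function name `is_abnormal` and the sample values are hypothetical and are not the data of Table 5.

```python
from statistics import fmean, pstdev


def is_abnormal(y_i, neighborhood, k):
    """Apply |y_i - mu_i| > K * sigma_i to one service data value.

    Assumption: mu_i and sigma_i are the mean and population standard
    deviation of the values in the neighborhood set of y_i.
    """
    mu = fmean(neighborhood)
    sigma = pstdev(neighborhood)
    return abs(y_i - mu) > k * sigma


# Hypothetical values: a spike of 10 that recurs every period.
extended = [10, 10, 10, 1, 1]   # values of [y2, y6, y10, y1, y3]
local_only = [10, 1, 1]         # values of [y2, y1, y3]

print(is_abnormal(10, extended, k=1))    # False
print(is_abnormal(10, local_only, k=1))  # True
```

With these sample values and K = 1, the recurring spike is not flagged when the extended neighborhood is used, but it is flagged when only the immediate neighbours are considered.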

Further, the data recognition unit 132 transmits the determined abnormal traffic data to the processing device 10. The processing device 10 processes the service problem of the service or the service system based on the obtained abnormal service data.

From the above, as shown in step S403, the extended neighborhood set N₂ of the service data y₂ determined by the method of the present application is [y₂, y₆, y₁₀, y₁, y₃], whereas the neighborhood set N_2c of the service data y₂ determined by the prior-art method is [y₂, y₁, y₃]. It is assumed that the specific numerical values of the respective service data shown in Table 4 are as shown in Table 5 below.

TABLE 5 Business data

As can be seen from Table 5, y₂ is periodic mutation data, and the service data y₁ and y₃ before and after y₂ are both normal data. When the method for processing abnormal service data provided by the present application is adopted, it can be determined, based on the extended neighborhood set N₂ of y₂, that y₂ is not abnormal service data; however, when abnormal data is determined using the prior-art method, y₂ is determined to be abnormal data based on its neighborhood set N_2c.

The following example is combined to specifically compare and explain technical effects achieved by the processing method of abnormal service data provided by the present application and the method in the prior art. It is assumed that the time-series service data and related information generated by the service D in the period L are as shown in table 6.

Table 6 service data and related information generated by service D during time period L

By respectively adopting the abnormal service data processing method provided by the present application and the method in the prior art, the results shown in table 7 are obtained when the abnormal service data is determined from the service data shown in table 6.

TABLE 7 abnormal service data determined by different processing methods

As can be seen from table 6 and table 7, the service data "-3" in table 6 is abnormal service data associated with the service problem occurring in the service D, and the other service data in table 6 is normal service data not associated with the service problem occurring in the service D. Therefore, as can be seen from the results of the abnormal service data determined in table 7, the abnormal service data associated with the service problem can be more accurately determined by using the method for processing abnormal service data provided by the present application.

Therefore, the method for processing abnormal service data provided by the present application periodically extends the neighborhood set of each service data to obtain the extended neighborhood set of each service data. As a result, periodic mutation data can be filtered out when abnormal service data is determined. This prevents periodic mutation data from being mistaken for abnormal data associated with service problems, which would otherwise interfere with the work of determining and solving those problems.

According to the abnormal business data processing method, the neighborhood set of each business data is periodically expanded to obtain the expanded neighborhood set of each business data, the abnormal business data is determined based on the expanded neighborhood set of each business data, and the accurate abnormal business data associated with business problems are obtained. The processing method for the abnormal business data greatly improves the efficiency and accuracy of business problem analysis and solution work. In addition, when the period information such as the data period and the period length of the service data is unknown, the processing method of the abnormal service data provided by the application can be used for determining the period information of the service data.

An embodiment of the present application further provides an abnormal data processing apparatus, which is described below with reference to fig. 3 and fig. 5. Fig. 5 is a structural diagram of an abnormal data processing apparatus according to an embodiment of the present application. As shown in fig. 3, the abnormal data processing apparatus 13 includes: a period determining unit 131 and a data identification unit 132;

a period determining unit 131, configured to obtain multiple service data of a service, a period length T of each service data, and an acquisition time of each service data; the cycle length is the number of the service data contained in the data cycle corresponding to the service data.

The data identification unit 132 is configured to classify the service data with the same time sequence position into an extended set based on the time sequence position of the acquisition time of the service data in the data cycle of the service data; for each service data y_i, to form the extended neighborhood set N_i of the service data y_i from the service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, the service data in the extended set to which y_i belongs, and the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_i; and to determine abnormal service data from the plurality of service data based on the extended neighborhood set corresponding to each service data; wherein i = 1, 2, 3, …, n; n ≥ 2 and n is a natural number; W_d and W_s are both preset values, and W_d < W_s; T ≥ 2 and T is a natural number.

Optionally, as shown in fig. 5, the period determining unit 131 includes a processing module 1311, an autocorrelation calculating module 1312, a spike search module 1313, and a period determining module 1314.

The data recognition unit 132 includes a data grouping module 1321, a neighborhood determination module 1322, and an anomaly detection module 1323.

The processing module 1311 is configured to obtain a plurality of service data of a service, the period length T of each service data, and the acquisition time of each service data; it is also configured to obtain a plurality of service data of a service and the acquisition time of each service data.

The autocorrelation calculating module 1312 is configured to arrange the n service data in order of the acquisition time of each service data to form a service data sequence X, and then to calculate the Pearson correlation coefficient between a sequence A formed by the 1st to the (n−k)th service data in the service data sequence X and a sequence B formed by the (k+1)th to the nth service data in the service data sequence X, so as to obtain n Pearson correlation coefficients c_k; and to arrange the Pearson correlation coefficients c_k in the numerical order of k to form the autocorrelation sequence C. The numerical order is either from largest to smallest or from smallest to largest.

The spike search module 1313 is configured to determine, based on a preset peak threshold C_th, a plurality of peak values from the autocorrelation sequence C as follows: for each Pearson correlation coefficient c_k, a peak set is formed from c_k and the W_p Pearson correlation coefficients before and the W_p Pearson correlation coefficients after c_k, and the Pearson correlation coefficient that has the largest value in the peak set and is not less than C_th is determined to be a peak value.

The period determining module 1314 is configured to sort the position sequence numbers q of the peak values in the autocorrelation sequence C in ascending order to form a sequence number set, and to subtract each two adjacent position sequence numbers in the sequence number set to obtain the absolute value of the difference between the two adjacent position sequence numbers; and, after the absolute differences are formed into a difference sequence, to perform median calculation on the difference sequence to obtain the median of the difference sequence. Wherein, the median represents the period length T of each service data.

The data grouping module 1321 is configured to group the service data with the same time sequence position into an extended set based on the time sequence position of the acquisition time of the service data in the data cycle of the service data.

The neighborhood determination module 1322 is configured, for each service data y_i, to form the extended neighborhood set N_i of the service data y_i from the service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong, the service data in the extended set to which y_i belongs, and the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_i.

The anomaly detection module 1323 is configured to determine anomalous service data from the multiple service data based on the extended neighborhood set corresponding to each service data.
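As an illustration of the computation attributed to the autocorrelation calculating module 1312, the following Python sketch builds the sequence of coefficients c_k for k = 0, 1, …, n−1. It is not taken from the application and makes two labeled assumptions: sequence A is the first n−k values of X, so that A and B have equal length, and a coefficient that is mathematically undefined (fewer than two points, or a constant subsequence) is reported as 0.0. The function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: returns 0.0 when the coefficient is undefined, i.e. for
    fewer than two points or when either sequence is constant.
    """
    n = len(a)
    if n < 2:
        return 0.0
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def autocorrelation_sequence(x):
    """Return [c_0, ..., c_(n-1)], where c_k correlates x with x shifted by k."""
    n = len(x)
    return [pearson(x[: n - k], x[k:]) for k in range(n)]


# Hypothetical series with a spike that recurs every 4 samples.
X = [1, 10, 1, 1] * 3
C = autocorrelation_sequence(X)
print([round(c, 2) for c in C])
```

For this sample series the coefficients at lags 0, 4 and 8 are close to 1, which is the kind of structure the spike search module 1313 then looks for.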

The specific implementation principle and the implemented technical effect of the abnormal data processing device provided in the embodiment of the present application are similar to the specific implementation principle and the implemented technical effect of the embodiment shown in fig. 4, and are not described herein again.

The embodiment of the application also provides an abnormal data processing device. Fig. 6 is a structural diagram of an abnormal data processing apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus includes a processor 61 and a memory 62, and the memory 62 stores executable instructions of the processor 61, so that the processor 61 can be used to execute the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again. It should be understood that the Processor 61 may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor. The Memory 62 may include a high-speed Random Access Memory (RAM), a Non-volatile Memory (NVM), at least one disk Memory, a usb disk, a removable hard disk, a read-only Memory, a magnetic disk, or an optical disk.

The embodiment of the present application further provides a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are executed by the processor, the method for processing abnormal service data is implemented. The storage medium may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

The embodiments of the present application also provide a program product, such as a computer program, which when executed by a processor, implements the method for processing abnormal service data covered by the present application.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for processing abnormal service data is characterized by comprising the following steps:

for each service data y_i, forming the extended neighborhood set N_i of the service data y_i from: the service data in the extended sets to which the service data corresponding to the W_d acquisition times before and the W_d acquisition times after the acquisition time t_i of y_i belong; the service data in the extended set to which y_i belongs; and the service data corresponding to the W_s acquisition times before and the W_s acquisition times after the acquisition time t_i;

2. The method of claim 1, wherein grouping the service data with the same time sequence position into an extended set based on the time sequence position of the collection time of the service data in the data cycle of the service data comprises:

3. The method of claim 1, wherein grouping the service data with the same time sequence position into an extended set based on the time sequence position of the collection time of the service data in the data cycle of the service data comprises:

4. The method according to any one of claims 1 to 3, wherein determining abnormal traffic data from the plurality of traffic data based on the extended neighborhood set corresponding to each traffic data comprises:

determining the service data y_i satisfying the formula |y_i − μ_i| > K·σ_i to be abnormal data; wherein,

5. The method according to any one of claims 1 to 3, wherein the obtaining of the period length T of each service data comprises:

arranging the Pearson correlation coefficients c_k in the numerical order of k to form an autocorrelation sequence C; the numerical order being either from largest to smallest or from smallest to largest;

based on a preset peak threshold C_th, determining a plurality of peak values from the autocorrelation sequence C as follows: for each Pearson correlation coefficient c_k, forming a peak set from c_k and the W_p Pearson correlation coefficients before and the W_p Pearson correlation coefficients after c_k, and determining the Pearson correlation coefficient that has the largest value in the peak set and is not less than C_th to be a peak value;

wherein W_p is a preset value; i = 1, 2, 3, …, n; k = 0, 1, 2, 3, …, (n−1); q = 1, 2, 3, …, n; n ≥ 2 and n is a natural number.

6. An exception data handling apparatus, comprising: a cycle determination unit and a data identification unit;

7. The apparatus of claim 6, wherein the period determination unit comprises a processing module, an autocorrelation calculation module, a spike search module, a period determination module, and the data identification unit comprises a data grouping module, a neighborhood determination module, and an anomaly detection module;

8. An exception data handling apparatus, comprising:

a processor and a memory;

the memory stores executable instructions executable by the processor;

wherein execution of the executable instructions stored by the memory by the processor causes the processor to perform the method of any of claims 1-5.

9. A storage medium having stored therein computer executable instructions for performing the method of any one of claims 1-5 when executed by a processor.

10. A program product comprising a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 5.