CN111191669B - Data processing method, device and storage medium - Google Patents

Data processing method, device and storage medium

Publication number
CN111191669B
Authority
CN
China
Prior art keywords
data
processed
algorithm
time sequence
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811362456.7A
Other languages
Chinese (zh)
Other versions
CN111191669A (en
Inventor
刘芳
赵洪松
孙芳杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Heilongjiang Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201811362456.7A priority Critical patent/CN111191669B/en
Publication of CN111191669A publication Critical patent/CN111191669A/en
Application granted granted Critical
Publication of CN111191669B publication Critical patent/CN111191669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data processing method comprising the following steps: determining the data type of the data to be processed; selecting, according to the data type, a data analysis algorithm that meets an expected data analysis performance index; and performing data analysis on the data to be processed using the selected data analysis algorithm. The invention also discloses a data processing device and a computer storage medium.

Description

Data processing method, device and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, and storage medium.
Background
In the prior art, data analysis algorithms are generally used for data analysis.
Currently, a single data analysis algorithm often cannot meet the accuracy and precision requirements of different data analysis results. As the application range and importance of data in society increases, how to select data analysis algorithms has become a very important issue at present.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a data processing method, a data processing device and a computer storage medium.
The embodiment of the invention provides a data processing method, which comprises the following steps:
determining the data type of the data to be processed;
selecting, according to the data type, a data analysis algorithm that meets an expected data analysis performance index;
and performing data analysis on the data to be processed using the selected data analysis algorithm.
In the above scheme, the determining the type of the data to be processed includes:
Determining whether the data to be processed is periodic data; the periodic data is data generated with a preset time length as its period.
In the above scheme, the selecting, according to the data type, a data analysis algorithm that meets an expected data analysis performance index includes:
If the data to be processed is periodic data, adopting an analog periodic algorithm; wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time sequence index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the normal-distribution standard deviation confidence of the time sequence index values.
In the above solution, according to the data type, selecting a data analysis algorithm that meets an expected data analysis performance index, further includes:
And if the data to be processed is non-periodic data, determining whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold value.
In the above scheme, if the local fluctuation amplitude of the data to be processed is greater than the fluctuation amplitude threshold, a ring ratio average algorithm is adopted; wherein, the ring ratio average algorithm adopts the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, and 1 ≤ m ≤ n.
In the above scheme, if the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, a threshold mutation algorithm is adopted; wherein, the threshold mutation algorithm adopts the following formula:
Wherein the expected values E(t_m) and E(t_2m) are averages of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time sequence index value at the j-th time point, and the ratio of the two expected values is the mutation factor.
An embodiment of the present invention provides a data processing apparatus, including:
The data classification module is used for determining the data type of the data to be processed;
The algorithm acquisition module is used for selecting and acquiring a data analysis algorithm meeting expected data analysis performance indexes according to the data type;
and the data analysis module is used for carrying out data analysis on the data to be processed by utilizing the selected data analysis algorithm.
In the above scheme, the data classification module is further configured to determine whether the data to be processed is periodic data; the periodic data is data generated with a preset time length as its period.
In the above scheme, the algorithm acquisition module is further configured to:
If the data to be processed is periodic data, adopting an analog periodic algorithm; wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time sequence index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the normal-distribution standard deviation confidence of the time sequence index values.
In the above scheme, the algorithm acquisition module is further configured to:
And if the data to be processed is non-periodic data, determining whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold value.
In the above scheme, the algorithm acquisition module is further configured to:
If the local fluctuation amplitude of the data to be processed is larger than the fluctuation amplitude threshold, adopting a ring ratio average algorithm; wherein, the ring ratio average algorithm adopts the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, and 1 ≤ m ≤ n.
In the above scheme, the algorithm acquisition module is further configured to:
If the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, a threshold mutation algorithm is adopted; wherein, the threshold mutation algorithm adopts the following formula:
Wherein the expected values E(t_m) and E(t_2m) are averages of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time sequence index value at the j-th time point, and the ratio of the two expected values is the mutation factor.
In some embodiments, when the mutation factor reaches or exceeds p, the time sequence index value has risen suddenly; when the mutation factor is at or below q, the time sequence index value has dropped suddenly; p is the sudden-rise threshold and q is the sudden-drop threshold.
The embodiment of the invention also provides a data processing device, which comprises: a processor and a memory for storing a computer program capable of running on the processor;
wherein the processor is configured to execute the steps of any one of the data processing methods described above when running the computer program.
The embodiment of the invention also provides a computer storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of any of the above-mentioned data processing methods.
According to the technical scheme, the data type of the data to be processed is determined, and a data analysis algorithm that meets the expected data analysis performance index is selected to analyze the data. The data type is thereby dynamically matched to a suitable data analysis algorithm, which avoids the inaccurate or insufficiently precise results caused by an unsuitable algorithm, improves the accuracy and precision of the data analysis result, and optimizes data processing.
Drawings
The drawings illustrate generally, by way of example and not by way of limitation, various embodiments discussed herein.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the invention;
FIG. 2 is a block diagram of a module of a data processing apparatus according to an embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 4 is a graph of time series data analyzed by an analog periodic algorithm;
FIG. 5 is a graph of periodic class ratios of an index sequence in gamma space;
FIG. 6 is a graph of time series data analyzed by the ring ratio average algorithm;
FIG. 7 is a graph of time series data analyzed by a threshold mutation algorithm;
FIG. 8 is a diagram of VCores index time sequence data according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the hardware structure of an apparatus according to an embodiment of the present invention.
Detailed Description
For a more complete understanding of the nature and the technical content of the embodiments of the present invention, reference should be made to the following detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, which are meant to be illustrative only and not limiting of the embodiments of the invention.
An embodiment of a data processing method of the present invention, as shown in fig. 1, includes:
Step 101, determining the data type of the data to be processed.
The data to be processed may be time sequence data. The data types may include: periodic data and non-periodic data other than the periodic data.
The periodic data are data taking all data generated in a period as a data analysis unit. For example: monthly communication data for certain mobile users, daily system access data for certain server devices, hourly video data stored by certain video surveillance systems, and the like.
In some embodiments, step 101 includes determining whether the data to be processed is periodic data. Specifically, a Fourier transform can be performed on the data to be processed, a wavelet variance can then be calculated on the transformed data, and the wavelet variance result is used to check whether the data to be processed is periodic data.
Step 102, selecting, according to the data type, a data analysis algorithm that meets an expected data analysis performance index.
A data analysis algorithm meets the expected data analysis performance index when analyzing data of the corresponding data type with it achieves the expected performance index value. For example, the accuracy of the data analysis result reaches the expected accuracy, or the accuracy of abnormality early warnings based on the data analysis result reaches the expected accuracy.
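The periodicity check in step 101 can be illustrated with a minimal sketch. It covers only the Fourier stage and swaps the wavelet variance stage for a simpler rule that is an assumption of this sketch: the series counts as periodic when a single frequency carries at least `min_power_share` of the spectral power. The function name and the default share of 0.5 are hypothetical.

```python
import cmath
import math


def dominant_period(values, min_power_share=0.5):
    """Return the dominant period in samples, or None if no frequency dominates.

    Assumption (not from the patent): "periodic" means one DFT frequency
    holds at least min_power_share of the mean-removed spectral power.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    powers = {}
    # Naive discrete Fourier transform over the positive frequencies.
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centered))
        powers[k] = abs(coeff) ** 2
    total = sum(powers.values())
    if total == 0:
        return None  # constant series: no periodic component
    k_best = max(powers, key=powers.get)
    if powers[k_best] / total < min_power_share:
        return None
    return n / k_best


series = [math.sin(2 * math.pi * i / 8) for i in range(32)]
print(dominant_period(series))  # 8.0
```

A production check would follow the Fourier transform with the wavelet variance calculation the description names; this sketch does not attempt that step.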
Step 102 includes, but is not limited to, at least one of the following:
If the data to be processed is periodic data, an analog periodic algorithm is adopted.
For periodic data, the analog periodic algorithm uses reference values from past periods to analyze the same kind of parameters in future periods. Compared with other algorithms, the analog periodic algorithm is generally cheaper, faster, and more reliable for periodic data.
If the data to be processed is non-periodic data, it is determined whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold; the fluctuation amplitude threshold can be set according to actual needs.
If the local fluctuation amplitude of the data to be processed is greater than the fluctuation amplitude threshold, a ring ratio average algorithm is adopted.
When the local fluctuation amplitude of non-periodic data is greater than the fluctuation amplitude threshold, the data fluctuates strongly and lacks periodicity, so a more accurate algorithm is required. The ring ratio average algorithm analyzes subsequent time sequence index values against the average of the time sequence index values in the preceding period of time. Compared with other algorithms, the ring ratio average algorithm is flexible: the comparison time period can be adjusted as needed to achieve accurate judgment.
If the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, a threshold mutation algorithm is adopted.
When the local fluctuation amplitude of non-periodic data is not greater than the fluctuation amplitude threshold, the data fluctuates little and lacks periodicity, so the threshold mutation algorithm can identify data whose mutation amplitude exceeds a preset threshold while saving computing resources. Compared with other algorithms, the threshold mutation algorithm is low in cost, responds quickly, and accurately locates data whose mutation amplitude exceeds the preset threshold.
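The selection logic of step 102 can be sketched as a small dispatch function. The names `Algorithm` and `select_algorithm` are hypothetical, and the description does not define how the local fluctuation amplitude is measured, so it is passed in as an already computed number.

```python
from enum import Enum


class Algorithm(Enum):
    ANALOG_PERIODIC = "analog periodic"
    RING_RATIO_AVERAGE = "ring ratio average"
    THRESHOLD_MUTATION = "threshold mutation"


def select_algorithm(is_periodic, local_fluctuation, fluctuation_threshold):
    """Pick an analysis algorithm from the data type, following step 102."""
    if is_periodic:
        return Algorithm.ANALOG_PERIODIC
    # Non-periodic data: the choice depends on the local fluctuation amplitude.
    if local_fluctuation > fluctuation_threshold:
        return Algorithm.RING_RATIO_AVERAGE
    return Algorithm.THRESHOLD_MUTATION


print(select_algorithm(False, 12.0, 5.0))  # Algorithm.RING_RATIO_AVERAGE
```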
Step 103, performing data analysis on the data to be processed using the selected data analysis algorithm.
The analog periodic algorithm uses the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time sequence index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the normal-distribution standard deviation confidence of the time sequence index values. α can be expressed as a percentage and represents the confidence of the normal-distribution standard deviation of the time sequence index values contained in the data to be processed. In some practical applications α may be affected by the stability of the equipment and system: the higher the stability, the closer α is to 100%.
The ring ratio average algorithm adopts the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, and 1 ≤ m ≤ n.
The threshold mutation algorithm uses the following formula:
Wherein the expected values E(t_m) and E(t_2m) are averages of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time sequence index value at the j-th time point, and the ratio of the two expected values is the mutation factor.
In some embodiments, when the mutation factor reaches or exceeds p, the time sequence index value has risen suddenly; when the mutation factor is at or below q, the time sequence index value has dropped suddenly; p is the sudden-rise threshold and q is the sudden-drop threshold.
An embodiment of the present invention provides a data processing apparatus 21, whose structure is shown in FIG. 2, including: a data classification module 201, an algorithm acquisition module 202, and a data analysis module 203; wherein,
The data classification module 201 is configured to determine a data type of the data to be processed.
In some embodiments, the data classification module 201 is further configured to determine whether the data to be processed is periodic data; the periodic data is data generated with a preset time length as its period.
The algorithm acquisition module 202 is configured to select, according to the data type, a data analysis algorithm that meets an expected data analysis performance index;
In some embodiments, the algorithm acquisition module 202 is further configured to: if the data to be processed is periodic data, adopting an analog periodic algorithm. Wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time sequence index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the normal-distribution standard deviation confidence of the time sequence index values.
In some embodiments, the algorithm acquisition module 202 is further configured to:
And if the data to be processed is non-periodic data, determining whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold value.
If the local fluctuation amplitude of the data to be processed is larger than the fluctuation amplitude threshold, adopting a ring ratio average algorithm; wherein, the ring ratio average algorithm adopts the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, and 1 ≤ m ≤ n.
If the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, a threshold mutation algorithm is adopted; wherein, the threshold mutation algorithm adopts the following formula:
Wherein the expected values E(t_m) and E(t_2m) are averages of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time sequence index value at the j-th time point, and the ratio of the two expected values is the mutation factor.
In some embodiments, when the mutation factor reaches or exceeds p, the time sequence index value has risen suddenly; when the mutation factor is at or below q, the time sequence index value has dropped suddenly; p is the sudden-rise threshold and q is the sudden-drop threshold.
The data analysis module 203 is configured to perform data analysis on the data to be processed by using the selected data analysis algorithm.
The data processing method according to one embodiment of the present invention, the flow of which is shown in fig. 3, includes:
Basic data is obtained from a data source, and whether data preprocessing 301 is needed is determined according to the regularity of the acquisition frequency of the basic data; where preprocessing is needed, the average of the data at adjacent time points can be used for smoothing.
The data to be processed is obtained through data preprocessing 301, and it is then judged whether the data to be processed is periodic data 302. This judgment can be made by performing a Fourier transform on the data to be processed and then calculating the wavelet variance.
In the case of periodic data, an analog periodic algorithm 303 is employed.
If the data is non-periodic data, judging whether the local fluctuation amplitude of the data to be processed is larger than a fluctuation amplitude threshold 304. The fluctuation amplitude threshold can be set according to actual needs.
If the local fluctuation amplitude of the data to be processed is greater than the fluctuation amplitude threshold, a ring ratio average algorithm 305 is adopted.
If the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, a threshold mutation algorithm 306 is adopted.
Wherein the analog period algorithm 303 employs the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time sequence index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the normal-distribution standard deviation confidence of the time sequence index values.
In one embodiment of the analog period algorithm 303, the indexes x_n(t_1), x_{n-1}(t_2), ..., x_{n-m+1}(t_m) (1 ≤ m ≤ n) are time sequence index values generated by the big data cluster system within any period of time, with time points 0 < t_1 < t_2 < ... < t_m. The expected value E(t) of x_n(t_1), x_{n-1}(t_2), ..., x_{n-m+1}(t_m) is calculated and set as the critical threshold of the time sequence index value. The deviation of each time sequence index value from the expected value, x_i - E(t), is calculated; the mean of the squared deviations is the variance, and its square root is the standard deviation σ(t).
The periodic class ratio γ(t) of the index sequence in γ space is then calculated with the normal-distribution standard deviation confidence of the time sequence index values set to 99%, compared with the set threshold E(t), and values outside the threshold range are taken as abnormal. The time sequence data analyzed by the analog period algorithm in FIG. 4 is mapped onto the marked threshold line γ(t) = -1 in FIG. 5, which yields the abnormal interval values with γ(t) ≤ 1 as a period.
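The first steps of this embodiment, the expected value E(t) and the standard deviation σ(t), can be sketched as follows. The function name is hypothetical, and the periodic class ratio γ(t) is deliberately left out because its formula is not given in this text.

```python
import math


def expected_value_and_std(window):
    """Return (E(t), sigma(t)) for a window of time sequence index values."""
    m = len(window)
    expected = sum(window) / m                      # E(t): mean of the window
    deviations = [x - expected for x in window]     # x_i - E(t)
    variance = sum(d * d for d in deviations) / m   # mean of squared deviations
    return expected, math.sqrt(variance)            # sigma(t) is the square root


print(expected_value_and_std([2, 4, 4, 4, 5, 5, 7, 9]))  # (5.0, 2.0)
```

The variance divides by m rather than m - 1, matching the description's "mean of the squared deviations".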
The ring ratio average algorithm 305 uses the following formula:
Wherein the expected value E(t) is the average of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, and 1 ≤ m ≤ n.
In one specific embodiment of the ring ratio average algorithm 305, FIG. 6 compares the fluctuation of a time sequence index value of a big data cluster with the horizontal marked threshold s:
The indexes x_n(t_1), x_{n-1}(t_2), ..., x_{n-m+1}(t_m) (1 ≤ m ≤ n) are time sequence index values generated by the big data cluster system within a period of time, with time points 0 < t_1 < t_2 < ... < t_m. The expected value E(t) of these values is calculated and a critical threshold s is set for the time sequence index; an index that exceeds the threshold is an abnormal point to be optimized. In production, the value of s can be configured flexibly according to best practice.
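A minimal sketch of the ring ratio average idea follows. The text does not say exactly which quantity is compared with s, so this sketch assumes that a point is abnormal when its value divided by the mean of the m preceding values exceeds s. That comparison rule and the function name are assumptions, not the patent's formula.

```python
def ring_ratio_anomalies(values, m, s):
    """Return indices whose value exceeds s times the mean of the m preceding values.

    Assumption: "exceeds the threshold" is read as value / trailing mean > s.
    """
    anomalies = []
    for i in range(m, len(values)):
        window = values[i - m:i]
        expected = sum(window) / m  # E(t) over the preceding period
        if expected > 0 and values[i] / expected > s:
            anomalies.append(i)
    return anomalies


print(ring_ratio_anomalies([10, 10, 10, 10, 30, 10], m=4, s=2.0))  # [4]
```

Changing m adjusts the comparison period, which is the flexibility the description attributes to this algorithm.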
The threshold mutation algorithm 306 uses the following formula:
Wherein the expected values E(t_m) and E(t_2m) are averages of the time sequence index values contained in the data to be processed, t represents time, x_i is the time sequence index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time sequence index value at the j-th time point, and the ratio of the two expected values is the mutation factor.
In some embodiments, when the mutation factor reaches or exceeds p, the time sequence index value has risen suddenly; when the mutation factor is at or below q, the time sequence index value has dropped suddenly; p is the sudden-rise threshold and q is the sudden-drop threshold.
In one embodiment of the threshold mutation algorithm 306, x_n(t_1), x_{n-1}(t_2), ..., x_{n-m+1}(t_m) (1 ≤ m ≤ n) and x_{n-m}(t_{m+1}), x_{n-m-1}(t_{m+2}), ..., x_{n-2m+1}(t_2m) (1 ≤ m ≤ n) are two consecutive segments of time sequence index values in the space R of a certain time sequence index of the big data cluster. The expected value E(t_m) of the first segment and the expected value E(t_2m) of the second segment are calculated, and the mutation factor is the ratio of the two expected values. If the mutation factor reaches or exceeds p, the time sequence index value has risen suddenly; if it is at or below q, the time sequence index value has dropped suddenly.
As shown in FIG. 7, if the sudden-rise threshold is set to p = 2, a sudden rise is reported when the expected value E(t_2m) of the index value sequence is two or more times E(t_m); if the sudden-drop threshold is set to q = 0.5, a sudden drop is reported when E(t_2m) is 1/2 or less of E(t_m). The thresholds p and q determine how large a change counts as a mutation; they can be obtained through algorithm training, and their width can be set by operation and maintenance personnel.
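The threshold mutation check can be sketched as below. The direction of the ratio, E(t_2m) divided by E(t_m), is inferred from the p = 2 and q = 0.5 examples and should be treated as an assumption. The function names and the sample values are hypothetical.

```python
def mutation_factor(earlier, later):
    """Ratio of the later segment's mean E(t_2m) to the earlier segment's mean E(t_m)."""
    e_tm = sum(earlier) / len(earlier)
    e_t2m = sum(later) / len(later)
    return e_t2m / e_tm


def classify_mutation(factor, p=2.0, q=0.5):
    """Label a mutation factor using sudden-rise threshold p and sudden-drop threshold q."""
    if factor >= p:
        return "sudden rise"
    if factor <= q:
        return "sudden drop"
    return "no mutation"


factor = mutation_factor([10, 10, 10], [25, 20, 15])
print(factor, classify_mutation(factor))  # 2.0 sudden rise
```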
In one embodiment of the present invention, a basic index matrix for big data cluster optimization is as follows:
The basic index matrix is divided into three groups: an index matrix for applications (Applications), an index matrix for memory (Memory), and an index matrix for thread counts and remote procedure calls (RPC, Remote Procedure Call). Together they describe the statistics and running condition of the whole cluster for tuning, monitoring, and fault diagnosis. The basic index matrix profiles the big data cluster and consolidates the operation and maintenance experience that Heilongjiang Mobile has gained with the most important core indexes while running big data clusters in production. Combined with the machine learning algorithms of the data processing method provided by the invention, such as the ring ratio average algorithm, the threshold mutation algorithm, and the analog period algorithm, it lays the foundation for intelligent operation and maintenance of the big data cluster.
(1) Application index
Name                  Description
Vcores                Number of available VCores
allocatedVCores       Number of VCores allocated to the queue
allocatedMB           Memory allocated to the queue, in MB
allocatedContainers   Number of containers allocated to the queue
Table 1
The key indexes of big data cluster health include the number of available virtual central processing unit (CPU, Central Processing Unit) cores (VCores), the number of virtual CPU cores allocated to the queue, the memory allocated to the queue, and the number of allocated containers (Containers). These indexes accurately profile the running behavior of the big data cluster, and their time sequence data can be analyzed effectively with the data processing method provided by the invention.
(2) Memory index
TABLE 2
The cluster's heap memory and Java garbage collection (GC) metrics directly affect the cluster's file read and write speed and are closely tied to queries and storage in the upper-layer HBase, Hive, and Impala components. In the billing and accounting big data cluster of the Heilongjiang Mobile business and operation support system (BOSS, Business & Operation Support System), setting the heap memory and GC parameters improved the speed of querying cloud detail records and bills.
(3) Thread number and RPC call index
Table 3
The thread count affects both the cluster servers and the clients: increasing or decreasing it can raise the cluster load pressure or leave processing capacity insufficient. By jointly analyzing the running, blocked, waiting, and terminated threads, the load pressure of a node can be adjusted so that the cluster load is balanced. RPC calls are widely used in big data cluster environments and are the basis of communication between clusters, clients, and components within a cluster. A group of RPC indexes can be used to profile, and so help optimize, the interaction among application clients, management nodes, data nodes, and the distributed file system.
One embodiment of the present invention comprises:
(1) Cluster historical index data sampling
The time-series data of the Applications VCores index is extracted as a sample from the index matrix of the big data cluster of the Heilongjiang mobile business system, starting at 2018-03-07T09:00:00. The sample space covers 10 minutes at one sample per minute, i.e., from 2018-03-07T09:00:00 to 2018-03-07T09:10:00; the continuous historical sample data is listed in Table 4.
Time (Time) VCores
2018-03-07T09:01:00 85
2018-03-07T09:02:00 25
2018-03-07T09:03:00 91
2018-03-07T09:04:00 78
2018-03-07T09:05:00 63
2018-03-07T09:06:00 28
2018-03-07T09:07:00 22
2018-03-07T09:08:00 X=(22+32)/2
2018-03-07T09:09:00 32
2018-03-07T09:10:00 50
TABLE 4
The time-series data of the VCores index is shown in Fig. 8.
(2) Data preprocessing
If the time-series data is sampled at a regular frequency, no smoothing preprocessing is needed. As indicated in Fig. 8, no VCores data is available at 2018-03-07T09:08:00, so the missing value is filled with the mean of the VCores index values at 2018-03-07T09:07:00 and 2018-03-07T09:09:00; after preprocessing, the VCores index value at 2018-03-07T09:08:00 is 27.
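The mean-value fill described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `fill_missing_with_neighbor_mean` and the use of `None` to mark a missing sample are assumptions made for the example.

```python
from typing import List, Optional


def fill_missing_with_neighbor_mean(values: List[Optional[float]]) -> List[float]:
    """Replace each isolated missing sample (None) with the mean of its two neighbors.

    Hypothetical helper: the name and the None convention are illustrative assumptions.
    """
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        if i == 0 or i == len(values) - 1:
            raise ValueError("cannot fill a missing sample at the edge of the series")
        left, right = values[i - 1], values[i + 1]
        if left is None or right is None:
            raise ValueError("cannot fill consecutive missing samples")
        filled.append((left + right) / 2)
    return filled


# VCores samples for 09:01 .. 09:10 from Table 4; the 09:08 sample is missing.
vcores = [85, 25, 91, 78, 63, 28, 22, None, 32, 50]
filled_vcores = fill_missing_with_neighbor_mean(vcores)
print(filled_vcores[7])  # (22 + 32) / 2 = 27.0
```

The sketch deliberately refuses to fill edge samples or runs of missing samples, since the text only describes averaging the two adjacent values.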
(3) Calculation process
The wavelet variance calculation shows that the data is non-periodic and that the local fluctuation range is larger than the global fluctuation range, so the threshold mutation algorithm is adopted with p=2 and q=0.5. The historical index sequence values are as follows:
x10(t1)=85,x9(t2)=25,x8(t3)=91,x7(t4)=78,x6(t5)=67
x5(t6)=28,x4(t7)=22,x3(t8)=29,x2(t9)=32,x1(t10)=50
It can be seen that, when the index is larger than the dip threshold p, the index shows a sudden-drop trend. This state is an important focus for optimizing the big data cluster, because it provides diagnostic information for optimization: either the system's subtasks have completed and the system load has dropped, or the system has hit an I/O, network, memory, or disk bottleneck that leaves VCores underutilized. The ring ratio mean algorithm and the analog period algorithm are described in detail above and are not repeated here.
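The calculation step can be illustrated with a short sketch. The formula itself is not reproduced in this text, so the code rests on stated assumptions: the mutation factor is taken to be the mean of the latest m samples divided by the mean of the m samples before them, a factor above p is read as a sudden rise and a factor below q as a sudden drop, and m = 5 splits the ten samples into two halves. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mutation_factor(series: Sequence[float], m: int) -> float:
    """ASSUMED definition: mean of the latest m samples / mean of the m samples before them."""
    if m < 1 or len(series) < 2 * m:
        raise ValueError("need at least 2*m samples")
    recent = series[-m:]
    previous = series[-2 * m:-m]
    return mean(recent) / mean(previous)


def classify(factor: float, p: float = 2.0, q: float = 0.5) -> str:
    """ASSUMED rule: above p is a sudden rise, below q is a sudden drop."""
    if factor > p:
        return "sudden rise"
    if factor < q:
        return "sudden drop"
    return "normal"


# Values as listed in the calculation step, oldest (t1) first.
history = [85, 25, 91, 78, 67, 28, 22, 29, 32, 50]
factor = mutation_factor(history, m=5)
state = classify(factor)
print(round(factor, 3), state)
```

Under these assumptions the latest five samples average 32.2 against 69.2 for the five before them, so the factor is below q = 0.5, which agrees with the sudden-drop trend reported for this sample.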
In order to implement the data processing method according to the embodiment of the present invention, the embodiment of the present invention further provides a data processing device implemented based on hardware, as shown in fig. 9, where the data processing device 91 includes: a processor 901 and a memory 902 for storing a computer program capable of running on the processor, wherein,
The processor 901 is configured to execute, when executing the computer program:
Determining the data type of the data to be processed;
Selecting a data analysis algorithm meeting expected data analysis performance indexes according to the data types;
And carrying out data analysis on the data to be processed by using the selected data analysis algorithm.
In some embodiments, the processor 901 is further configured to, when executing the computer program, perform:
Determining whether the data to be processed is periodic data; the periodic data is data generated with a preset duration as its period.
If the data to be processed is periodic data, adopting an analog periodic algorithm; wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time-series index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the confidence coefficient of the normal-distribution standard deviation of the time-series index values.
In some embodiments, the processor 901 is further configured to, when executing the computer program, perform:
If the data to be processed is non-periodic data, determining whether the local fluctuation amplitude of the data to be processed reaches a preset threshold value.
If the local fluctuation amplitude of the data to be processed is greater than the preset threshold, a ring ratio mean algorithm is adopted; wherein the ring ratio mean algorithm adopts the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, and 1 ≤ m ≤ n.
If the local fluctuation amplitude of the data to be processed is not greater than the preset threshold value, a threshold value mutation algorithm is adopted; wherein, the threshold mutation algorithm adopts the following formula:
Wherein the expected values E(t_m) and E(t_2m) are means of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time-series index value at the j-th time point, and the remaining symbol in the formula is the mutation factor.
In some embodiments, the mutation factor is compared with the thresholds to determine whether the time-series index value has risen suddenly or dropped suddenly; p is the sudden-rise threshold and q is the sudden-drop threshold.
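The selection logic described in these embodiments amounts to a small decision tree, sketched below. The enum and function names are hypothetical, and the sketch assumes the caller has already determined periodicity and measured the local fluctuation amplitude (for example by the wavelet variance calculation mentioned in the worked example); it does not implement the three algorithms themselves.

```python
from enum import Enum


class Algorithm(Enum):
    ANALOG_PERIODIC = "analog periodic"
    RING_RATIO_MEAN = "ring ratio mean"
    THRESHOLD_MUTATION = "threshold mutation"


def select_algorithm(
    is_periodic: bool,
    local_fluctuation: float,
    fluctuation_threshold: float,
) -> Algorithm:
    """Pick the analysis algorithm from the data type, following the described branching."""
    if is_periodic:
        # Periodic data: analog periodic algorithm.
        return Algorithm.ANALOG_PERIODIC
    if local_fluctuation > fluctuation_threshold:
        # Non-periodic data whose local fluctuation exceeds the threshold.
        return Algorithm.RING_RATIO_MEAN
    # Non-periodic data whose local fluctuation does not exceed the threshold.
    return Algorithm.THRESHOLD_MUTATION


print(select_algorithm(False, 0.2, 0.5).value)
```

Keeping the selection separate from the measurements makes the branching easy to test on its own, independent of how periodicity is detected.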
The data processing device provided in the above embodiment belongs to the same concept as the data processing method embodiment; its specific implementation is detailed in the method embodiment and is not repeated here.
Of course, in practical application, as shown in fig. 9, the data processing apparatus may further comprise at least one communication interface 903. The various components in the data processing apparatus are coupled together by a bus system 904. It is appreciated that the bus system 904 is used to facilitate connected communications between these components. The bus system 904 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration, the various buses are labeled as bus system 904 in fig. 9.
Wherein the communication interface 903 is used to interact with other devices.
Specifically, the processor 901 may send an operation result query request to an application server corresponding to the callee application through the communication interface 903, to obtain an operation result of the callee application sent by the application server.
It is to be appreciated that the memory 902 can be either volatile memory or nonvolatile memory, or can include both. The nonvolatile memory may be read-only memory (ROM, Read-Only Memory), programmable read-only memory (PROM, Programmable Read-Only Memory), erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), ferromagnetic random access memory (FRAM, Ferromagnetic Random Access Memory), flash memory (Flash Memory), magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be random access memory (RAM, Random Access Memory), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory 902 described in the embodiments of the present invention is intended to include, without being limited to, these and any other suitable types of memory.
In an embodiment of the present invention, a computer readable storage medium is further provided for storing the computer program provided in the foregoing embodiment, so as to complete the steps of the foregoing method. The computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM; it may also be any of various devices that include one or any combination of the above memories, such as a mobile phone, a computer, a smart home appliance, or a server.
The technical advantages of the invention are mainly represented by the following aspects:
(1) The invention can reduce operation and maintenance costs and improve the efficiency of optimizing and diagnosing multi-node, multi-component clusters. The basic operation and maintenance indexes can be monitored comprehensively through machine-learning-related algorithms, which avoids having to log in to the nodes one by one to check the cluster's log state manually.
(2) The ring ratio mean algorithm, the threshold mutation algorithm, and the analog period algorithm provided by the invention learn automatically to achieve fine-grained management of the cluster's basic indexes. The proposed basic index matrix is combined with algorithms for different scenarios to describe the running fluctuation and periodic variation of the cluster system, which avoids the high false alarm rate that comes from simply defining a fixed threshold.
(3) The invention addresses the low accuracy of decisions based on static data and the lack of before-and-after trend analysis. The big data cluster optimization analysis method dynamically describes the pressure changes of the big data cluster and its periodic changes over different time windows, and determines whether the index time-series data contains abrupt anomalies through in-depth mining and analysis of the dynamic change patterns and trends.
(4) The big data cluster basic index matrix provided by the invention classifies health indexes in a standardized way; combined with a unified algorithm decision tree, it supports comprehensive monitoring and management of the operation index system, displays a health portrait of the system, and improves global control over the cluster's operating state.
In summary, the invention enables rapid fault localization in large data clusters, effectively reduces cost, assists optimization decisions by analyzing and mining operation data, keeps the system running stably, and frees personnel for more advanced technical research and production system management work.
It should be noted that the technical schemes described in the embodiments of the present invention may be combined arbitrarily, provided they do not conflict.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the scope of the present invention.

Claims (10)

1. A method of data processing, the method comprising:
collecting data of a first index over a period of time as data to be processed; the first index is an index in a basic index matrix for big data cluster optimization; the data to be processed is time-series data;
determining whether the data to be processed is periodic data; the periodic data is data generated with a preset duration as its period; the periodic data includes at least one of: monthly communication data of mobile users; daily access data of a server device; video data stored every hour by a video monitoring system;
selecting, according to the result of determining whether the data to be processed is periodic data, a data analysis algorithm that meets an expected data analysis performance index;
performing data analysis on the data to be processed by using the selected data analysis algorithm;
monitoring the running condition of the data to be processed according to the analysis result;
wherein the selecting, according to the result of determining whether the data to be processed is periodic data, a data analysis algorithm that meets an expected data analysis performance index comprises:
if the data to be processed is periodic data, adopting an analog periodic algorithm;
if the data to be processed is non-periodic data, determining whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold; if the local fluctuation amplitude of the data to be processed is greater than the fluctuation amplitude threshold, adopting a ring ratio mean algorithm; and if the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, adopting a threshold mutation algorithm.
2. The method of claim 1, wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time-series index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the confidence coefficient of the normal-distribution standard deviation of the time-series index values.
3. The method of claim 1, wherein the ring ratio mean algorithm employs the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, and 1 ≤ m ≤ n.
4. The method of claim 1, wherein the threshold mutation algorithm employs the following formula:
Wherein the expected values E(t_m) and E(t_2m) are means of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time-series index value at the j-th time point, and the remaining symbol in the formula is the mutation factor.
5. A data processing apparatus, the apparatus comprising:
a data acquisition module, configured to collect data of a first index over a period of time as data to be processed; the first index is an index in a basic index matrix for big data cluster optimization; the data to be processed is time-series data;
a data classification module, configured to determine whether the data to be processed is periodic data; the periodic data is data generated with a preset duration as its period; the periodic data includes at least one of: monthly communication data of mobile users; daily access data of a server device; video data stored every hour by a video monitoring system;
an algorithm acquisition module, configured to select and acquire, according to the result of determining whether the data to be processed is periodic data, a data analysis algorithm that meets an expected data analysis performance index;
a data analysis module, configured to perform data analysis on the data to be processed by using the selected data analysis algorithm;
a data monitoring module, configured to monitor the running condition of the data to be processed according to the analysis result;
wherein the algorithm acquisition module is further configured to:
if the data to be processed is periodic data, adopt an analog periodic algorithm;
if the data to be processed is non-periodic data, determine whether the local fluctuation amplitude of the data to be processed reaches a fluctuation amplitude threshold; if the local fluctuation amplitude of the data to be processed is greater than the fluctuation amplitude threshold, adopt a ring ratio mean algorithm; and if the local fluctuation amplitude of the data to be processed is not greater than the fluctuation amplitude threshold, adopt a threshold mutation algorithm.
6. The apparatus of claim 5, wherein the analog periodic algorithm employs the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, t represents time, σ(t) is the standard deviation of the time-series index values, γ(t) is the periodic class ratio of the index sequence in γ space, and α is the confidence coefficient of the normal-distribution standard deviation of the time-series index values.
7. The apparatus of claim 5, wherein the ring ratio mean algorithm employs the following formula:
Wherein the expected value E(t) is the mean of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, and 1 ≤ m ≤ n.
8. The apparatus of claim 5, wherein the threshold mutation algorithm employs the following formula:
Wherein the expected values E(t_m) and E(t_2m) are means of the time-series index values contained in the data to be processed, t represents time, x_i is the time-series index value at the i-th time point, 1 ≤ m ≤ n, x_j is the time-series index value at the j-th time point, and the remaining symbol in the formula is the mutation factor.
9. A data processing apparatus, the apparatus comprising: a processor and a memory for storing a computer program capable of running on the processor;
wherein the processor is adapted to perform the steps of the method of any of claims 1 to 4 when the computer program is run.
10. A computer storage medium having stored thereon a computer program, which when executed by a processor realizes the steps of the method according to any of claims 1 to 4.
CN201811362456.7A 2018-11-15 2018-11-15 Data processing method, device and storage medium Active CN111191669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811362456.7A CN111191669B (en) 2018-11-15 2018-11-15 Data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811362456.7A CN111191669B (en) 2018-11-15 2018-11-15 Data processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111191669A CN111191669A (en) 2020-05-22
CN111191669B true CN111191669B (en) 2024-05-07

Family

ID=70707167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811362456.7A Active CN111191669B (en) 2018-11-15 2018-11-15 Data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111191669B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004104865A2 (en) * 2003-05-12 2004-12-02 Sun Microsystems, Inc. Methods and systems for intellectual capital sharing and control
DE102006042975A1 (en) * 2006-09-13 2008-03-27 Siemens Ag Method for operating communication network comprising several nodes, particularly of sensor network, involves learning model parameters with learning method and comparing predication error parameter with estimated error parameter
WO2017024045A1 (en) * 2015-08-04 2017-02-09 James Carey Video identification and analytical recognition system

Also Published As

Publication number Publication date
CN111191669A (en) 2020-05-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant