CN115686995A - Data monitoring processing method and device - Google Patents


Publication number
CN115686995A
Authority
CN
China
Prior art keywords
data
monitoring
model
dimensional
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210815893.XA
Other languages
Chinese (zh)
Inventor
张娇
陈耀文
杨玉新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210815893.XA
Publication of CN115686995A

Abstract

The invention provides a data monitoring processing method and device, relating to the technical field of data monitoring, which can be used in the financial field or other technical fields. The method comprises: acquiring multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time; and monitoring the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result, wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data. The device performs the above method. The data monitoring processing method and device provided by the embodiments of the invention can monitor data comprehensively and accurately, so that system risks can be prevented and controlled in a timely manner.

Description

Data monitoring processing method and device
Technical Field
The invention relates to the technical field of data monitoring, in particular to a data monitoring processing method and device.
Background
With the gradual advance of IT architecture transformation and digital transformation, more and more application systems are being migrated to distributed system frameworks based on x86 servers. The adoption of new system service frameworks and distributed database servers introduces a series of problems for daily technical testing and makes JVM data monitoring more difficult.
Disclosure of Invention
To address the problems in the prior art, embodiments of the present invention provide a data monitoring processing method and device that can at least partially solve those problems.
In one aspect, the present invention provides a data monitoring processing method, including:
acquiring multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time;
monitoring the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data.
Wherein, acquiring the multi-dimensional sample data comprises:
acquiring initial multi-dimensional data, and performing data cleaning on data fields of the initial multi-dimensional data to obtain model characteristic index system data; the model characteristic index system data comprises application system environment index system data, virtual machine basic parameter index system data and test log information index system data;
performing data cleaning on the data field content of the model feature index system data to obtain abnormal test data within a test period;
and discretizing, normalizing, and vectorizing the abnormal test data in sequence, and labeling the results, to obtain the multi-dimensional sample data.
Wherein, obtaining the derived feature sample data comprises:
acquiring the change in old-generation heap memory, the change in the garbage collection count, and the change in garbage collection time;
calculating the ratio of each of the change in old-generation heap memory, the change in the garbage collection count, and the change in garbage collection time to a preset monitoring period;
and discretizing, normalizing, and vectorizing each ratio in sequence, and labeling the results, to obtain the derived feature sample data.
Wherein the decision tree algorithm model is a distributed gradient boosting framework; correspondingly, training the decision tree algorithm model on the multi-dimensional sample data and the derived feature sample data comprises:
initializing the training parameters of the distributed gradient boosting framework;
and adjusting the training parameters and repeatedly training the distributed gradient boosting framework until a tree depth, leaf node sample weight, and learning weight that avoid overfitting are obtained.
The data monitoring processing method further comprises:
after the distributed gradient boosting framework has been repeatedly trained, checking its generalization ability using both a non-cross-validation mode and a cross-validation mode to obtain the preset monitoring model.
The data monitoring and processing method further comprises the following steps:
if the monitoring result is determined to be an abnormal monitoring result, acquiring the feature weight values of the preset monitoring model;
and sorting the feature weight values in descending order and extracting the top k feature weight values.
The data monitoring and processing method further comprises the following steps:
and updating training data in a training data set according to the monitoring result, wherein the training data comprises the multi-dimensional sample data and the derived feature sample data.
In one aspect, the present invention provides a data monitoring and processing apparatus, including:
an acquisition unit, configured to acquire multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time;
a monitoring unit, configured to monitor the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data.
In another aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time;
monitoring the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data.
An embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method of:
acquiring multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time;
monitoring the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data.
The data monitoring processing method and device provided by the embodiments of the invention acquire multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time; and monitor the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result. Because the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data, data can be monitored comprehensively and accurately, and system risks can be prevented and controlled in a timely manner.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a schematic flow chart of a data monitoring processing method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a data monitoring processing method according to another embodiment of the present invention.
Fig. 3 is a schematic flow chart of the modularization of the data monitoring processing method according to the embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a data monitoring and processing apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
Fig. 1 is a schematic flow chart of a data monitoring processing method according to an embodiment of the present invention, and as shown in fig. 1, the data monitoring processing method according to the embodiment of the present invention includes:
Step S1: acquiring multi-dimensional data and derived features reflecting system performance risk, wherein the multi-dimensional data comprises basic environment information of an application system, virtual machine configuration information, and test log information, and the derived features are the rate of increase of heap memory within the life cycle, the rate of increase of the garbage collection count, and the rate of increase of garbage collection time.
Step S2: monitoring the multi-dimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
wherein the preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data.
In step S1, the device acquires the multi-dimensional data and the derived features reflecting system performance risk, as defined above. The device may be a computer device that performs the method, for example a server. It should be noted that the acquisition and analysis of data involved in the embodiments of the present invention are authorized by the user.
The basic environment information of the application system may include the physical information relevant to setting up the system, including but not limited to the application name, operating system type, test system environment, test cluster node, version number, release time, number of CPUs, and memory.
The virtual machine configuration information may specifically be Java virtual machine configuration information, that is, the configuration generated by the system service during initialization, including but not limited to the initial heap memory, maximum heap memory, minimum heap memory, initial young-generation memory, maximum young-generation memory, Eden/Survivor region ratio, old-generation/young-generation ratio, old-generation memory, permanent-generation memory, GC processing mechanism, throughput, and pause time.
The test log information is recorded per test process and captures the responses of all service information. Its content includes, but is not limited to, the statistics time, process ID, CPU utilization, memory state, GC type, heap information (Survivor region size, Eden region size, old-generation size), GC start time, GC end time, garbage collection count, garbage collection time, and total GC time.
The derived features reflecting system performance risk are explained as follows (garbage collection is abbreviated as GC):
The faster heap memory grows within the life cycle, the greater the risk of a system performance anomaly; the slower it grows, the smaller the risk.
The faster the garbage collection count grows, the greater the risk of a system performance anomaly; the slower it grows, the smaller the risk.
The faster garbage collection time grows, the greater the risk of a system performance anomaly; the slower it grows, the smaller the risk.
In step S2, the device monitors the multi-dimensional data and the derived features based on the preset monitoring model to obtain a monitoring result. The multi-dimensional data and the derived features can be fused and fed to the preset monitoring model as a single input, and the output of the preset monitoring model is taken as the monitoring result.
The monitoring result may include a normal monitoring result, which indicates that the data monitoring result is normal.
The monitoring result may include an anomaly monitoring result indicating that the data monitoring result is anomalous.
The preset monitoring model is obtained by training a decision tree algorithm model on multi-dimensional sample data and derived feature sample data. Acquiring the multi-dimensional sample data comprises:
acquiring initial multi-dimensional data, and performing data cleaning on its data fields to obtain model feature index system data, wherein the model feature index system data comprises application system environment index system data, virtual machine basic parameter index system data, and test log information index system data, as shown in Fig. 2:
Step 1: The embodiment of the present invention involves three main entities: the application system, the database, and the Java virtual machine (JVM). The application system is treated as a whole, and data related to it, such as the system version, database, Java virtual machine, and daily test logs, is collected from it. This data can be refined into three dimensions: basic environment information of the application system, virtual machine configuration information, and test log information.
The application system basic environment information may include, but is not limited to, an application name, an operating system type, a test system environment, a test cluster node, and other data.
The virtual machine configuration information may include, but is not limited to, initial heap memory, maximum heap memory, minimum heap memory, GC (Garbage Collection) processing mechanism, and the like.
The test log information refers to specific processing information of related links involved in a test process.
Step 2: Construct the model feature index system data, which is refined into three parts according to the dimensions of the data sources: application system environment index system data, virtual machine basic parameter index system data, and test log information index system data. The specific index system data is shown in Table 1:
TABLE 1
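[Table 1 appears only as images in the original publication (Figures BDA0003742298170000061 and BDA0003742298170000071); the three index systems are described in items (1) to (3) below.]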
(1) Application system environment index system: the application system environment mainly refers to the physical information relevant to setting up the system, including but not limited to the application name, operating system type, test system environment, test cluster node, version number, release time, number of CPUs, and memory.
(2) JVM basic parameter index system: the JVM basic parameters mainly refer to the Java virtual machine configuration generated by the system service during initialization, including but not limited to the initial heap memory, maximum heap memory, minimum heap memory, initial young-generation memory, maximum young-generation memory, Eden/Survivor region ratio, old-generation/young-generation ratio, old-generation memory, permanent-generation memory, GC processing mechanism, throughput, and pause time.
(3) Test log information index system: the test log information is recorded per test process and captures the responses of all service information. To obtain the key information in the log, natural language processing is used to identify keywords in the log text, locate their positions and content, and extract the target data. The content includes, but is not limited to, the statistics time, process ID, CPU utilization, memory state, GC type, heap information (Survivor region size, Eden region size, old-generation size), GC start time, GC end time, garbage collection count, garbage collection time, and total GC time.
Step 3: Feature engineering processes and selects the raw data collected from multiple dimensions. The raw data set contains many types of structured and unstructured data, comes from many sources, and cannot be matched one-to-one, so different types of data must be processed in different ways and finally aggregated into the initial training set required for model training.
Data cleaning is performed on the data fields of the initial multi-dimensional data, and data fields outside the model feature index system are removed, achieving a preliminary dimensionality reduction and yielding the model feature index system data.
Key information can be extracted from the test log information to obtain the test log information index system data, as follows:
a. Divide the obtained log file by the unique test ID generated in each test process; the log file contains N pieces of test information in total, that is, B0 = {B1, B2, B3, …, BN};
b. Perform text splitting, parsing, and data extraction on B0, identifying the application name, test system environment, test time, GC type, GC time, and other content involved in each test log record to obtain the target feature values. Traverse all log instances in the log file text to obtain the log-related feature vector BsN = {test ID, application name list, test system environment list, GC type list, …};
c. Repeat steps a and b N times until the traversal is complete, forming the log information feature matrix T = {Bs1, Bs2, …, BsN}.
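Steps a to c can be sketched as follows. This is a minimal illustration only: the `key=value` log line format, the field names (`test_id`, `app`, `env`, `gc_type`), and the function `build_log_feature_matrix` are assumptions made for the example, and a simple regular expression stands in for the natural language processing keyword identification described in the text.

```python
import re
from collections import defaultdict

# Hypothetical log line format (an assumption, not taken from the patent):
#   test_id=T1 app=pay env=SIT gc_type=FullGC gc_time=0.35
FIELD_RE = re.compile(r"(\w+)=(\S+)")

def build_log_feature_matrix(lines):
    """Group log lines by test ID (step a) and collect per-test feature lists (step b)."""
    grouped = defaultdict(list)  # plays the role of B0 = {B1, ..., BN}, keyed by test ID
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        if "test_id" in fields:
            grouped[fields["test_id"]].append(fields)

    matrix = []  # plays the role of T = {Bs1, ..., BsN}
    for test_id, records in grouped.items():
        matrix.append({
            "test_id": test_id,
            "app_names": sorted({r["app"] for r in records if "app" in r}),
            "envs": sorted({r["env"] for r in records if "env" in r}),
            "gc_types": [r["gc_type"] for r in records if "gc_type" in r],
        })
    return matrix

sample = [
    "test_id=T1 app=pay env=SIT gc_type=MinorGC gc_time=0.02",
    "test_id=T1 app=pay env=SIT gc_type=FullGC gc_time=0.35",
    "test_id=T2 app=loan env=UAT",
]
feature_matrix = build_log_feature_matrix(sample)
```

Each entry of `feature_matrix` corresponds to one test ID; a test with no GC events simply has an empty `gc_types` list.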
Data cleaning is performed on the data field content of the model feature index system data to obtain abnormal test data within the test period. The specific data cleaning rules are as follows:
if (GC type = NULL) -> the test record is normal and is not an object to be analyzed.
if (GC type != NULL) -> the test record is abnormal and is an object to be analyzed.
The acquisition time interval is defined as: acquisition time interval = last occurrence time of the test ID - first occurrence time of the test ID.
if (acquisition time interval >= 'set threshold') -> the test record is abnormal and is not an object to be analyzed.
if (acquisition time interval < 'set threshold') -> the test record is normal and is an object to be analyzed.
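The cleaning rules can be expressed as a single filter. The sketch below is one possible reading, under the assumption that a record is kept only when both rules designate it as an object to be analyzed; the record layout (`gc_type`, `first_seen`, `last_seen`) and the function name `is_analysis_target` are hypothetical.

```python
from datetime import datetime

def is_analysis_target(record, threshold_seconds):
    """Return True if the record is an object to be analyzed under both rules."""
    # GC type rule: a record without a GC type is not analyzed.
    if record.get("gc_type") is None:
        return False
    # Interval = last occurrence time of the test ID - first occurrence time.
    interval = (record["last_seen"] - record["first_seen"]).total_seconds()
    # Interval rule: a record whose interval reaches the threshold is not analyzed.
    return interval < threshold_seconds

records = [
    {"test_id": "T1", "gc_type": "FullGC",
     "first_seen": datetime(2022, 7, 1, 10, 0, 0), "last_seen": datetime(2022, 7, 1, 10, 5, 0)},
    {"test_id": "T2", "gc_type": None,
     "first_seen": datetime(2022, 7, 1, 10, 0, 0), "last_seen": datetime(2022, 7, 1, 10, 1, 0)},
    {"test_id": "T3", "gc_type": "MinorGC",
     "first_seen": datetime(2022, 7, 1, 10, 0, 0), "last_seen": datetime(2022, 7, 1, 12, 0, 0)},
]
# The threshold of 600 seconds is an illustrative value only.
targets = [r["test_id"] for r in records if is_analysis_target(r, threshold_seconds=600)]
```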
The abnormal test data is discretized, normalized, and vectorized in sequence, and then labeled, to obtain the multi-dimensional sample data. Discretization is explained as follows:
Data is divided according to a set threshold and represented as a Boolean value, for example for the initial heap memory, maximum heap memory, and minimum heap memory:
if (field value < 'set threshold') -> the field is assigned 1; otherwise it is assigned 0.
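A minimal sketch of this discretization rule follows. The field names and threshold values are illustrative assumptions; only the comparison itself comes from the text.

```python
def discretize(value, threshold):
    """Map a numeric field to 1 if it is below the set threshold, otherwise 0."""
    return 1 if value < threshold else 0

# Hypothetical per-field thresholds, in megabytes.
thresholds = {"initial_heap_mb": 512, "max_heap_mb": 2048}
record = {"initial_heap_mb": 256, "max_heap_mb": 4096}
flags = {name: discretize(record[name], limit) for name, limit in thresholds.items()}
```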
Normalization is explained as follows:
Data values are mapped to [0, 1] to eliminate the influence of units on subsequent model construction, for example for the number of CPUs, memory, garbage collection count, garbage collection time, and total GC time. The rule is:
W* = (W - Wmin) / (Wmax - Wmin),
where W is the original value and Wmin and Wmax are the minimum and maximum values of the field.
Each record is then vectorized by taking the model feature index system data constructed in Step 2 together with the derived features in Step 3 as a whole, and all records are normalized; that is, the real-valued matrix formed by the fields in the index dimensions is vectorized to obtain the real-number feature matrix T.
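The min-max rule and its column-wise application to a record matrix can be sketched as below. Mapping a constant column to 0.0 is an assumption added to avoid division by zero; the text does not address that case.

```python
def min_max_normalize(column):
    """Apply W* = (W - Wmin) / (Wmax - Wmin) to one column of values."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumption: a constant column maps to 0.0
    return [(w - lo) / (hi - lo) for w in column]

def normalize_matrix(rows):
    """Normalize each column of a list of equal-length numeric records."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

# Each row is one record: [number of CPUs, garbage collection count] (illustrative values).
rows = [[2, 100], [4, 300], [8, 200]]
T = normalize_matrix(rows)
```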
Step 4: Based on the standard concepts of the JVM garbage collection mechanism, the GC types commonly produced by the current system are defined and labeled. Because there is more than one GC type, the preliminarily screened negative analysis objects are further refined and judged by GC type: the greater a GC type's impact on performance, the higher its assigned value. The specific standardization is shown in Table 2:
TABLE 2
GC type                           Standardized value
Young-generation GC (Minor GC)    1
Old-generation GC (Major GC)      2
Global GC (Full GC)               3
The derived features are then judged under each GC type. Different base thresholds are set according to the daily occurrence frequency of each GC type; a test ID that may have a performance problem is marked 1, and otherwise 0. Taking Full GC as an example, the rules are as follows:
Rule 1: if (occurrence frequency of Full GC >= 'base threshold 1') -> if yes, assign 1; if no, assign 0;
Rule 2: if (rate of increase of the Full GC count >= 'base threshold 2') -> if yes, assign 1; if no, assign 0;
Rule 3: if (rate of increase of Full GC time >= 'base threshold 3') -> if yes, assign 1; if no, assign 0;
Rule 4: if (rate of increase of old-generation heap memory within the life cycle >= 'base threshold 4') -> if yes, assign 1; if no, assign 0;
Across all GC types, each GC type corresponds to four judgment rules, and a discrimination value matrix is finally constructed for the target test ID, that is, G = {rule 1, rule 2, rule 3, ……}.
The abnormality judgment for each test is then output. When a test ID satisfies none of the rules, it is judged to have no abnormality; when it satisfies one or more rules, the test is judged abnormal, and the more rules it satisfies, the more severe the abnormality. The specific judgments are shown in Table 3:
TABLE 3
Test ID    Discrimination value matrix    Judgment result
Test1      G1={0,0,0,…}                   0
Test2      G2={1,0,0,…}                   1
Test3      G3={1,1,0,…}                   2
Test4      G4={1,0,1,…}                   2
Test5      G5={1,1,1,…}                   3
……         ……                             ……
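The four Full GC rules and the resulting judgment can be sketched as follows. The sketch assumes that the judgment result is the number of satisfied rules, which is consistent with the rows shown in Table 3; the metric names and base threshold values are hypothetical.

```python
RULE_KEYS = ("gc_frequency", "gc_count_rate", "gc_time_rate", "old_gen_rate")

def discrimination_vector(metrics, base_thresholds):
    """Apply rules 1 to 4: 1 if the metric reaches its base threshold, else 0."""
    return [1 if metrics[key] >= base_thresholds[key] else 0 for key in RULE_KEYS]

def judgment(vector):
    """Count the satisfied rules; 0 means no abnormality was detected."""
    return sum(vector)

# Hypothetical base thresholds 1 to 4 for Full GC.
full_gc_thresholds = {"gc_frequency": 3, "gc_count_rate": 0.5,
                      "gc_time_rate": 0.05, "old_gen_rate": 4.0}
test_a = {"gc_frequency": 5, "gc_count_rate": 0.1, "gc_time_rate": 0.01, "old_gen_rate": 1.0}
test_b = {"gc_frequency": 5, "gc_count_rate": 0.9, "gc_time_rate": 0.20, "old_gen_rate": 1.0}

g_a = discrimination_vector(test_a, full_gc_thresholds)
g_b = discrimination_vector(test_b, full_gc_thresholds)
results = {"test_a": judgment(g_a), "test_b": judgment(g_b)}
```

Repeating the same call with the thresholds of each GC type and concatenating the vectors would give a full discrimination value matrix G for one test ID.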
Acquiring the derived feature sample data comprises:
acquiring the change in old-generation heap memory, the change in the garbage collection count, and the change in garbage collection time;
calculating the ratio of each of these changes to a preset monitoring period, where the preset monitoring period can be set according to actual conditions and the ratios are calculated by the following formulas:
rate of increase of heap memory within the life cycle = change in old-generation heap memory / preset monitoring period;
rate of increase of the garbage collection count = change in the garbage collection count / preset monitoring period;
rate of increase of garbage collection time = change in garbage collection time / preset monitoring period;
and discretizing, normalizing, and vectorizing each ratio in sequence, and labeling the results, to obtain the derived feature sample data. See the description above; the details are not repeated here.
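The three ratios can be computed from two snapshots taken one monitoring period apart. This is a minimal sketch: the snapshot field names (`old_gen_mb`, `gc_count`, `gc_time_s`) and the 60-second period are illustrative assumptions.

```python
def derived_features(first, last, period_seconds):
    """Compute the three rates of increase as (change in value) / (monitoring period)."""
    if period_seconds <= 0:
        raise ValueError("the monitoring period must be positive")
    return {
        "old_gen_rate": (last["old_gen_mb"] - first["old_gen_mb"]) / period_seconds,
        "gc_count_rate": (last["gc_count"] - first["gc_count"]) / period_seconds,
        "gc_time_rate": (last["gc_time_s"] - first["gc_time_s"]) / period_seconds,
    }

# Two hypothetical JVM snapshots taken 60 seconds apart.
start = {"old_gen_mb": 400.0, "gc_count": 10, "gc_time_s": 1.0}
end = {"old_gen_mb": 700.0, "gc_count": 40, "gc_time_s": 4.0}
rates = derived_features(start, end, period_seconds=60)
```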
Step 5: The decision tree algorithm model is a distributed gradient boosting framework, namely the Light Gradient Boosting Machine (LightGBM).
The model feature index system data from Step 2 is combined with the discrimination rules of Step 4 (Table 2) to obtain the initial feature matrix of the model to be trained.
The real-number feature matrix T from Step 3 and the discrimination value matrix G from Step 4 are combined as the input of the model to be trained.
Correspondingly, training the decision tree algorithm model on the multi-dimensional sample data and the derived feature sample data comprises:
initializing the training parameters of the distributed gradient boosting framework, that is, defining a basic XGBoost model and initializing its general parameters, booster parameters, and learning task parameters;
and adjusting the training parameters and repeatedly training the distributed gradient boosting framework until a tree depth, leaf node sample weight, and learning weight that avoid overfitting are obtained. In other words, the model is trained repeatedly with adjusted parameters until the optimal settings for the tree depth, leaf node sample weight, learning weight, and similar parameters are determined, thereby optimizing the model.
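The parameter adjustment loop can be organized as a search over candidate settings. The sketch below only enumerates the candidates; training and scoring each one is left out. The parameter names `max_depth`, `min_sum_hessian_in_leaf`, and `learning_rate` are LightGBM parameter names, chosen here on the assumption that they correspond to the tree depth, leaf node sample weight, and learning weight mentioned in the text, and the candidate values are illustrative only.

```python
from itertools import product

# Hypothetical search space; the values are not taken from the patent.
search_space = {
    "max_depth": [4, 6, 8],
    "min_sum_hessian_in_leaf": [1e-3, 1e-2],
    "learning_rate": [0.05, 0.1],
}

def candidate_settings(space):
    """Yield every combination of the listed parameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(candidate_settings(search_space))
```

Each candidate would be used for one training run, and the setting with the best validation score and no sign of overfitting would be retained.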
The whole training process can be implemented as follows:
a. Define a basic XGBoost model and initialize its general parameters, booster parameters, and learning task parameters;
b. Find the optimal split points using the histogram algorithm, so that the number of candidate split points is a constant;
c. Replace random sampling with one-side sampling: samples with small absolute gradient values are sampled at a certain ratio, while samples with large absolute gradient values are all retained;
d. Bundle mutually exclusive features, that is, features that do not take values at the same time, to reduce the dimensionality of the data features;
e. Train the model repeatedly with adjusted parameters until the optimal settings for the tree depth, leaf node sample weight, learning weight, and similar parameters that avoid overfitting are determined, thereby optimizing the model;
f. Verify the generalization ability of the model using both a non-cross-validation mode and a cross-validation mode to obtain the final optimal trained LightGBM model.
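Step c corresponds to what LightGBM calls gradient-based one-side sampling (GOSS). The sketch below illustrates the idea in plain Python rather than LightGBM's internal implementation; the function name `goss_sample` and the rates 0.2 and 0.1 are assumptions for the example.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    n = len(gradients)
    # Order sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top = order[:top_n]        # large-gradient samples are always kept
    rest = order[top_n:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))  # subsample small-gradient samples
    # Amplify the sampled small-gradient samples so their total contribution
    # stays comparable to that of the full small-gradient set.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return top + sampled, weights

grads = [0.05, -0.9, 0.1, 0.8, -0.02, 0.3, -0.01, 0.2, 0.04, -0.15]
indices, weights = goss_sample(grads)
```

With ten samples, the two with the largest absolute gradients are retained and one of the remaining eight is drawn at random with an amplified weight.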
Step 6, the final monitoring analysis and abnormal condition feedback early warning are mainly divided into the following steps:
first, the analysis is aided. According to the model constructed in the step 5, the larger the feature weight value of a single feature is, the larger the influence thereof is, so that the feature weights affecting the degree of abnormal conditions (the feature weight value of each feature is one of the contents of the output results of the model training) are ranked, the weights are marked as TOP = { (H1: weight (H1)), H2: weight (H2)), …, hn: weight (Hn)) } from large to small, and the front Top5 with the largest weight value is selected as a key index set REC = { H1, H2, H3, H4, H5}, which assists the subsequent manual analysis.
Second, monitoring and abnormality early warning. The latest test data and the related JVM data in the system are obtained according to the target test content and evaluated with the trained abnormal-condition early-warning model to judge whether the new test data carries a risk of OOM (out of memory) or another abnormal condition. The judgment result is output together with the weight values of the indexes in the key index set REC, helping technicians take emergency measures in time, for example by adjusting the memory size or the application system's configuration parameters, so that performance risks and system abnormalities are avoided in time.
Step 7: continuous optimization. New training data is added to the training data set according to the model feedback results, so that the trained model is optimized iteratively.
As shown in fig. 3, the method according to the embodiment of the present invention may be implemented in modular form, specifically including:
A data acquisition module: used for obtaining multi-dimensional data and derived features reflecting system performance risks.
A data feature system construction module: used for constructing the model feature index system data.
A feature engineering processing module: used for obtaining multi-dimensional sample data and derived feature sample data through feature engineering processing.
A JVM monitoring and abnormality early-warning model training module: used for training the decision tree algorithm model to obtain the preset monitoring model.
A result feedback and model optimization module: used for optimizing the data in the model training set according to the output of the model application.
The data monitoring processing method provided by the embodiment of the invention obtains multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle; monitoring the multidimensional data and the derivative characteristics based on a preset monitoring model to obtain a monitoring result; the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived feature sample data, so that data can be monitored comprehensively and accurately, and system risk prevention and control can be performed timely.
Further, acquiring the multi-dimensional sample data includes:
acquiring initial multi-dimensional data, and performing data cleaning on data fields of the initial multi-dimensional data to obtain model characteristic index system data; the model characteristic index system data comprises application system environment index system data, virtual machine basic parameter index system data and test log information index system data; reference is made to the above description and no further description is given.
Performing data cleaning on the data field content of the model characteristic index system data to obtain test abnormal data in a test period; reference is made to the above description and no further description is given.
And carrying out discretization, normalization and vectorization treatment on the abnormal test data in sequence, and carrying out discrimination marking to obtain the multi-dimensional sample data. Reference is made to the above description and no further description is made.
According to the data monitoring processing method provided by the embodiment of the invention, data dimension reduction is realized through data processing, and the model training efficiency can be improved.
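The discretization, normalization, vectorization and labeling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names `heap_mb`, `gc_type` and `oom`, the bin edges, and the way the three transformations are combined per field are all hypothetical, since the specification does not fix them.

```python
def discretize(value, edges):
    """Return the index of the bin that value falls into (edges ascending)."""
    return sum(value >= e for e in edges)


def min_max_normalize(column):
    """Scale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(value, categories):
    """Vectorize a categorical value as a one-hot list."""
    return [1.0 if value == c else 0.0 for c in categories]


def build_samples(records, heap_edges, gc_types):
    """Turn raw records into (feature_vector, label) pairs.

    Each vector holds the heap-size bin, the normalized heap size and a
    one-hot encoding of the collector type. The label is 1 for a record
    marked abnormal (here: an OOM occurred) and 0 otherwise.
    """
    heap_norm = min_max_normalize([r["heap_mb"] for r in records])
    samples = []
    for record, norm in zip(records, heap_norm):
        vector = [float(discretize(record["heap_mb"], heap_edges)), norm]
        vector += one_hot(record["gc_type"], gc_types)
        samples.append((vector, 1 if record["oom"] else 0))
    return samples
```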
Further, acquiring the derived feature sample data includes:
acquiring the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption; reference is made to the above description and no further description is given.
Respectively calculating the ratio of each of the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption to a preset monitoring period; reference is made to the above description and no further description is made.
And carrying out discretization, normalization and vectorization treatment on each ratio result in sequence, and carrying out discrimination marking to obtain the derived characteristic sample data. Reference is made to the above description and no further description is made.
According to the data monitoring processing method provided by the embodiment of the invention, data dimension reduction is realized through data processing, and the model training efficiency can be improved.
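The three derived features described above are each a change divided by the preset monitoring period. A minimal sketch, assuming two JVM snapshots taken at the start and end of the period; the class `JvmSnapshot`, its field names and the units are hypothetical and not taken from the specification:

```python
from dataclasses import dataclass


@dataclass
class JvmSnapshot:
    """One reading of the JVM counters (illustrative fields and units)."""
    old_gen_used_mb: float  # old-generation heap in use
    gc_count: int           # cumulative number of garbage collections
    gc_time_ms: float       # cumulative time spent in garbage collection


def derived_features(start, end, period_s):
    """Return the rate of increase of each counter over the monitoring period."""
    if period_s <= 0:
        raise ValueError("monitoring period must be positive")
    return {
        "old_gen_growth_mb_per_s":
            (end.old_gen_used_mb - start.old_gen_used_mb) / period_s,
        "gc_count_growth_per_s":
            (end.gc_count - start.gc_count) / period_s,
        "gc_time_growth_ms_per_s":
            (end.gc_time_ms - start.gc_time_ms) / period_s,
    }
```

A steadily positive old-generation growth rate combined with rising collection counts and collection time is the pattern the derived features are meant to expose.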
Further, the decision tree algorithm model is a distributed gradient lifting framework; correspondingly, training a decision tree algorithm model according to the multi-dimensional sample data and the derived feature sample data comprises the following steps:
initializing and setting training parameters of a distributed gradient lifting frame; reference is made to the above description and no further description is made.
And adjusting the training parameters, and repeatedly training the distributed gradient lifting frame until the tree depth, the leaf node sample weight and the learning weight which avoid over-fitting are obtained. Reference is made to the above description and no further description is made.
The data monitoring and processing method provided by the embodiment of the invention can avoid overfitting of the model.
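The repeated training with adjusted parameters can be pictured as a search over candidate settings. The sketch below is an assumption-laden illustration: it uses an exhaustive grid search, which the specification does not prescribe, and `toy_score` is a stand-in for training the gradient boosting model with the given parameters and returning a validation score. The parameter names `max_depth`, `min_child_weight` and `learning_rate` are chosen here to mirror the tree depth, leaf-node sample weight and learning weight mentioned above.

```python
from itertools import product


def grid_search(train_and_score, grid):
    """Try every parameter combination and keep the best-scoring one.

    train_and_score: callable taking a dict of parameters and returning a
    validation score where higher is better (hypothetical interface).
    grid: dict mapping parameter name to a list of candidate values.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in scorer: peaks at depth 6, learning rate 0.1, child weight 1."""
    return -(abs(params["max_depth"] - 6)
             + 10 * abs(params["learning_rate"] - 0.1)
             + abs(params["min_child_weight"] - 1))
```

In practice the scorer would be computed on held-out data, so that the selected settings are the ones that generalize rather than the ones that fit the training set most closely.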
Further, the data monitoring processing method further includes:
after the distributed gradient lifting frame is repeatedly trained, the generalization capability of the distributed gradient lifting frame is checked by adopting a non-cross validation mode and a cross validation mode, and the preset monitoring model is obtained. Reference is made to the above description and no further description is made.
The data monitoring and processing method provided by the embodiment of the invention can improve the generalization capability of the model.
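The two verification modes can be sketched as index-splitting helpers. This assumes that the "non-cross validation mode" means a single hold-out split, which is an interpretation and is not stated in the specification; the function names and default fractions are likewise illustrative.

```python
import random


def holdout_split(n, test_fraction=0.2, seed=0):
    """Single hold-out split: return (train_indices, test_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation
```

A model whose score under the hold-out split is much better than its average score across the folds would be a sign that the single split was unrepresentative.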
Further, the data monitoring processing method further includes:
if the monitoring result is determined to be an abnormal monitoring result, acquiring each characteristic weight value of the preset monitoring model; reference is made to the above description and no further description is made.
Arranging the feature weight values in descending order and extracting the top k feature weight values. Reference is made to the above description and no further description is made.
The data monitoring processing method provided by the embodiment of the invention is convenient for a user to analyze the influence of the model characteristics on the abnormal monitoring result.
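The ranking and top-k extraction described above can be sketched in a few lines. The feature names in the usage example are hypothetical placeholders; the specification only names the features abstractly as H1 to Hn.

```python
def top_k_features(weights, k=5):
    """Rank features by weight and return (ranked_pairs, key_index_set).

    weights: dict mapping feature name to its feature weight value.
    ranked_pairs corresponds to TOP (largest weight first) and
    key_index_set to REC (the names of the k largest).
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked, [name for name, _ in ranked[:k]]


# Usage with made-up weights for illustration only.
example_weights = {
    "old_gen_growth": 0.31,
    "gc_time_growth": 0.22,
    "gc_count_growth": 0.18,
    "max_heap_size": 0.12,
    "thread_count": 0.09,
    "cpu_cores": 0.05,
    "os_version": 0.03,
}
top, rec = top_k_features(example_weights, k=5)
```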
Further, the data monitoring processing method further includes:
and updating training data in a training data set according to the monitoring result, wherein the training data comprises the multi-dimensional sample data and the derived feature sample data. Reference is made to the above description and no further description is given.
According to the data monitoring processing method provided by the embodiment of the invention, the accuracy of model monitoring can be improved by updating the training data set.
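Updating the training data set from monitoring results can be sketched as an append step. The de-duplication of identical feature vectors is an assumption added for the illustration and is not described in the specification; the function name and the (vector, label) layout are also hypothetical.

```python
def update_training_set(training_set, new_samples):
    """Append newly monitored (feature_vector, label) pairs to the training set.

    Vectors already present in the training set are skipped. Returns the
    number of samples actually added, which a caller could use to decide
    when retraining is worthwhile.
    """
    seen = {tuple(vector) for vector, _ in training_set}
    added = 0
    for vector, label in new_samples:
        key = tuple(vector)
        if key not in seen:
            training_set.append((list(vector), label))
            seen.add(key)
            added += 1
    return added
```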
It should be noted that the data monitoring and processing method provided in the embodiment of the present invention may be used in the financial field, and may also be used in any technical field other than the financial field.
Fig. 4 is a schematic structural diagram of a data monitoring processing apparatus according to an embodiment of the present invention, and as shown in fig. 4, the data monitoring processing apparatus according to the embodiment of the present invention includes an obtaining unit 401 and a monitoring unit 402, where:
the obtaining unit 401 is configured to obtain multidimensional data and derivative features that reflect system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle; the monitoring unit 402 is configured to monitor the multidimensional data and the derived features based on a preset monitoring model to obtain a monitoring result; and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
Specifically, the obtaining unit 401 in the apparatus is configured to obtain multidimensional data and derivative features reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle; the monitoring unit 402 is configured to monitor the multidimensional data and the derived features based on a preset monitoring model to obtain a monitoring result; and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
The data monitoring and processing device provided by the embodiment of the invention obtains multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle; monitoring the multidimensional data and the derivative characteristics based on a preset monitoring model to obtain a monitoring result; the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived feature sample data, so that data can be monitored comprehensively and accurately, and system risk prevention and control can be performed timely.
Further, the data monitoring and processing device is further configured to:
acquiring initial multi-dimensional data, and performing data cleaning on data fields of the initial multi-dimensional data to obtain model characteristic index system data; the model characteristic index system data comprises application system environment index system data, virtual machine basic parameter index system data and test log information index system data;
carrying out data cleaning on the data field content of the model characteristic index system data to obtain test abnormal data in a test period;
and carrying out discretization, normalization and vectorization treatment on the abnormal test data in sequence, and carrying out discrimination marking to obtain the multi-dimensional sample data.
The data monitoring and processing device provided by the embodiment of the invention realizes data dimension reduction through data processing, and can improve the model training efficiency.
Further, the data monitoring and processing device is further configured to:
acquiring the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption;
respectively calculating the ratio of each of the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption to a preset monitoring period;
and carrying out discretization, normalization and vectorization treatment on each ratio result in sequence, and carrying out discrimination marking to obtain the derived characteristic sample data.
The data monitoring and processing device provided by the embodiment of the invention realizes data dimension reduction through data processing, and can improve the model training efficiency.
Further, the decision tree algorithm model is a distributed gradient lifting framework; correspondingly, the data monitoring and processing device is further used for:
initializing and setting training parameters of a distributed gradient lifting frame;
and adjusting the training parameters, and repeatedly training the distributed gradient lifting frame until the tree depth, the leaf node sample weight and the learning weight which avoid over-fitting are obtained.
The data monitoring and processing device provided by the embodiment of the invention can avoid model overfitting.
Further, the data monitoring processing device is further configured to:
after the distributed gradient lifting frame is repeatedly trained, the generalization capability of the distributed gradient lifting frame is checked by adopting a non-cross validation mode and a cross validation mode, and the preset monitoring model is obtained.
The data monitoring and processing device provided by the embodiment of the invention can improve the generalization capability of the model.
Further, the data monitoring and processing device is further configured to:
if the monitoring result is determined to be an abnormal monitoring result, acquiring each characteristic weight value of the preset monitoring model;
and arranging the feature weight values in descending order and extracting the top k feature weight values.
The data monitoring and processing device provided by the embodiment of the invention is convenient for a user to analyze the influence of the model characteristics on the abnormal monitoring result.
Further, the data monitoring and processing device is further configured to:
and updating training data in a training data set according to the monitoring result, wherein the training data comprises the multi-dimensional sample data and the derived feature sample data.
The data monitoring processing device provided by the embodiment of the invention can improve the accuracy of model monitoring by updating the training data set.
The data monitoring processing apparatus provided in the embodiment of the present invention may be specifically configured to execute the processing flows of the above method embodiments. Its functions are not described again here; refer to the detailed description of the above method embodiments.
Fig. 5 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes: a processor 501, a memory 502, and a bus 503;
the processor 501 and the memory 502 complete communication with each other through a bus 503;
the processor 501 is configured to call program instructions in the memory 502 to perform the methods provided by the above-mentioned method embodiments, for example, including:
acquiring multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle;
monitoring the multidimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, including:
acquiring multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle;
monitoring the multidimensional data and the derived features based on a preset monitoring model to obtain a monitoring result;
and the preset monitoring model is obtained by training a decision tree algorithm model according to the multi-dimensional sample data and the derived characteristic sample data.
The present embodiment provides a computer-readable storage medium, which stores a computer program, where the computer program causes the computer to execute the method provided by the above method embodiments, for example, the method includes:
acquiring multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derived characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle;
monitoring the multidimensional data and the derivative characteristics based on a preset monitoring model to obtain a monitoring result;
and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the description of this specification, reference to the terms "one embodiment," "a specific embodiment," "some embodiments," "for example," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A data monitoring processing method is characterized by comprising the following steps:
acquiring multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises basic environment information of an application system, configuration information of a virtual machine and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle;
monitoring the multidimensional data and the derivative characteristics based on a preset monitoring model to obtain a monitoring result;
and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
2. The data monitoring processing method of claim 1, wherein obtaining the multi-dimensional sample data comprises:
acquiring initial multi-dimensional data, and performing data cleaning on data fields of the initial multi-dimensional data to obtain model characteristic index system data; the model characteristic index system data comprises application system environment index system data, virtual machine basic parameter index system data and test log information index system data;
carrying out data cleaning on the data field content of the model characteristic index system data to obtain test abnormal data in a test period;
and carrying out discretization, normalization and vectorization treatment on the abnormal test data in sequence, and carrying out discrimination marking to obtain the multi-dimensional sample data.
3. The data monitoring processing method of claim 1, wherein obtaining the derived feature sample data comprises:
acquiring the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption;
respectively calculating the ratio of each of the change in old-generation heap memory, the change in the number of garbage collections and the change in garbage-collection time consumption to a preset monitoring period;
and carrying out discretization, normalization and vectorization treatment on each ratio result in sequence, and carrying out discrimination marking to obtain the derived characteristic sample data.
4. The data monitoring processing method according to any one of claims 1 to 3, wherein the decision tree algorithm model is a distributed gradient boosting framework; correspondingly, training a decision tree algorithm model according to the multi-dimensional sample data and the derived feature sample data comprises the following steps:
initializing and setting training parameters of a distributed gradient lifting frame;
and adjusting the training parameters, and repeatedly training the distributed gradient lifting frame until the tree depth, the leaf node sample weight and the learning weight which avoid over-fitting are obtained.
5. The data monitoring processing method of claim 4, further comprising:
after the distributed gradient lifting frame is repeatedly trained, the generalization capability of the distributed gradient lifting frame is checked by adopting a non-cross validation mode and a cross validation mode, and the preset monitoring model is obtained.
6. The data monitoring processing method of claim 1, further comprising:
if the monitoring result is determined to be an abnormal monitoring result, acquiring each characteristic weight value of the preset monitoring model;
and arranging the feature weight values in descending order and extracting the top k feature weight values.
7. The data monitoring processing method of claim 1, further comprising:
and updating training data in a training data set according to the monitoring result, wherein the training data comprises the multi-dimensional sample data and the derived feature sample data.
8. A data monitoring processing apparatus, comprising:
the acquisition unit is used for acquiring multi-dimensional data and derivative characteristics reflecting system performance risks; the multi-dimensional data comprises application system basic environment information, virtual machine configuration information and test log information; the derivation characteristics are the rising speed of the heap memory, the rising speed of the garbage recovery times and the rising speed of the garbage recovery time consumption in the life cycle;
the monitoring unit is used for monitoring the multi-dimensional data and the derivative characteristics based on a preset monitoring model to obtain a monitoring result;
and the preset monitoring model is obtained by training a decision tree algorithm model according to multi-dimensional sample data and derived characteristic sample data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210815893.XA 2022-07-12 2022-07-12 Data monitoring processing method and device Pending CN115686995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210815893.XA CN115686995A (en) 2022-07-12 2022-07-12 Data monitoring processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210815893.XA CN115686995A (en) 2022-07-12 2022-07-12 Data monitoring processing method and device

Publications (1)

Publication Number Publication Date
CN115686995A true CN115686995A (en) 2023-02-03

Family

ID=85061625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210815893.XA Pending CN115686995A (en) 2022-07-12 2022-07-12 Data monitoring processing method and device

Country Status (1)

Country Link
CN (1) CN115686995A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117369954A (en) * 2023-12-08 2024-01-09 成都乐超人科技有限公司 JVM optimization method and device of risk processing framework for big data construction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117369954A (en) * 2023-12-08 2024-01-09 成都乐超人科技有限公司 JVM optimization method and device of risk processing framework for big data construction
CN117369954B (en) * 2023-12-08 2024-03-05 成都乐超人科技有限公司 JVM optimization method and device of risk processing framework for big data construction

Similar Documents

Publication Publication Date Title
CN111782472B (en) System abnormality detection method, device, equipment and storage medium
EP4195112A1 (en) Systems and methods for enriching modeling tools and infrastructure with semantics
JP6758368B2 (en) Data discovery node
Liao et al. Gunther: Search-based auto-tuning of mapreduce
CN110688288A (en) Automatic testing method, device, equipment and storage medium based on artificial intelligence
CN110910982A (en) Self-coding model training method, device, equipment and storage medium
CN113779272B (en) Knowledge graph-based data processing method, device, equipment and storage medium
EP3836041A1 (en) Interpretation of machine learning results using feature analysis
CN111090579B (en) Software defect prediction method based on Pearson correlation weighting association classification rule
Staniak et al. The landscape of R packages for automated exploratory data analysis
CN110674211B (en) Automatic analysis method and device for AWR report of Oracle database
US20220076157A1 (en) Data analysis system using artificial intelligence
CN113221960A (en) Construction method and collection method of high-quality vulnerability data collection model
CN115705501A (en) Hyper-parametric spatial optimization of machine learning data processing pipeline
CN115686995A (en) Data monitoring processing method and device
Suleman et al. Google play store app ranking prediction using machine learning algorithm
KR102345410B1 (en) Big data intelligent collecting method and device
CN113743461B (en) Unmanned aerial vehicle cluster health degree assessment method and device
Vaz et al. On creation of synthetic samples from gans for fake news identification algorithms
CN112732549B (en) Test program classification method based on cluster analysis
CN114610590A (en) Method, device and equipment for determining operation time length and storage medium
JP2021152751A (en) Analysis support device and analysis support method
Shin Case study: Real-world machine learning application for hardware failure detection.
CN111753992A (en) Screening method and screening system
Zaim et al. Software Defect Prediction Framework Using Hybrid Software Metric

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination