CN117273552A

CN117273552A - Big data intelligent governance decision-making method and system based on machine learning

Info

Publication number: CN117273552A
Application number: CN202311558280.3A
Authority: CN
Inventors: 苗敬峰; 胥继云; 夏敏; 周芳; 张新军; 张迪
Original assignee: Shandong Shunguo Electronic Technology Co ltd
Current assignee: Shandong Shunguo Electronic Technology Co ltd
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2023-12-22
Anticipated expiration: 2043-11-22
Also published as: CN117273552B

Abstract

The invention discloses a machine-learning-based big data intelligent governance decision-making method and system, relating to the technical field of data governance. The method comprises: constructing a data evaluation index system; determining an automated quality measurement tool; acquiring historical governance data from a big data database; obtaining a plurality of training data quality index values; constructing a data quality risk prediction probability model; obtaining a plurality of data quality index values of the data to be governed; obtaining a total quality index value of the data to be governed; obtaining a data quality risk index of the data to be governed; and judging whether the data quality risk index of the data to be governed is greater than a preset value, screening out the normal data, and taking all the normal data as the data basis for the big data intelligent governance decision. The advantages of the invention are that it effectively addresses the existing problem of normal data being falsely reported as abnormal or genuinely abnormal data going unreported, thereby improving data quality and ensuring the accuracy of big data intelligent governance decisions.

Description

Big data intelligent governance decision-making method and system based on machine learning

Technical Field

The invention relates to the technical field of data governance, and in particular to a machine-learning-based big data intelligent governance decision-making method and system.

Background

Data quality refers to the degree to which the characteristics and attributes of data meet particular needs and expectations. Data quality has become critical in the modern information age: accurate, reliable, high-quality data is of great importance to individuals, businesses, and society. High-quality data is the basis for making sound decisions; it can reduce operating and maintenance costs, help establish and maintain strong customer relationships, and serve as the basis on which researchers and analysts identify trends, make predictions, and support decisions. In short, maintaining high data quality is essential.

In existing big data governance decision processes, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift, and similar factors cause normal data to be falsely reported as abnormal or genuinely abnormal data to go unreported. As a result, big data intelligent governance decisions are difficult to support with high-quality data, and the final decision strategy is difficult to align closely with the trends in the big data.

Disclosure of Invention

To solve the above technical problem, this technical solution provides a machine-learning-based big data intelligent governance decision-making method and system. It addresses the problem that, in existing big data governance decision processes, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift, and similar factors cause normal data to be falsely reported as abnormal or genuinely abnormal data to go unreported, so that big data intelligent governance decisions are difficult to support with high-quality data and the final decision strategy is difficult to align closely with the trends in the big data.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A big data intelligent governance decision-making method based on machine learning comprises the following steps:

constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes comprising one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and security;

determining an automated quality measurement tool for assigning measured values to the data quality indexes, the automated quality measurement tool being one or more of Trifacta, OpenRefine or DataWrangler;

acquiring historical governance data from a big data database;

measuring and assigning values to the data quality indexes of the historical governance data using the automated quality measurement tool to obtain a plurality of training data quality index values;

performing a Monte Carlo simulation on the training data quality index values to construct a data quality risk prediction probability model;

measuring and assigning values to the data quality indexes of the data to be governed using the automated quality measurement tool to obtain a plurality of data quality index values of the data to be governed;

summing the plurality of data quality index values of the data to be governed to obtain a total quality index value of the data to be governed;

substituting the total quality index value of the data to be governed into the data quality risk prediction probability model to obtain a data quality risk index of the data to be governed;

judging whether the data quality risk index of the data to be governed is greater than a preset value; if so, the data to be governed is judged to be abnormal data, and if not, the data to be governed is judged to be normal data;

and taking all normal data as the data basis for the big data intelligent governance decision.
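The last four steps above (summing, substituting into the model, thresholding, and screening) can be illustrated with a minimal Python sketch. The function names, the record layout, and the `toy_model` callable are hypothetical; `toy_model` merely stands in for the fitted data quality risk prediction probability model, which the text does not give in closed form.

```python
from typing import Callable, Dict, List, Sequence

# A risk model maps a total quality index value to a risk index.
RiskModel = Callable[[float], float]


def risk_index(index_values: Sequence[float], model: RiskModel) -> float:
    """Sum the data quality index values and substitute the total into the model."""
    return model(sum(index_values))


def screen_normal(records: Dict[str, Sequence[float]], model: RiskModel,
                  preset_value: float) -> List[str]:
    """Return the ids of records whose risk index is not greater than the preset value."""
    return [rid for rid, values in records.items()
            if not risk_index(values, model) > preset_value]


def toy_model(total: float) -> float:
    """ASSUMPTION: hypothetical stand-in model that clamps total / 3 into [0, 1]."""
    return min(1.0, max(0.0, total / 3.0))


# Hypothetical records: one list of data quality index values per record id.
records = {"r1": [0.9, 0.8, 0.7], "r2": [0.2, 0.3, 0.1]}
normal_ids = screen_normal(records, toy_model, preset_value=0.5)
```

With these toy inputs, only the records kept in `normal_ids` would be passed on as the data basis for the governance decision; the preset value of 0.5 is likewise an illustrative choice.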

Preferably, performing the Monte Carlo simulation on the training data quality index values specifically comprises:

determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;

calculating the arithmetic average value of all the training data quality index values;

calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;

constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes a set probability value as input and takes a random variable value of the data quality index as output;

setting a data training index;

traversing the interval from 0 to 1, using the set data training index as the step size, to obtain a plurality of training probability values;

substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;

randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;

summing all random variable values in each random variable value group to obtain a plurality of random variable total values;

and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
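The simulation steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names and sample data are hypothetical, the endpoints 0 and 1 are excluded from the probability grid because the inverse normal CDF is undefined there, and the population standard deviation (`statistics.pstdev`) is used as a stand-in for the anchor-based standard deviation formula, which is not reproduced in this text.

```python
import random
from statistics import NormalDist, fmean, pstdev
from typing import Dict, List, Sequence


def probability_grid(step: float) -> List[float]:
    """Traverse the interval 0-1 with the given step, excluding both endpoints."""
    n = round(1.0 / step)
    return [k / n for k in range(1, n)]


def simulate_totals(training: Dict[str, Sequence[float]], step: float = 0.01,
                    n_groups: int = 1000, seed: int = 0) -> List[float]:
    """Return random variable total values, one per randomly combined group."""
    rng = random.Random(seed)
    grid = probability_grid(step)
    candidates = []
    for values in training.values():
        # ASSUMPTION: pstdev replaces the anchor-based standard deviation formula.
        dist = NormalDist(fmean(values), pstdev(values))
        # One random variable value per training probability value.
        candidates.append([dist.inv_cdf(p) for p in grid])
    # Randomly combine one value per data quality index, then sum each group.
    return [sum(rng.choice(vals) for vals in candidates) for _ in range(n_groups)]


# Hypothetical training data quality index values for two indexes.
training = {"accuracy": [0.9, 0.8, 0.95, 0.85], "timeliness": [0.7, 0.6, 0.8, 0.75]}
totals = simulate_totals(training, step=0.05, n_groups=500, seed=1)
```

The resulting `totals` list is the input to the statistical analysis step that produces the probability model.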

Preferably, the most optimistic data quality index value refers to the maximum value of training data quality index values;

the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;

the most pessimistic data quality index value refers to the minimum value of the training data quality index values.
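As a small illustration, the three anchor values defined above could be computed as follows. The function name and dictionary keys are hypothetical, and the tie-breaking rule (the first most frequent value encountered, as returned by `statistics.mode` in Python 3.8 and later) is an assumption the text does not specify.

```python
from statistics import mode
from typing import Dict, Sequence


def training_anchor(values: Sequence[float]) -> Dict[str, float]:
    """Return the most optimistic, highest-frequency and most pessimistic values."""
    return {
        "optimistic": max(values),       # maximum training value
        "most_frequent": mode(values),   # value with the highest occurrence count
        "pessimistic": min(values),      # minimum training value
    }


# Hypothetical training data quality index values for one index.
anchor = training_anchor([0.8, 0.9, 0.9, 0.7])
```

In practice the index values would likely need rounding or binning before a most frequent value is meaningful, since continuous measurements rarely repeat exactly.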

Preferably, the standard deviation of the training data quality index value is calculated according to the following formula:

[Formula image not reproduced in the source text];

where the formula gives the standard deviation of the training data quality index value in terms of the arithmetic mean of all training data quality index values, the most optimistic data quality index value, the highest-frequency data quality index value, and the most pessimistic data quality index value.

Preferably, the expression of the data quality index random variable value calculation model is:

$X = \operatorname{norminv}(p, \mu, \sigma)$;

where $X$ is the random variable value of the data quality index, $p$ is the set probability value, $\mu$ and $\sigma$ are the arithmetic mean and the standard deviation of the training data quality index values, and norminv is the inverse cumulative distribution function of the normal distribution.

Preferably, performing statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically comprises:

calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;

accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;

taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;

and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
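The four statistical analysis steps above can be sketched as follows. The function names are hypothetical, totals are rounded to two decimals so that repeated occurrences can be counted (an assumption), and piecewise-linear interpolation is used as a simple stand-in for the mathematical expression fitting, whose functional form the text does not state.

```python
from bisect import bisect_left
from collections import Counter
from typing import List, Sequence, Tuple


def risk_curve(totals: Sequence[float], ndigits: int = 2) -> Tuple[List[float], List[float]]:
    """Return (x, y): distinct total values and their cumulative probabilities."""
    counts = Counter(round(t, ndigits) for t in totals)
    n = len(totals)
    xs = sorted(counts)
    ys, running = [], 0.0
    for x in xs:
        ys.append(running)           # probability of all totals smaller than x
        running += counts[x] / n     # occurrence probability of x
    return xs, ys


def lookup_risk(total: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """ASSUMPTION: linear interpolation stands in for the fitted expression."""
    if total <= xs[0]:
        return ys[0]
    if total >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, total)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (total - x0) / (x1 - x0)


# Hypothetical random variable total values.
xs, ys = risk_curve([1.0, 2.0, 2.0, 3.0])
```

Any smooth parametric form, such as a fitted normal or logistic CDF, could replace the interpolation if a closed-form model is preferred.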

Further, a machine-learning-based big data intelligent governance decision-making system is provided for implementing the above machine-learning-based big data intelligent governance decision-making method, comprising:

the quality measurement module, which is used for determining an automated quality measurement tool, measuring and assigning values to the data quality indexes of the historical governance data using the automated quality measurement tool, and measuring and assigning values to the data quality indexes of the data to be governed using the automated quality measurement tool;

the risk model construction module, which is electrically connected with the quality measurement module and is used for performing a Monte Carlo simulation on the training data quality index values to construct the data quality risk prediction probability model;

the data risk calculation module, which is electrically connected with the quality measurement module and the risk model construction module, and is used for summing the plurality of data quality index values of the data to be governed to obtain a total quality index value of the data to be governed and substituting the total quality index value into the data quality risk prediction probability model to obtain a data quality risk index of the data to be governed;

the data analysis module, which is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be governed is greater than a preset value; if so, the data to be governed is judged to be abnormal, and if not, the data to be governed is judged to be normal.

Optionally, the risk model building module includes:

a data preprocessing unit for determining a training data anchor point based on the training data quality index values, calculating an arithmetic average value of all the training data quality index values, and calculating a standard deviation of the training data quality index values based on the arithmetic average value of all the training data quality index values and the training data anchor point;

the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value;

the training value determining unit is used for traversing the interval from 0 to 1, using the set data training index as the step size, to obtain a plurality of training probability values, and for substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;

the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;

the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;

the model fitting unit is used for: calculating the proportion of the number of occurrences of each random variable total value among the occurrences of all random variable total values, and recording it as the occurrence probability of that random variable total value; accumulating, based on the occurrence probability of each random variable total value, the occurrence probabilities of all random variable total values smaller than the current one, and recording the result as the cumulative probability of that random variable total value; taking the random variable total value as the x-axis and its cumulative probability as the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.

Compared with the prior art, the invention has the beneficial effects that:

In this scheme, a data quality risk prediction probability model is constructed by Monte Carlo simulation of historical governance data from a big data database; the data quality index values actually measured for the data to be governed are then substituted into the model to obtain a data quality risk index, and whether the data is abnormal is judged according to that index.

Drawings

FIG. 1 is a flow chart of the machine-learning-based big data intelligent governance decision-making method of this scheme;

FIG. 2 is a flow chart of the method for constructing the data quality risk prediction probability model in this scheme;

FIG. 3 is a flow chart of the method for obtaining the data quality risk prediction probability model in this scheme.

Detailed Description

The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.

Referring to fig. 1, a big data intelligent governance decision method based on machine learning includes:

constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes comprising one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and security. Different organizations and projects focus on different data quality indexes, so to ensure the credibility and validity of the data in the analysis and decision process, the data quality indexes must be determined according to the specific data requirements and business background;

determining an automated quality measurement tool for assigning measured values to the data quality indexes, the automated quality measurement tool being one or more of Trifacta, OpenRefine or DataWrangler. Different automated quality measurement tools suit different data types. For example, Apache Flink is suitable for real-time data processing and batch processing; it offers low latency, high throughput and fault tolerance, and is suited to processing real-time big data. Pandas and NumPy can be used for processing large-scale data and are particularly suitable for data analysis and cleaning;

acquiring historical governance data from a big data database;
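To make the metric-assignment step concrete, the sketch below computes two of the listed data quality indexes (uniqueness and validity) for a small batch of records in plain Python. It is a hypothetical stand-in for what an automated quality measurement tool would produce, not the API of Trifacta, OpenRefine or DataWrangler; the record fields `id` and `age` and the validity rule are invented for illustration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def uniqueness(records: List[Record], key: str) -> float:
    """Fraction of distinct key values among all records."""
    if not records:
        return 0.0
    return len({r[key] for r in records}) / len(records)


def validity(records: List[Record], field: str,
             is_valid: Callable[[Any], bool]) -> float:
    """Fraction of records whose field passes the validation rule."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_valid(r.get(field))) / len(records)


def plausible_age(value: Any) -> bool:
    """ASSUMPTION: hypothetical business rule for the illustrative 'age' field."""
    return isinstance(value, int) and 0 <= value <= 120


# Hypothetical batch of records to be measured.
batch = [
    {"id": 1, "age": 30},
    {"id": 2, "age": -5},
    {"id": 2, "age": 41},
    {"id": 3, "age": None},
]
index_values = {
    "uniqueness": uniqueness(batch, "id"),
    "validity": validity(batch, "age", plausible_age),
}
```

Applied to the historical governance data, values of this kind would serve as the training data quality index values; applied to new data, they would be the index values that are summed for scoring.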

In this scheme, a data quality risk prediction probability model is constructed by Monte Carlo simulation of historical governance data from a big data database; the data quality index values actually measured for the data to be governed are then substituted into the model to obtain a data quality risk index, and whether the data is abnormal is judged according to that index.

Referring to FIG. 2, performing the Monte Carlo simulation on the training data quality index values to construct the data quality risk prediction probability model specifically comprises:

constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes the set probability value as input and takes the random variable value of the data quality index as output;

setting a data training index;

Monte Carlo simulation is a numerical calculation method for solving problems involving complex randomness and uncertainty. Its core idea is to approximate the solution or properties of a problem using a large number of random samples; as the number of samples increases, the simulation result approaches the true value. In this scheme, a data quality index random variable value calculation model is built for the Monte Carlo simulation, taking a set probability value as input and producing a random variable value of the data quality index as output. After the set number of simulation runs, the resulting random variable values of the data quality indexes are statistically analyzed to obtain the data quality risk prediction probability model.
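The convergence property described above can be illustrated with a short Python sketch. It estimates a known quantity, the standard normal probability P(X ≤ 1), by pushing uniform random probabilities through the inverse normal CDF and counting hits. The function name, seed and sample sizes are illustrative assumptions and are not part of the scheme itself.

```python
import random
from statistics import NormalDist


def estimate_cdf_at(x: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(X <= x) for a standard normal variable X."""
    rng = random.Random(seed)
    dist = NormalDist(0.0, 1.0)
    hits = 0
    for _ in range(n_samples):
        p = max(rng.random(), 1e-12)  # keep p strictly inside (0, 1) for inv_cdf
        if dist.inv_cdf(p) <= x:      # probability in, random variable value out
            hits += 1
    return hits / n_samples


exact = NormalDist(0.0, 1.0).cdf(1.0)
# Absolute estimation error for increasing sample counts.
errors = {n: abs(estimate_cdf_at(1.0, n) - exact) for n in (100, 1_000, 10_000)}
```

The error typically shrinks in proportion to one over the square root of the sample count, although for any single seed it need not decrease at every step.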

The most optimistic data quality index value refers to the maximum value of the training data quality index values;

the most pessimistic data quality index value refers to the minimum value among the training data quality index values.

Based on the most optimistic, highest-frequency and most pessimistic data quality index values, the data quality index values can be bounded within an interval, after which the Monte Carlo simulation is performed to cover most of the outcomes that data quality measurement could produce.

The standard deviation of the training data quality index value is calculated as follows:

[Formula image not reproduced in the source text];

The expression of the data quality index random variable value calculation model is as follows:

$X = \operatorname{norminv}(p, \mu, \sigma)$;

It will be appreciated that norminv is a function commonly used in statistics and data analysis to compute the inverse cumulative distribution function of a normal distribution. Its input parameters are the probability value $p$ and the mean $\mu$ and standard deviation $\sigma$ of the normal distribution; it returns the random variable value $X$ at which the cumulative distribution function equals the given probability value $p$. This scheme obtains the random variable values of the data quality indexes using the norminv function.
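For readers who want to try the function described above, a minimal Python equivalent can be written with the standard library's `statistics.NormalDist.inv_cdf`. The wrapper name `norminv` mirrors the function named in the text; the argument names and numeric inputs are illustrative assumptions.

```python
from statistics import NormalDist


def norminv(p: float, mean: float, std: float) -> float:
    """Inverse normal CDF: the value x such that P(X <= x) = p.

    Requires 0 < p < 1 and std > 0; otherwise inv_cdf raises StatisticsError.
    """
    return NormalDist(mean, std).inv_cdf(p)


# Hypothetical mean and standard deviation of one data quality index.
x = norminv(0.3, 0.8, 0.1)
# Round trip: the CDF evaluated at x recovers the input probability.
p_back = NormalDist(0.8, 0.1).cdf(x)
```

At p = 0.5 the function returns the mean, and probabilities below 0.5 map to values below the mean, which is why traversing probabilities across the interval 0 to 1 sweeps out the plausible range of each index.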

Referring to FIG. 3, performing statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically comprises:

Based on the number of training probability values, a plurality of random variable value groups can be obtained. The values in each group are summed to obtain a plurality of random variable total values, statistical analysis is performed, and the cumulative probability of each random variable total value is calculated. The data quality risk prediction probability curve is fitted using the cumulative probabilities corresponding to the random variable total values; the more simulation runs are performed, the more accurate the resulting curve. The data quality risk prediction probability of any total quality index value within the estimated interval can then be read from the curve.

Furthermore, based on the same inventive concept as the above machine-learning-based big data intelligent governance decision-making method, this scheme also provides a machine-learning-based big data intelligent governance decision-making system, which comprises:

The risk model construction module comprises:

the data preprocessing unit is used for determining a training data anchor point based on the training data quality index value, calculating an arithmetic average value of all training data quality index values and calculating a standard deviation of the training data quality index value based on the arithmetic average value of all training data quality index values and the training data anchor point;

the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic mean value of the training data quality index value;

the model fitting unit is used for: calculating the proportion of the number of occurrences of each random variable total value among the occurrences of all random variable total values, and recording it as the occurrence probability of that random variable total value; accumulating, based on the occurrence probability of each random variable total value, the occurrence probabilities of all random variable total values smaller than the current one, and recording the result as the cumulative probability of that random variable total value; taking the random variable total value as the x-axis and its cumulative probability as the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.

In summary, the advantages of the invention are that it effectively addresses the existing problem of normal data being falsely reported as abnormal or genuinely abnormal data going unreported, so that high-quality data support is available when big data intelligent governance decisions are made and the accuracy of those decisions is ensured.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. The big data intelligent governance decision-making method based on machine learning is characterized by comprising the following steps:

acquiring historical governance data from a big data database;

2. The machine learning-based big data intelligent governance decision method of claim 1, wherein performing the Monte Carlo simulation on the training data quality index values to construct the data quality risk prediction probability model specifically comprises:

setting a data training index;

3. The machine learning based big data intelligent governance decision method of claim 2, wherein the most optimistic data quality indicator value refers to a maximum value of training data quality indicator values;

4. The machine learning-based big data intelligent governance decision method of claim 3, wherein the standard deviation of the training data quality index value is calculated according to the formula:

[Formula image not reproduced in the source text];

5. The machine learning-based big data intelligent governance decision method of claim 4, wherein the expression of the data quality index random variable value calculation model is:

[Formula image not reproduced in the source text];

6. The machine learning-based big data intelligent governance decision method of claim 5, wherein the statistical analysis based on the total value of a plurality of random variables to obtain a data quality risk prediction probability model specifically comprises:

7. A machine learning based big data intelligent governance decision system for implementing the machine learning based big data intelligent governance decision method of any of claims 1-6, comprising:

8. The machine learning based big data intelligent governance decision system of claim 7, wherein the risk model building module comprises: