CN117273552A - Big data intelligent treatment decision-making method and system based on machine learning - Google Patents
- Publication number
- CN117273552A (application number CN202311558280.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- value
- data quality
- random variable
- quality index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Engineering & Computer Science (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Theoretical Computer Science (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a big data intelligent governance decision-making method and system based on machine learning, relating to the technical field of data management. The method comprises the following steps: constructing a data evaluation index system; determining an automated quality metric tool; acquiring historical treatment data in a large database; obtaining a plurality of training data quality index values; constructing a data quality risk prediction probability model; obtaining a plurality of data quality index values of the data to be treated; obtaining a quality total index value of the data to be treated; obtaining a data quality risk index of the data to be treated; judging whether the data quality risk index of the data to be treated is greater than a preset value so as to screen out normal data; and taking all the normal data as the data basis of the big data intelligent governance decision. The invention has the following advantages: it effectively addresses the existing problem of normal data being falsely reported as abnormal or genuinely abnormal data going unreported, improves data quality, and ensures the accuracy of big data intelligent governance decisions.
Description
Technical Field
The invention relates to the technical field of data management, in particular to a big data intelligent management decision method and system based on machine learning.
Background
Data quality refers to the degree to which characteristics and attributes of data meet a particular need and desire. Data quality is becoming critical in the modern information age. Accurate, reliable, and high quality data is of great importance to individuals, businesses, and society. The high-quality data is the basis for making intelligent decisions, can reduce the operation cost and the maintenance cost, can establish and maintain a strong customer relationship, and can be used as the basis for researchers and analysts to identify trends, make predictions and support decisions. In summary, it is important to maintain the high quality characteristics of data.
In existing big data governance decision processes, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift and similar factors cause normal data to be falsely reported as abnormal or genuinely abnormal data to be missed. As a result, big data intelligent governance decisions lack high-quality data support, and the final decision strategy struggles to follow the trends in the data closely.
Disclosure of Invention
To solve the above technical problems, this technical scheme provides a big data intelligent governance decision method and system based on machine learning. It addresses the problem that, in existing big data governance decision processes, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift and similar factors cause normal data to be falsely reported as abnormal or genuinely abnormal data to be missed, so that big data intelligent governance decisions lack high-quality data support and the final decision strategy struggles to follow the trends in the data closely.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a big data intelligent governance decision-making method based on machine learning comprises the following steps:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes, and the data evaluation index system comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety;
determining an automated quality metric tool for performing metric assignment on the data quality indexes, the automated quality metric tool being one or more of Trifacta, OpenRefine, or DataWrangler;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
performing Monte Carlo simulation on the training data quality index values to construct a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
Preferably, the simulating the training data quality index value based on monte carlo specifically includes:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes a set probability value as input and takes a random variable value of the data quality index as output;
setting a data training index;
traversing the interval from 0 to 1 with the set data training index as the step size to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
Preferably, the most optimistic data quality index value refers to the maximum value of training data quality index values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value of the training data quality index values.
Preferably, the standard deviation of the training data quality index value is calculated according to the following formula:
;
In the formula, the quantities are, in order: the standard deviation of the training data quality index values, the arithmetic mean of all training data quality index values, the most optimistic data quality index value, the highest-frequency data quality index value, and the most pessimistic data quality index value.
Preferably, the expression of the data quality index random variable value calculation model is:
$X = \operatorname{norminv}(p, \mu, \sigma)$;
where $X$ is the random variable value of the data quality index, $p$ is the set probability value, $\mu$ and $\sigma$ are the arithmetic mean and the standard deviation of the training data quality index values, and norminv is the inverse cumulative distribution function of the normal distribution.
Preferably, the statistical analysis is performed based on a plurality of total random variable values, and the obtaining of the data quality risk prediction probability model specifically includes:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
Further, a big data intelligent governance decision system based on machine learning is provided for implementing the above big data intelligent governance decision-making method based on machine learning, the system comprising:
the quality measurement module is used for determining an automated quality metric tool, performing metric assignment on the data quality indexes of the historical treatment data with the automated quality metric tool, and performing metric assignment on the data quality indexes of the data to be treated with the automated quality metric tool;
the risk model construction module is electrically connected with the quality measurement module and is used for performing Monte Carlo simulation on the training data quality index values to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
Optionally, the risk model building module includes:
a data preprocessing unit for determining a training data anchor point based on the training data quality index values, calculating an arithmetic average value of all the training data quality index values, and calculating a standard deviation of the training data quality index values based on the arithmetic average value of all the training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value;
the training value determining unit is used for traversing the values in the value interval of 0-1 by taking the set data training indexes as the value interval to obtain a plurality of training probability values and substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for: calculating, for each random variable total value, the proportion of its number of occurrences to the number of occurrences of all random variable total values, recorded as the occurrence probability of that total value; accumulating the occurrence probabilities of all random variable total values smaller than the current total value, recorded as the cumulative probability of that total value; taking the random variable total value as the x-axis and its cumulative probability as the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the curve to obtain the data quality risk prediction probability model.
Compared with the prior art, the invention has the beneficial effects that:
according to the scheme, the data quality risk prediction probability model is constructed by simulating historical treatment data in a large database based on Monte Carlo, then the data quality index value of the actual data quality index to be treated is substituted into the data quality risk prediction probability model to obtain the data quality risk index, and then whether the data is abnormal or not is judged according to the data quality risk index.
Drawings
FIG. 1 is a flow chart of a big data intelligent treatment decision-making method based on machine learning according to the scheme;
FIG. 2 is a flow chart of a method for constructing a data quality risk prediction probability model in the present solution;
FIG. 3 is a flowchart of a method for obtaining a data quality risk prediction probability model in the present solution.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to FIG. 1, a big data intelligent governance decision method based on machine learning includes:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes and comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety. Different organizations and projects focus on different data quality indexes, so to ensure the credibility and validity of the data in the analysis and decision process, the data quality indexes must be determined according to the specific data requirements and business background;
determining an automated quality metric tool, which is used for performing metric assignment on the data quality indexes and is one or more of Trifacta, OpenRefine, or DataWrangler. Different tools suit different data types. For example, Apache Flink supports both real-time and batch processing and offers low latency, high throughput and fault tolerance, which makes it suitable for real-time big data; pandas and NumPy can process large-scale data and are particularly suitable for data analysis and cleaning;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
performing Monte Carlo simulation on the training data quality index values to construct a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
According to the scheme, the data quality risk prediction probability model is constructed by simulating historical treatment data in a large database based on Monte Carlo, then the data quality index value of the actual data quality index to be treated is substituted into the data quality risk prediction probability model to obtain the data quality risk index, and whether the data is abnormal or not is judged according to the data quality risk index.
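As a minimal sketch of the screening step described above, the following Python function sums the data quality index values of each record, passes the total to a risk model, and compares the result with the preset value. The names `screen_records`, `risk_model` and `preset_value` are illustrative and do not appear in the scheme; the risk model is assumed to be any callable that maps a quality total index value to a data quality risk index.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]  # data quality index name -> measured value


def screen_records(
    records: List[Record],
    risk_model: Callable[[float], float],  # hypothetical: total index value -> risk index
    preset_value: float,
) -> Tuple[List[Record], List[Record]]:
    """Split records into (normal, abnormal) by their data quality risk index."""
    normal: List[Record] = []
    abnormal: List[Record] = []
    for index_values in records:
        # Quality total index value: sum of all data quality index values.
        total = sum(index_values.values())
        # Data quality risk index: the total substituted into the model.
        risk_index = risk_model(total)
        # Greater than the preset value means abnormal, otherwise normal.
        if risk_index > preset_value:
            abnormal.append(index_values)
        else:
            normal.append(index_values)
    return normal, abnormal
```

Only the first returned list, the normal data, would then be passed on as the data basis for the governance decision.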
Referring to FIG. 2, performing Monte Carlo simulation on the training data quality index values to construct the data quality risk prediction probability model specifically includes:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes the set probability value as input and takes the random variable value of the data quality index as output;
setting a data training index;
traversing the interval from 0 to 1 with the set data training index as the step size to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
Monte Carlo simulation is a numerical method for problems involving complex randomness and uncertainty. Its core idea is to approximate the solution or properties of a problem from a large number of random samples; as the number of samples increases, the simulation result approaches the true value. In this scheme, a data quality index random variable value calculation model is built that takes a set probability value as input and outputs a random variable value of the data quality index. After the set number of simulations has been run, the resulting random variable values are analyzed statistically to obtain the data quality risk prediction probability model.
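The sampling steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scheme's implementation: the data training index is read as a step size over the interval 0 to 1, each data quality index is described by a hypothetical `(mean, std)` pair, the inverse normal CDF comes from the standard library's `statistics.NormalDist`, and "random combination" is read as picking one random variable value per index for each group. The names `training_probabilities` and `simulate_totals` are invented for this example.

```python
import random
from statistics import NormalDist
from typing import Dict, List, Tuple


def training_probabilities(step: float) -> List[float]:
    """Probabilities step, 2*step, ... lying strictly inside (0, 1)."""
    if not 0.0 < step < 1.0:
        raise ValueError("step must lie strictly between 0 and 1")
    count = round(1.0 / step) - 1
    return [(i + 1) * step for i in range(count)]


def simulate_totals(
    params: Dict[str, Tuple[float, float]],  # index name -> (mean, std), assumed given
    step: float,
    n_groups: int,
    seed: int = 0,
) -> List[float]:
    """Return n_groups random variable total values."""
    rng = random.Random(seed)
    probs = training_probabilities(step)
    # Random variable values of each data quality index, one per probability.
    per_index = {
        name: [NormalDist(mean, std).inv_cdf(p) for p in probs]
        for name, (mean, std) in params.items()
    }
    totals = []
    for _ in range(n_groups):
        # One random variable value group: one value drawn per index.
        group = [rng.choice(values) for values in per_index.values()]
        totals.append(sum(group))
    return totals
```

For example, `simulate_totals({"accuracy": (0.9, 0.05), "timeliness": (0.8, 0.1)}, step=0.01, n_groups=2000)` yields 2000 totals scattered around 1.7.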
The most optimistic data quality index value refers to the maximum value of the training data quality index values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value among the training data quality index values.
Using the most optimistic, highest-frequency and most pessimistic data quality index values, the data quality index values can be bounded within an interval; the Monte Carlo simulation then covers most of the outcomes that data quality measurement could produce.
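The three anchor values defined above, together with the arithmetic mean, can be computed as in the following sketch. The names `Anchor` and `training_anchor` are illustrative. Two points are assumptions of the example rather than statements of the scheme: ties for the highest-frequency value are resolved in favour of the value seen first, and the values are compared exactly, so real-valued measurements may need rounding before counting.

```python
from collections import Counter
from statistics import fmean
from typing import NamedTuple, Sequence


class Anchor(NamedTuple):
    most_optimistic: float    # maximum training data quality index value
    highest_frequency: float  # most frequently occurring value
    most_pessimistic: float   # minimum training data quality index value


def training_anchor(values: Sequence[float]) -> Anchor:
    """Derive the training data anchor point from one index's training values."""
    counts = Counter(values)
    # most_common keeps first-seen order among equal counts (assumed tie-break).
    highest_frequency = counts.most_common(1)[0][0]
    return Anchor(max(values), highest_frequency, min(values))


values = [0.92, 0.88, 0.92, 0.75, 0.97]
anchor = training_anchor(values)
mean = fmean(values)  # arithmetic mean of all training data quality index values
```

The standard deviation is deliberately left out of the sketch; it would be computed from `mean` and `anchor` with the formula given in the scheme.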
The standard deviation of the training data quality index value is calculated as follows:
;
In the formula, the quantities are, in order: the standard deviation of the training data quality index values, the arithmetic mean of all training data quality index values, the most optimistic data quality index value, the highest-frequency data quality index value, and the most pessimistic data quality index value.
The expression of the data quality index random variable value calculation model is as follows:
$X = \operatorname{norminv}(p, \mu, \sigma)$;
where $X$ is the random variable value of the data quality index, $p$ is the set probability value, $\mu$ and $\sigma$ are the arithmetic mean and the standard deviation of the training data quality index values, and norminv is the inverse cumulative distribution function of the normal distribution.
It will be appreciated that norminv is a function commonly used in statistics and data analysis to evaluate the inverse cumulative distribution function of a normal distribution. Its input parameters are the probability value p and the mean and standard deviation of the normal distribution, and it returns the random variable value at which the cumulative distribution function equals the given probability value p. This scheme uses the norminv function to obtain the random variable values of the data quality indexes.
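To illustrate the inverse cumulative distribution function described above, the sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library as a stand-in for the function named in the scheme. The wrapper name `random_variable_value` and its parameter names are chosen for this example only.

```python
from statistics import NormalDist


def random_variable_value(p: float, mean: float, std: float) -> float:
    """Return x such that P(X <= x) = p for X ~ Normal(mean, std)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist(mu=mean, sigma=std).inv_cdf(p)


# With p = 0.5 the result is the mean itself, because the normal
# distribution is symmetric about its mean.
median_value = random_variable_value(0.5, mean=0.8, std=0.1)
```

Probabilities of exactly 0 or 1 are rejected because the corresponding quantiles of a normal distribution are infinite, which is why a traversal of the interval from 0 to 1 has to stay strictly inside it.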
Referring to FIG. 3, performing statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically includes:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
The number of training probability values determines how many random variable value groups can be formed. Each group is summed to give a random variable total value, and statistical analysis of these totals yields their cumulative probabilities. The data quality risk prediction probability curve is fitted from the cumulative probability corresponding to each total value; the more simulations are run, the more accurate the curve becomes. The data quality risk prediction probability for any quality total index value within the estimated interval can then be read from the curve.
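The statistical analysis above can be sketched as an empirical cumulative probability curve. Several choices here are assumptions of the example: totals are rounded to a fixed number of decimals so that occurrences can be counted, the accumulation is read as including the current total value (the usual empirical distribution function), and linear interpolation between curve points is swapped in for the mathematical expression fitting, whose form the scheme does not specify. The names `cumulative_curve` and `interpolated_model` are invented for this illustration.

```python
from bisect import bisect_right
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Curve = List[Tuple[float, float]]  # (random variable total value, cumulative probability)


def cumulative_curve(totals: Sequence[float], decimals: int = 2) -> Curve:
    """Occurrence probability per total value, accumulated in ascending order."""
    counts = Counter(round(t, decimals) for t in totals)
    n = len(totals)
    curve: Curve = []
    running = 0.0
    for value in sorted(counts):
        running += counts[value] / n  # add this value's occurrence probability
        curve.append((value, running))
    return curve


def interpolated_model(curve: Curve) -> Callable[[float], float]:
    """Stand-in for curve fitting: piecewise-linear interpolation of the curve."""
    xs = [x for x, _ in curve]
    ys = [y for _, y in curve]

    def model(total: float) -> float:
        if total < xs[0]:
            return 0.0
        if total >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, total)  # xs[i - 1] <= total < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (total - x0) / (x1 - x0)

    return model
```

The returned `model` maps a quality total index value to a cumulative probability and could serve as the risk model in the screening step.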
Furthermore, the scheme is based on the same inventive concept as the big data intelligent governance decision method based on machine learning, and also provides a big data intelligent governance decision system based on machine learning, which comprises:
the quality measurement module is used for determining an automated quality metric tool, performing metric assignment on the data quality indexes of the historical treatment data with the automated quality metric tool, and performing metric assignment on the data quality indexes of the data to be treated with the automated quality metric tool;
the risk model construction module is electrically connected with the quality measurement module and is used for performing Monte Carlo simulation on the training data quality index values to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
The risk model construction module comprises:
the data preprocessing unit is used for determining a training data anchor point based on the training data quality index value, calculating an arithmetic average value of all training data quality index values and calculating a standard deviation of the training data quality index value based on the arithmetic average value of all training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic mean value of the training data quality index value;
the training value determining unit is used for traversing the interval from 0 to 1, with the set data training index as the step between successive values, to obtain a plurality of training probability values, and for substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for calculating, for each random variable total value, the proportion of its number of occurrences in the number of occurrences of all random variable total values and recording that proportion as the occurrence probability of the random variable total value; accumulating, for each random variable total value, the occurrence probabilities of all random variable total values smaller than it and recording the result as the cumulative probability of the random variable total value; plotting the random variable total value on the x-axis and its cumulative probability on the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.
In summary, the invention has the following advantage: it effectively addresses the existing problems of normal data being falsely reported as abnormal and of genuinely abnormal data going unreported, so that high-quality data support is available when big data intelligent governance decisions are made and the accuracy of those decisions is ensured.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (8)
1. The big data intelligent governance decision-making method based on machine learning is characterized by comprising the following steps:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes, and the data evaluation index system comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety;
determining an automated quality metric tool for metric assignment to a data quality index, the automated quality metric tool being one or more of Trifacta, OpenRefine or DataWrangler;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
simulating the training data quality index value based on Monte Carlo, and constructing a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
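The last four steps of claim 1 (summing, substituting into the model, thresholding, and keeping the normal data) can be sketched as follows. This is only an illustration under stated assumptions: each record is represented as a hypothetical dictionary of data quality index values, `risk_model` is any callable that maps a quality total index value to a risk index, and the names `classify_records` and `threshold` are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]  # hypothetical: index name -> data quality index value


def classify_records(
    records: List[Record],
    risk_model: Callable[[float], float],
    threshold: float,
) -> Tuple[List[Record], List[Record]]:
    """Split records into (normal, abnormal) by their data quality risk index."""
    normal: List[Record] = []
    abnormal: List[Record] = []
    for record in records:
        # Quality total index value: the sum of all data quality index values.
        total = sum(record.values())
        # Data quality risk index obtained from the prediction probability model.
        risk = risk_model(total)
        # A risk index above the preset value marks the record as abnormal.
        if risk > threshold:
            abnormal.append(record)
        else:
            normal.append(record)
    return normal, abnormal
```

Only the first returned list would then be passed on as the data basis for the governance decision.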
2. The machine learning-based big data intelligent governance decision method of claim 1, wherein simulating the training data quality index values based on Monte Carlo and constructing the data quality risk prediction probability model specifically comprises:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes a set probability value as input and takes a random variable value of the data quality index as output;
setting a data training index;
traversing the interval from 0 to 1, with the set data training index as the step between successive values, to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
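The simulation steps of claim 2 can be sketched as below. Several points are assumptions rather than statements of the claimed method: the mean and standard deviation of each index are passed in directly as `index_params`, the endpoints 0 and 1 are skipped because the normal inverse cumulative distribution function is undefined there, and "random combination" is read as drawing one random variable value per index for each group. All function and parameter names are hypothetical.

```python
import random
from statistics import NormalDist


def training_probabilities(step):
    """Traverse the open interval (0, 1) with the given step."""
    count = round(1 / step)
    return [k / count for k in range(1, count)]


def simulate_totals(index_params, step=0.01, n_groups=10000, seed=0):
    """Return one random variable total value per simulated group.

    index_params: list of (mean, standard deviation) pairs, one per index.
    """
    rng = random.Random(seed)
    probs = training_probabilities(step)
    # Random variable values of each index: the inverse normal CDF
    # evaluated at every training probability value.
    values_per_index = [
        [NormalDist(mu, sigma).inv_cdf(p) for p in probs]
        for mu, sigma in index_params
    ]
    # Each group takes one randomly chosen value per index; its total is the sum.
    return [
        sum(rng.choice(values) for values in values_per_index)
        for _ in range(n_groups)
    ]
```

A smaller step produces more training probability values, and a larger `n_groups` produces more random variable total values for the subsequent statistical analysis.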
3. The machine learning-based big data intelligent governance decision method of claim 2, wherein the most optimistic data quality index value refers to the maximum value of the training data quality index values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value of the training data quality index values.
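The three anchor values defined in claim 3 map directly onto standard library functions. The sketch below is illustrative only; the function name and dictionary keys are hypothetical, and for continuous-valued indexes the values would probably need to be rounded or binned before a most frequent value is meaningful.

```python
from statistics import mode


def anchor_points(values):
    """Return the training data anchor point for one data quality index."""
    return {
        "most_optimistic": max(values),     # maximum training value
        "highest_frequency": mode(values),  # most frequently occurring value
        "most_pessimistic": min(values),    # minimum training value
    }
```

For example, the training values `[0.9, 0.8, 0.8, 0.7]` give 0.9, 0.8 and 0.7 as the most optimistic, highest frequency and most pessimistic values respectively.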
4. The machine learning-based big data intelligent governance decision method of claim 3, wherein the standard deviation of the training data quality index value is calculated according to the formula:
[formula not reproduced in the source text]
wherein the quantities appearing in the formula are, in order: the standard deviation of the training data quality index value, the arithmetic mean of all training data quality index values, the most optimistic data quality index value, the highest frequency data quality index value, and the most pessimistic data quality index value.
5. The machine learning-based big data intelligent governance decision method of claim 4, wherein the expression of the data quality index random variable value calculation model is:
[formula not reproduced in the source text]
wherein the output of the formula is the random variable value of the data quality index, p is the set probability value, and norm is the function that computes the inverse cumulative distribution function of the normal distribution.
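Since claim 2 states that the calculation model is built from the arithmetic mean and standard deviation and takes a set probability value as input, one plausible reading is that it evaluates the inverse cumulative distribution function of a normal distribution with those two parameters. The sketch below assumes that reading; the function name and argument order are hypothetical and are not taken from the source.

```python
from statistics import NormalDist


def random_variable_value(p, mean, std):
    """Inverse normal CDF at probability p (requires 0 < p < 1 and std > 0)."""
    return NormalDist(mean, std).inv_cdf(p)
```

Under this assumption, a set probability value of 0.5 returns the mean itself, and values of p closer to 0 or 1 move the result further below or above the mean.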
6. The machine learning-based big data intelligent governance decision method of claim 5, wherein carrying out statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically comprises:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
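The statistical analysis of claim 6 can be sketched as follows. Two choices here are assumptions: totals are rounded to `ndigits` decimals so that repeated values can be counted, and piecewise-linear interpolation stands in for the fitted mathematical expression, whose form the source does not specify. The function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def cumulative_curve(totals, ndigits=3):
    """Return (xs, ys): distinct totals and their cumulative probabilities."""
    counts = Counter(round(t, ndigits) for t in totals)
    n = len(totals)
    xs, ys, running = [], [], 0.0
    for x in sorted(counts):
        xs.append(x)
        # Cumulative probability: occurrence probabilities of all totals
        # strictly smaller than the current one, as worded in the claim.
        ys.append(running)
        running += counts[x] / n  # occurrence probability of this total
    return xs, ys


def make_risk_model(xs, ys):
    """Piecewise-linear stand-in for the fitted probability curve."""
    def model(total):
        if total <= xs[0]:
            return ys[0]
        if total >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, total)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (total - x0) / (x1 - x0)
    return model
```

The callable returned by `make_risk_model` maps any quality total index value to a probability read off the curve, with values outside the simulated range clamped to the curve's end points.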
7. A machine learning based big data intelligent governance decision system for implementing the machine learning based big data intelligent governance decision method of any of claims 1-6, comprising:
the quality measurement module is used for determining an automatic quality measurement tool, carrying out measurement assignment on the data quality index of the historical treatment data by adopting the automatic quality measurement tool, and carrying out measurement assignment on the data quality index of the data to be treated by adopting the automatic quality measurement tool;
the risk model construction module is electrically connected with the quality measurement module and is used for simulating the training data quality index values based on Monte Carlo to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
8. The machine learning-based big data intelligent governance decision system of claim 7, wherein the risk model construction module comprises:
a data preprocessing unit for determining a training data anchor point based on the training data quality index values, calculating an arithmetic average value of all the training data quality index values, and calculating a standard deviation of the training data quality index values based on the arithmetic average value of all the training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value;
the training value determining unit is used for traversing the interval from 0 to 1, with the set data training index as the step between successive values, to obtain a plurality of training probability values, and for substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for calculating, for each random variable total value, the proportion of its number of occurrences in the number of occurrences of all random variable total values and recording that proportion as the occurrence probability of the random variable total value; accumulating, for each random variable total value, the occurrence probabilities of all random variable total values smaller than it and recording the result as the cumulative probability of the random variable total value; plotting the random variable total value on the x-axis and its cumulative probability on the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311558280.3A CN117273552B (en) | 2023-11-22 | 2023-11-22 | Big data intelligent treatment decision-making method and system based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117273552A (en) | 2023-12-22 |
CN117273552B (en) | 2024-02-13 |
Family
ID=89201250
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311558280.3A Active CN117273552B (en) | 2023-11-22 | 2023-11-22 | Big data intelligent treatment decision-making method and system based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117273552B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN117273552B (en) | 2024-02-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||