CN117273552A - Big data intelligent treatment decision-making method and system based on machine learning - Google Patents

Info

Publication number
CN117273552A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311558280.3A
Other languages
Chinese (zh)
Other versions
CN117273552B (en)
Inventor
苗敬峰
胥继云
夏敏
周芳
张新军
张迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Shunguo Electronic Technology Co ltd
Original Assignee
Shandong Shunguo Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Shunguo Electronic Technology Co ltd
Priority to CN202311558280.3A
Publication of CN117273552A
Application granted
Publication of CN117273552B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a big data intelligent treatment decision-making method and system based on machine learning, relating to the technical field of data management. The method comprises the following steps: constructing a data evaluation index system; determining an automated quality metric tool; acquiring historical treatment data in a large database; obtaining a plurality of training data quality index values; constructing a data quality risk prediction probability model; obtaining a plurality of data quality index values of the data to be treated; obtaining a quality total index value of the data to be treated; obtaining a data quality risk index of the data to be treated; and judging whether the data quality risk index of the data to be treated is larger than a preset value, thereby screening out the normal data and taking all the normal data as the data basis of the big data intelligent treatment decision. The advantages of the invention are that it effectively addresses the existing problem of normal data being falsely reported as abnormal or genuinely abnormal data going unreported, improves data quality, and ensures the accuracy of big data intelligent treatment decisions.

Description

Big data intelligent treatment decision-making method and system based on machine learning
Technical Field
The invention relates to the technical field of data management, and in particular to a big data intelligent treatment decision-making method and system based on machine learning.
Background
Data quality refers to the degree to which the characteristics and attributes of data meet particular needs and expectations. Data quality has become critical in the modern information age: accurate, reliable, high-quality data is of great importance to individuals, businesses and society. High-quality data is the basis for making intelligent decisions, can reduce operating and maintenance costs, can help establish and maintain strong customer relationships, and gives researchers and analysts a basis for identifying trends, making predictions and supporting decisions. In short, it is important to keep data quality high.
In the existing big data treatment decision process, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift and the like cause normal data to be falsely reported as abnormal or genuinely abnormal data to be missed. As a result, big data intelligent treatment decisions struggle to obtain high-quality data support, and the final decision strategy struggles to follow the big data trend as closely as possible.
Disclosure of Invention
To solve the above technical problems, this technical scheme provides a big data intelligent treatment decision method and system based on machine learning. It addresses the problems that, in the existing big data treatment decision process, imperfect data sources, insufficient data cleaning, unreasonable parameter settings, data drift and the like cause normal data to be falsely reported as abnormal or genuinely abnormal data to be missed, so that big data intelligent treatment decisions struggle to obtain high-quality data support and the final decision strategy struggles to follow the big data trend as closely as possible.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a big data intelligent governance decision-making method based on machine learning comprises the following steps:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes, and the data evaluation index system comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety;
determining an automated quality metric tool for metric assignment to a data quality index, the automated quality metric tool being one or more of Trifacta, OpenRefine or DataWrangler;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
simulating the training data quality index value based on Monte Carlo, and constructing a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
Preferably, simulating the training data quality index value based on Monte Carlo specifically includes:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes a set probability value as input and takes a random variable value of the data quality index as output;
setting a data training index;
traversing the interval from 0 to 1, using the set data training index as the step size, to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
Preferably, the most optimistic data quality index value refers to the maximum value of training data quality index values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value of the training data quality index values.
Preferably, the standard deviation of the training data quality index value is calculated according to the following formula:
[Formula image not reproduced in this text.] The terms of the formula are, in order: the standard deviation of the training data quality index value; the arithmetic mean of all training data quality index values; the most optimistic data quality index value; the highest frequency data quality index value; and the most pessimistic data quality index value.
Preferably, the expression of the data quality index random variable value calculation model is:
[Formula image not reproduced in this text.] In the formula, the output is the random variable value of the data quality index, p is the set probability value, and norm is the function that evaluates the inverse cumulative distribution function of the normal distribution.
Preferably, performing statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically includes:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
Further, a big data intelligent governance decision system based on machine learning is provided, which is used to implement the above big data intelligent governance decision method based on machine learning and comprises:
the quality measurement module is used for determining an automatic quality measurement tool, carrying out measurement assignment on the data quality index of the historical treatment data by adopting the automatic quality measurement tool, and carrying out measurement assignment on the data quality index of the data to be treated by adopting the automatic quality measurement tool;
the risk model construction module is electrically connected with the quality measurement module, and the risk model construction module is used for simulating the training data quality index value based on Monte Carlo to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
Optionally, the risk model building module includes:
a data preprocessing unit for determining a training data anchor point based on the training data quality index values, calculating an arithmetic average value of all the training data quality index values, and calculating a standard deviation of the training data quality index values based on the arithmetic average value of all the training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value;
the training value determining unit is used for traversing the interval from 0 to 1, using the set data training index as the step size, to obtain a plurality of training probability values, and for substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for: calculating the proportion of the occurrence count of each random variable total value in the occurrence count of all random variable total values, and recording the proportion as the occurrence probability of that random variable total value; accumulating, based on the occurrence probability of each random variable total value, the occurrence probabilities of all random variable total values smaller than the current random variable total value, and recording the result as the cumulative probability of that random variable total value; taking the random variable total value as the x-axis and the cumulative probability of the random variable total value as the y-axis to obtain a data quality risk prediction probability curve; and performing mathematical expression fitting on the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.
Compared with the prior art, the invention has the beneficial effects that:
according to the scheme, historical treatment data in a large database is simulated based on Monte Carlo to construct the data quality risk prediction probability model; the data quality index values of the actual data to be treated are then substituted into the data quality risk prediction probability model to obtain the data quality risk index, and whether the data is abnormal is judged according to the data quality risk index.
Drawings
FIG. 1 is a flow chart of a big data intelligent treatment decision-making method based on machine learning according to the scheme;
FIG. 2 is a flow chart of a method for constructing a data quality risk prediction probability model in the present solution;
FIG. 3 is a flowchart of a method for obtaining a data quality risk prediction probability model in the present solution.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to fig. 1, a big data intelligent governance decision method based on machine learning includes:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes and comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety. Different organizations and projects focus on different data quality indexes, so to ensure the credibility and validity of the data in the analysis and decision process, the data quality indexes must be determined according to the specific data requirements and business background;
determining an automated quality metric tool, wherein the automated quality metric tool is used for performing metric assignment on the data quality index and is one or more of Trifacta, OpenRefine or DataWrangler. Different automated quality metric tools suit different data types: for example, Apache Flink is suitable for real-time data processing and batch processing, and its low latency, high throughput and fault tolerance make it suitable for processing real-time big data; pandas and NumPy can be used for processing large-scale data and are particularly suitable for data analysis and cleaning;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
simulating the training data quality index value based on Monte Carlo, and constructing a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
According to the scheme, historical treatment data in a large database is simulated based on Monte Carlo to construct the data quality risk prediction probability model; the data quality index values of the actual data to be treated are then substituted into the data quality risk prediction probability model to obtain the data quality risk index, and whether the data is abnormal is judged according to the data quality risk index.
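As a minimal sketch of the last steps of the method (summing the index values, querying the model, and comparing the risk index with the preset value), the following Python fragment may be helpful. The names `classify_record` and `toy_model`, the sample index values and the preset value 0.9 are illustrative assumptions and do not come from the scheme; `toy_model` is only a stand-in for the fitted data quality risk prediction probability model.

```python
from typing import Callable, Dict


def classify_record(
    index_values: Dict[str, float],
    risk_model: Callable[[float], float],
    preset_value: float,
) -> str:
    """Sum the index values, query the risk model, and apply the threshold."""
    total_index_value = sum(index_values.values())
    risk_index = risk_model(total_index_value)
    # As stated in the text: a risk index larger than the preset value is abnormal.
    return "abnormal" if risk_index > preset_value else "normal"


def toy_model(total: float) -> float:
    """Hypothetical stand-in model: clamps total / 3 into [0, 1]."""
    return min(1.0, max(0.0, total / 3.0))


# Made-up index values for two records (assumption, for illustration only).
records = [
    {"accuracy": 0.9, "consistency": 0.8, "timeliness": 0.7},
    {"accuracy": 1.0, "consistency": 1.0, "timeliness": 0.9},
]
labels = [classify_record(r, toy_model, preset_value=0.9) for r in records]

# Only the records judged normal are kept as the basis for the decision.
normal_records = [r for r, label in zip(records, labels) if label == "normal"]
```

Passing the model in as a callable keeps the threshold logic independent of how the probability model was fitted.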
Referring to FIG. 2, simulating the training data quality index value based on Monte Carlo to construct a data quality risk prediction probability model specifically includes:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes the set probability value as input and takes the random variable value of the data quality index as output;
setting a data training index;
traversing the interval from 0 to 1, using the set data training index as the step size, to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
Monte Carlo simulation is a numerical calculation method for solving complex problems involving randomness and uncertainty. Its core idea is to approximate the solution or properties of a problem through a large number of random samples; as the number of samples increases, the simulation result gets closer to the true value. In this scheme, a data quality index random variable value calculation model is built for the Monte Carlo simulation, taking a set probability value as input and the random variable value of the data quality index as output. After the set number of simulations has been run, the resulting random variable values of the data quality indexes are statistically analyzed to obtain the data quality risk prediction probability model.
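The simulation steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the scheme's implementation: each data quality index is assumed to be summarized by a (mean, standard deviation) pair, Python's `statistics.NormalDist.inv_cdf` stands in for the inverse normal function, and the step 0.01, the number of groups and the example parameters are made up.

```python
import random
from statistics import NormalDist
from typing import List, Sequence, Tuple


def training_probabilities(step: float) -> List[float]:
    """Probabilities strictly between 0 and 1, spaced by `step`."""
    count = round(1.0 / step)
    return [k * step for k in range(1, count)]


def simulate_totals(
    indicator_params: Sequence[Tuple[float, float]],
    step: float,
    n_groups: int,
    seed: int = 0,
) -> List[float]:
    """Draw random combinations of per-index values and sum each group."""
    rng = random.Random(seed)
    probabilities = training_probabilities(step)
    # One list of random variable values per data quality index.
    values_per_indicator = [
        [NormalDist(mean, std).inv_cdf(p) for p in probabilities]
        for mean, std in indicator_params
    ]
    # Each group picks one value per index at random; the group total is the sum.
    return [
        sum(rng.choice(values) for values in values_per_indicator)
        for _ in range(n_groups)
    ]


# Two hypothetical indexes with assumed (mean, standard deviation) pairs.
totals = simulate_totals([(0.80, 0.05), (0.90, 0.03)], step=0.01, n_groups=2000)
```

The endpoints 0 and 1 are excluded from the probability grid because the inverse normal function is undefined there.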
The most optimistic data quality index value refers to the maximum value of the training data quality index values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value among the training data quality index values.
Based on the most optimistic, highest frequency and most pessimistic data quality index values, the data quality index values can be bounded within an interval, and Monte Carlo simulation is then performed to simulate most of the outcomes that data quality measurement may produce.
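The three anchor values and the arithmetic mean can be computed along the following lines. This is a small sketch with made-up sample values; the function name `training_anchor` and the tie-breaking rule (the first most frequent value encountered) are assumptions, since the text does not say how ties are resolved.

```python
from statistics import fmean, multimode
from typing import Dict, Sequence


def training_anchor(values: Sequence[float]) -> Dict[str, float]:
    """Return the most optimistic, highest frequency and most pessimistic values."""
    return {
        "most_optimistic": max(values),              # maximum value
        "highest_frequency": multimode(values)[0],   # first most frequent value
        "most_pessimistic": min(values),             # minimum value
    }


# Hypothetical training data quality index values for a single index.
values = [0.8, 0.9, 0.9, 0.7, 0.85]
anchor = training_anchor(values)
average = fmean(values)  # arithmetic mean of all training values
```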
The standard deviation of the training data quality index value is calculated as follows:
[Formula image not reproduced in this text.] The terms of the formula are, in order: the standard deviation of the training data quality index value; the arithmetic mean of all training data quality index values; the most optimistic data quality index value; the highest frequency data quality index value; and the most pessimistic data quality index value.
The expression of the data quality index random variable value calculation model is as follows:
[Formula image not reproduced in this text.] In the formula, the output is the random variable value of the data quality index, p is the set probability value, and norm is the function that evaluates the inverse cumulative distribution function of the normal distribution.
It will be appreciated that the norm function is commonly used in statistics and data analysis to evaluate the inverse cumulative distribution function of a normal distribution. Its input parameters are the probability value p and the mean and standard deviation of the normal distribution, and it returns the random variable value at which the cumulative distribution function equals the given probability value p. The scheme uses this function to obtain the random variable values of the data quality index.
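For illustration only, the inverse cumulative distribution lookup described here can be reproduced with Python's standard library. The sketch assumes that the function named in the text behaves like `statistics.NormalDist.inv_cdf`; the wrapper name `random_variable_value` and the example mean 0.8 and standard deviation 0.05 are hypothetical.

```python
from statistics import NormalDist


def random_variable_value(p: float, mean: float, std: float) -> float:
    """Value x of a normal(mean, std) variable whose cumulative probability is p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist(mean, std).inv_cdf(p)


# Hypothetical index with mean 0.8 and standard deviation 0.05.
median_value = random_variable_value(0.5, 0.8, 0.05)   # the mean itself
upper_value = random_variable_value(0.9, 0.8, 0.05)    # above the mean
lower_value = random_variable_value(0.1, 0.8, 0.05)    # below the mean
```

Probabilities above 0.5 map to values above the mean and probabilities below 0.5 map to values below it, symmetrically around the mean.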
Referring to FIG. 3, performing statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically includes:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
Based on the number of training probability values, a plurality of random variable value groups can be obtained. The values in each group are summed to obtain a plurality of random variable total values, on which statistical analysis is performed to calculate the cumulative probability of each random variable total value. A data quality risk prediction probability curve is then fitted using the cumulative probabilities corresponding to the random variable total values. The more simulations are run, the more accurate the resulting curve, and the data quality risk prediction probability of any quality total index value within the estimated interval can be read from the curve.
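The statistical analysis above can be sketched as follows. The text does not specify the form of the fitted mathematical expression, so this sketch substitutes piecewise-linear interpolation between curve points as a simple stand-in; rounding the totals to two decimals so that repeated values can be counted is likewise an assumption, as are the function names.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence, Tuple


def cumulative_curve(
    totals: Sequence[float], ndigits: int = 2
) -> Tuple[List[float], List[float]]:
    """Return (x, y) points: total value vs. cumulative probability."""
    counts = Counter(round(t, ndigits) for t in totals)
    n = len(totals)
    xs = sorted(counts)
    ys = []
    running = 0.0
    for x in xs:
        # Cumulative probability = probability of all strictly smaller totals.
        ys.append(running)
        running += counts[x] / n  # occurrence probability of this total
    return xs, ys


def risk_index(total: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Read the curve at `total` by linear interpolation (stand-in for fitting)."""
    if total <= xs[0]:
        return ys[0]
    if total >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, total)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (total - x0) / (x1 - x0)


# Tiny made-up set of random variable total values.
xs, ys = cumulative_curve([1.0, 2.0, 2.0, 3.0])
```

Because the accumulation covers only strictly smaller totals, as the text describes, the largest total maps to a cumulative probability below 1.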
Furthermore, the scheme is based on the same inventive concept as the big data intelligent governance decision method based on machine learning, and also provides a big data intelligent governance decision system based on machine learning, which comprises:
the quality measurement module is used for determining an automatic quality measurement tool, carrying out measurement assignment on the data quality index of the historical treatment data by adopting the automatic quality measurement tool, and carrying out measurement assignment on the data quality index of the data to be treated by adopting the automatic quality measurement tool;
the risk model construction module is electrically connected with the quality measurement module, and the risk model construction module is used for simulating the training data quality index value based on Monte Carlo to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
The risk model construction module comprises:
the data preprocessing unit is used for determining a training data anchor point based on the training data quality index value, calculating an arithmetic average value of all training data quality index values and calculating a standard deviation of the training data quality index value based on the arithmetic average value of all training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic mean value of the training data quality index value;
the training value determining unit is used for traversing the values in the value interval of 0-1 by taking the set data training indexes as the value interval to obtain a plurality of training probability values and substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for calculating, for each random variable total value, the proportion of its number of occurrences among the occurrences of all random variable total values and recording this proportion as the occurrence probability of that random variable total value; accumulating the occurrence probabilities of all random variable total values smaller than the current random variable total value and recording the sum as the cumulative probability of the current random variable total value; taking the random variable total value as the x-axis and its cumulative probability as the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.
In summary, the invention has the following advantage: it effectively solves the existing problems of false alarms, in which normal data is reported as abnormal, and missed reports, in which genuinely abnormal data goes unreported, so that high-quality data support is available when making big data intelligent governance decisions and the accuracy of those decisions is ensured.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the invention, and that various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. The big data intelligent governance decision-making method based on machine learning is characterized by comprising the following steps:
constructing a data evaluation index system, wherein the data evaluation index system consists of a plurality of data quality indexes, and the data evaluation index system comprises one or more of accuracy, consistency, reliability, timeliness, uniqueness, validity, understandability, compliance and safety;
determining an automated quality metric tool for metric assignment to a data quality index, the automated quality metric tool being one or more of Trifacta, OpenRefine or DataWrangler;
acquiring historical treatment data in a large database;
performing measurement and assignment on the data quality index of the historical treatment data by adopting an automatic quality measurement tool to obtain a plurality of training data quality index values;
simulating the training data quality index value based on Monte Carlo, and constructing a data quality risk prediction probability model;
performing measurement and assignment on the data quality index of the data to be treated by adopting an automatic quality measurement tool to obtain a plurality of data quality index values of the data to be treated;
summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated;
substituting the total quality index value of the data to be treated into a data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal data, and if not, judging that the data to be treated is normal data;
and taking all normal data as a data basis for big data intelligent governance decision-making.
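The final steps of claim 1 (summing the index values, evaluating the risk model, and comparing against the preset value) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `classify_record`, `risk_model`, `preset_value` and `toy_model` are hypothetical, and the toy model is a stand-in for the fitted data quality risk prediction probability model.

```python
from typing import Callable, Sequence


def classify_record(
    index_values: Sequence[float],
    risk_model: Callable[[float], float],
    preset_value: float,
) -> str:
    """Label one record of data to be treated as 'abnormal' or 'normal'."""
    total = sum(index_values)        # quality total index value
    risk_index = risk_model(total)   # data quality risk index
    # The claim flags a record only when the risk index is strictly larger
    # than the preset value.
    return "abnormal" if risk_index > preset_value else "normal"


def toy_model(total: float) -> float:
    """Hypothetical stand-in model: a linear ramp clamped to [0, 1]."""
    return min(1.0, max(0.0, total / 10.0))
```

Only records labelled `"normal"` would then be passed on as the data basis for the governance decision.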
2. The machine learning-based big data intelligent governance decision method of claim 1, wherein simulating the training data quality index values based on Monte Carlo and constructing the data quality risk prediction probability model specifically comprises:
determining a training data anchor point based on the training data quality index value, wherein the training data anchor point is composed of the most optimistic data quality index value, the highest frequency data quality index value and the most pessimistic data quality index value;
calculating the arithmetic average value of all the training data quality index values;
calculating a standard deviation of the training data quality index value based on the arithmetic mean of all the training data quality index values and the training data anchor point;
constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value, wherein the data quality index random variable value calculation model takes a set probability value as input and takes a random variable value of the data quality index as output;
setting a data training index;
traversing the values in the value interval of 0-1 by taking the set data training indexes as the value interval to obtain a plurality of training probability values;
substituting the training probability value into a data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
and carrying out statistical analysis based on the total values of the random variables to obtain a data quality risk prediction probability model.
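The random combination and summation steps of claim 2 can be sketched as below. The claim does not say how the groups are drawn, so this sketch assumes one value is picked uniformly and independently per data quality index for each group; `simulate_totals`, `n_groups` and `seed` are hypothetical names introduced for illustration.

```python
import random
from typing import List, Sequence


def simulate_totals(
    values_per_index: Sequence[Sequence[float]],
    n_groups: int,
    seed: int = 0,
) -> List[float]:
    """Draw random variable value groups and return their total values.

    values_per_index[i] holds the random variable values already computed
    for the i-th data quality index.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    totals = []
    for _ in range(n_groups):
        # Assumption: one value per index, chosen uniformly at random.
        group = [rng.choice(values) for values in values_per_index]
        totals.append(sum(group))
    return totals
```

With two indexes holding `[1.0, 2.0]` and `[10.0, 20.0]`, every total falls in `{11.0, 12.0, 21.0, 22.0}`; the relative frequencies of these totals are what the later statistical analysis works on.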
3. The machine learning based big data intelligent governance decision method of claim 2, wherein the most optimistic data quality indicator value refers to a maximum value of training data quality indicator values;
the highest frequency data quality index value refers to the value with the highest occurrence frequency in the training data quality index values;
the most pessimistic data quality index value refers to the minimum value of the training data quality index values.
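The three anchor values defined in claim 3 could be computed along these lines. The `Anchor` type and `training_anchor` function are hypothetical names, and the tie-breaking rule for the highest-frequency value (first value encountered among ties) is an assumption, since the claim does not address ties.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class Anchor(NamedTuple):
    most_optimistic: float    # maximum training data quality index value
    highest_frequency: float  # most frequently occurring value
    most_pessimistic: float   # minimum training data quality index value


def training_anchor(values: Sequence[float]) -> Anchor:
    """Build the training data anchor point from the index values."""
    if not values:
        raise ValueError("at least one training data quality index value is required")
    # most_common(1) returns the first-encountered value when counts tie.
    mode_value, _count = Counter(values).most_common(1)[0]
    return Anchor(max(values), mode_value, min(values))
```

For example, `training_anchor([0.9, 0.7, 0.9, 0.5])` yields an anchor with maximum 0.9, most frequent value 0.9 and minimum 0.5.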
4. The machine learning-based big data intelligent governance decision method of claim 3, wherein the standard deviation of the training data quality index value is calculated according to the formula:
In the formula, the symbols denote, in order: the standard deviation of the training data quality index value; the arithmetic mean of all training data quality index values; the most optimistic data quality index value; the highest-frequency data quality index value; and the most pessimistic data quality index value.
5. The machine learning-based big data intelligent governance decision method of claim 4, wherein the expression of the data quality index random variable value calculation model is:
In the formula, the output is the random variable value of the data quality index, p is the set probability value, and norm is the inverse cumulative distribution function of the normal distribution.
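A minimal sketch of the random variable value calculation model, assuming it is the inverse normal CDF parameterised by the arithmetic mean and standard deviation of the training data quality index values (claim 2 names these as its basis). The function names and the `step` parameter, which plays the role of the set data training index, are hypothetical; the endpoints 0 and 1 are excluded here because the inverse CDF is undefined there, which is also an assumption about how the traversal is meant.

```python
from statistics import NormalDist
from typing import List


def probability_grid(step: float) -> List[float]:
    """Return step, 2*step, ... for all multiples strictly inside (0, 1)."""
    if not 0.0 < step < 1.0:
        raise ValueError("step must lie strictly between 0 and 1")
    grid = []
    k = 1
    while k * step < 1.0 - 1e-12:  # small tolerance against float rounding
        grid.append(k * step)
        k += 1
    return grid


def random_variable_values(mean: float, std: float, step: float) -> List[float]:
    """Map each training probability value to a random variable value."""
    dist = NormalDist(mu=mean, sigma=std)
    # inv_cdf(p) is the value x such that P(X <= x) = p.
    return [dist.inv_cdf(p) for p in probability_grid(step)]
```

With a step of 0.25 the grid is 0.25, 0.5 and 0.75, and the middle value maps back to the mean because the normal distribution is symmetric.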
6. The machine learning-based big data intelligent governance decision method of claim 5, wherein carrying out statistical analysis based on the plurality of random variable total values to obtain the data quality risk prediction probability model specifically comprises:
calculating the proportion of the occurrence times of each random variable total value in the occurrence times of all random variable total values, and recording the proportion as the occurrence probability of the random variable total values;
accumulating the occurrence probability of all the random variable total values smaller than the current random variable total value based on the occurrence probability of each random variable total value, and recording the accumulated occurrence probability as the accumulated probability of the random variable total value;
taking the total value of the random variable as an x axis and the cumulative probability of the total value of the random variable as a y axis to obtain a data quality risk prediction probability curve;
and carrying out mathematical expression fitting on the data quality risk prediction probability curve to obtain a data quality risk prediction probability model.
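The statistical analysis of claim 6 can be illustrated as follows. The cumulative probability follows the claim's wording literally (mass of totals strictly smaller than the current one). The claim does not specify the form of the fitted mathematical expression, so piecewise-linear interpolation is used here purely as a stand-in; `cumulative_curve` and `fit_piecewise_linear` are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def cumulative_curve(totals: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (total value, cumulative probability) points sorted by total."""
    if not totals:
        raise ValueError("at least one random variable total value is required")
    counts = Counter(totals)
    n = len(totals)
    curve = []
    cumulative = 0.0
    for value in sorted(counts):
        # Probability of all totals strictly smaller than `value`.
        curve.append((value, cumulative))
        cumulative += counts[value] / n  # add this value's occurrence probability
    return curve


def fit_piecewise_linear(curve: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Stand-in for curve fitting: interpolate linearly between the points."""
    xs = [x for x, _ in curve]
    ys = [y for _, y in curve]

    def model(total: float) -> float:
        if total <= xs[0]:
            return ys[0]
        if total >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, total)  # xs[i - 1] <= total < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (total - x0) / (x1 - x0)

    return model
```

An inclusive accumulation (totals less than or equal to the current one) would give the usual empirical distribution function; which variant the patent intends is not clear from the text.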
7. A machine learning based big data intelligent governance decision system for implementing the machine learning based big data intelligent governance decision method of any of claims 1-6, comprising:
the quality measurement module is used for determining an automatic quality measurement tool, carrying out measurement assignment on the data quality index of the historical treatment data by adopting the automatic quality measurement tool, and carrying out measurement assignment on the data quality index of the data to be treated by adopting the automatic quality measurement tool;
the risk model construction module is electrically connected with the quality measurement module, and the risk model construction module is used for simulating the training data quality index values based on Monte Carlo to construct a data quality risk prediction probability model;
the data risk calculation module is electrically connected with the quality measurement module and the risk model construction module, and is used for summing a plurality of data quality index values of the data to be treated to obtain a quality total index value of the data to be treated and substituting the quality total index value of the data to be treated into the data quality risk prediction probability model to obtain a data quality risk index of the data to be treated;
the data analysis module is electrically connected with the data risk calculation module and is used for judging whether the data quality risk index of the data to be treated is larger than a preset value, if so, judging that the data to be treated is abnormal, and if not, judging that the data to be treated is normal.
8. The machine learning based big data intelligent governance decision system of claim 7, wherein the risk model construction module comprises:
a data preprocessing unit for determining a training data anchor point based on the training data quality index values, calculating an arithmetic average value of all the training data quality index values, and calculating a standard deviation of the training data quality index values based on the arithmetic average value of all the training data quality index values and the training data anchor point;
the model unit is used for constructing a data quality index random variable value calculation model corresponding to the training data quality index value based on the standard deviation of the training data quality index value and the arithmetic average value of the training data quality index value;
the training value determining unit is used for traversing the values in the value interval of 0-1 by taking the set data training indexes as the value interval to obtain a plurality of training probability values and substituting the training probability values into the data quality index random variable value calculation model to obtain a plurality of random variable values of the data quality index;
the random combination unit is used for randomly combining all random variable values of all data quality indexes to obtain a plurality of groups of random variable value groups;
the summation unit is used for summing all random variable values in each random variable value group to obtain a plurality of random variable total values;
the model fitting unit is used for calculating, for each random variable total value, the proportion of its number of occurrences among the occurrences of all random variable total values and recording this proportion as the occurrence probability of that random variable total value; accumulating the occurrence probabilities of all random variable total values smaller than the current random variable total value and recording the sum as the cumulative probability of the current random variable total value; taking the random variable total value as the x-axis and its cumulative probability as the y-axis to obtain a data quality risk prediction probability curve; and fitting a mathematical expression to the data quality risk prediction probability curve to obtain the data quality risk prediction probability model.
CN202311558280.3A 2023-11-22 2023-11-22 Big data intelligent treatment decision-making method and system based on machine learning Active CN117273552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311558280.3A CN117273552B (en) 2023-11-22 2023-11-22 Big data intelligent treatment decision-making method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN117273552A (en) 2023-12-22
CN117273552B (en) 2024-02-13

Family

ID=89201250

Country Status (1)

Country Link
CN (1) CN117273552B (en)

Also Published As

Publication number Publication date
CN117273552B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant