CN118332034B - Data mining model construction method and system based on machine learning - Google Patents

Data mining model construction method and system based on machine learning

Info

Publication number
CN118332034B
CN118332034B
Authority
CN
China
Prior art keywords
data
model
module
data mining
feature
Prior art date
Legal status
Active
Application number
CN202410775200.8A
Other languages
Chinese (zh)
Other versions
CN118332034A (en)
Inventor
常兆鑫
於静芬
Current Assignee
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanchang Institute of Technology
Priority to CN202410775200.8A
Publication of CN118332034A
Application granted
Publication of CN118332034B
Status: Active


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data mining model construction method and system based on machine learning, relating to the field of electric digital data processing. The method first performs data acquisition and preprocessing, then evaluates data quality with a multivariate comprehensive analysis model and applies secondary processing, and then performs feature extraction and feature fusion on the data. An adaptive optimal decision model selects the optimal initial data mining model, which is then trained. A component dynamic simulation test mechanism and test set samples simulate the operation process of the constructed model, the simulated operation process is visually displayed on a visualization platform, and finally model parameters are adjusted and updated through feedback adjustment.

Description

Data mining model construction method and system based on machine learning
Technical Field
The invention relates to the field of electric digital data processing, in particular to a data mining model construction method and system based on machine learning.
Background
With the advent of the internet and the big data era, the volume of data accumulated by organizations and enterprises keeps expanding, and the generation speed, scale and complexity of data keep increasing. This data contains rich information, but valuable knowledge cannot be obtained from it directly, so extracting valuable information from such data has become an urgent problem. Data mining technology can automatically discover useful information such as patterns, associations and trends from large amounts of data and provides powerful support for enterprise decision making. Machine learning, as one of the important means of data mining, allows computer systems to automatically acquire knowledge from data by simulating the human learning process and to continuously optimize and improve their own performance. This motivates a data mining model construction method and system based on machine learning.
However, data mining models constructed by existing methods and systems often suffer from poor adaptability and reliability, mainly for the following reasons: existing construction methods and systems rely on only a single or limited data source, so the constructed model struggles to adapt to complex data environments; as data is continuously updated the data mining model must be continuously updated as well, yet existing methods and systems require manual intervention, which weakens the generalization ability of the model, makes it unsuitable for data mining tasks in different scenarios, and reduces its adaptability; and data often contains noise, missing values and outliers, whose incomplete handling can seriously affect the reliability of the data mining model. In addition, existing approaches lack a method for performing performance simulation tests and feedback optimization on the constructed model.
Therefore, a method and a system for constructing a data mining model based on machine learning are needed to solve the above problems.
Disclosure of Invention
In view of this situation, and in order to overcome the defects of the prior art, the invention aims to provide a data mining model construction method and system based on machine learning, which construct a data mining model and improve the adaptability of the constructed model and the accuracy of the data mining results through data quality evaluation, feature processing, model selection, training, test evaluation and optimization.
The invention adopts the following technical scheme:
a data mining model construction method based on machine learning comprises the following steps:
The method comprises the steps that data are collected and preprocessed through a data collection module and a data preprocessing module, the data collection module collects multi-source data in real time through an application program access interface and a data sensor network, the data preprocessing module cleans and converts the collected data through a real-time stream processing engine, and the multi-source data at least comprise enterprise file data, database data and business system data;
The data quality assessment module performs data quality assessment on the preprocessed data: it adopts a multivariate comprehensive analysis model to comprehensively evaluate data integrity, accuracy, consistency and correlation, performs secondary processing on the residual data identified by the assessment through a self-supervised representation learning model, and adopts multithreaded parallel computing to improve data quality assessment efficiency;
The data characteristic processing module is used for carrying out characteristic extraction and fusion on the data subjected to secondary processing and dividing the data characteristics into a training set, a verification set and a test set, and comprises a characteristic extraction unit, a characteristic dimension reduction unit, a characteristic fusion unit and a characteristic division unit;
Selecting a data mining model through a model selection module and training the selected initial data mining model through a model training module, wherein the model selection module selects an optimal initial data mining model through an adaptive optimal decision model, the adaptive optimal decision model selects the functional attributes and model parameter attributes of the initial data mining model according to the data features and the target task, and the model training module trains the model with training set samples;
Performing a performance simulation test on the trained model through a model test evaluation module, wherein the model test evaluation module adopts a component dynamic simulation test mechanism and test set samples to simulate the running process of the constructed model, and adopts the Plotly visualization platform to visually display the simulated running process;
And optimizing the constructed model by a model optimization module, wherein the model optimization module adjusts and updates model parameters in a feedback adjustment mode.
Further, the multivariate comprehensive analysis model realizes comprehensive evaluation of data quality by weighting different evaluation indexes. The preprocessed data set is D = {d_1, d_2, …, d_n}, where n is the number of preprocessed data items and d_i is the i-th preprocessed data item, whose integrity, accuracy, consistency and correlation index values are C_i, A_i, S_i and R_i; C denotes the integrity index, A the accuracy index, S the consistency index and R the correlation index. The comprehensive evaluation output for the i-th preprocessed data item is given by formula (1), in which F_i denotes the comprehensive evaluation value of the integrity, accuracy, consistency and correlation indexes of the i-th preprocessed data item; w_C, w_A, w_S and w_R denote the weights of the integrity, accuracy, consistency and correlation indexes; C_i, A_i, S_i and R_i denote the integrity, accuracy, consistency and correlation index values of the i-th preprocessed data item; and the remaining terms eliminate the dimensional differences between the indexes, adjust the sensitivity of the overall evaluation result, normalize the comprehensive evaluation result, and act as an adjusting factor balancing the importance of each index. When the comprehensive evaluation value of a preprocessed data item is smaller than the evaluation threshold, the item is judged to be residual data. The integrity, accuracy, consistency and correlation index weights are computed by formula (2), in which C_max and C_min are the maximum and minimum of the preprocessed data integrity index values, A_max and A_min the maximum and minimum of the accuracy index values, S_max and S_min the maximum and minimum of the consistency index values, and R_max and R_min the maximum and minimum of the correlation index values.
Further, the self-supervised representation learning model adopts a deep belief network to perform secondary processing of the residual data: the deep belief network captures data features and rules through layer-by-layer training on the preprocessed data, generates data feature labels based on those features and rules, and fills in and replaces the residual data based on the data feature labels.
Further, the feature extraction unit extracts high-dimensional feature vectors from the secondarily processed data through a convolutional neural network, the feature dimension reduction unit reduces the extracted high-dimensional feature vectors to low dimensions through principal component analysis, the feature fusion unit fuses the low-dimensional feature sequences of different modalities through stacked autoencoders, and the feature division unit divides the data features into a training set, a verification set and a test set by stratified random sampling; the output end of the feature extraction unit is connected with the input end of the feature dimension reduction unit, the output end of the feature dimension reduction unit is connected with the input end of the feature fusion unit, and the output end of the feature fusion unit is connected with the input end of the feature division unit.
Further, the adaptive optimal decision model comprises an input layer, a feature selection layer, a data reconstruction layer, an adaptive parameter adjustment layer, a functional attribute matching layer, a parameter attribute matching layer and an output layer, and the working method of the adaptive optimal decision model comprises the following steps:
inputting the data characteristics processed by the data characteristic processing module into the self-adaptive optimal decision model through an input layer;
Performing data feature selection through the feature selection layer and reconstructing the time sequence of the selected features through the data reconstruction layer, wherein the feature selection layer evaluates the correlation between the input features and the data mining task by calculating correlation coefficients between the input features and the response variables of the data mining task, selects feature variables whose correlation exceeds a threshold as the selected features through a chi-square test, and performs real-time updating of newly added data features and timed deletion of historical data features through an incremental learning window and a sliding window, while the data reconstruction layer reconstructs the time sequence of the selected features through differencing, smoothing, filtering and standardization operations;
Creating a model set according to the selected characteristics and the time sequence processed by the data reconstruction layer, and adjusting model parameters through a self-adaptive parameter adjusting layer, wherein the self-adaptive parameter adjusting layer automatically adjusts the parameter configuration of the model based on inertia weight, adjusts the data mining model parameters to be optimal through an iterative search mode, sets control parameters and target parameters according to a historical optimal solution, and dynamically adjusts a neighborhood search range based on different neighborhood search strategies;
the functional attribute matching layer performs functional attribute matching selection on the models in the model set through a regular expression, the regular expression compares and matches the identified data mining task with the model set type to complete functional attribute screening of the initial data mining model, the parameter attribute matching layer performs parameter attribute matching selection on the data mining model according to a characteristic space fitting mode, and the characteristic space fitting mode performs parameter comparison and matching on the selected data features and the screened model to complete parameter attribute screening of the initial data mining model;
and the output layer outputs the selected model as an optimal decision scheme.
Further, the incremental learning window inputs newly added data points in an incremental calculation mode: the newly added data points are appended to the data set within a time window, and the data set within the time window updates and iterates the initial data mining model in real time to obtain a data mining model trained on the newly added data points.
Further, the inertia weight is adaptively updated according to the global historical optimal fitness value and the iteration schedule. The model set is M = {m_1, m_2, …, m_p}, where p is the number of models in the model set and m_j is the j-th model in the set; the parameter set of the j-th model is P_j = {p_j1, p_j2, …, p_jq}, where p_jk is the k-th parameter of the j-th model and q is the number of parameters of the j-th model. The inertia weight output function is given by formula (3), in which h_j is the j-th weighting auxiliary value, g(·) is the weighting auxiliary function, d_max is the maximum value of the disturbed intensity of the digital communication signal, p_jmax and p_jmin are the maximum and minimum parameter values of the j-th model in the model set, ω_j(t) is the inertia weight of the j-th model at the t-th iteration, ω_max is the initial maximum inertia weight, ω_min is the initial minimum inertia weight, t is the current iteration number, and T is the total number of iterations.
Further, the component dynamic simulation test mechanism decomposes the constructed data mining model into a plurality of components through a component object model, simulates the operation and response process of the data mining model with a component simulator, defines the interface interaction mode and parameter transmission process among the components with the interface definition language IDL, monitors the execution process of the data mining model and the change state of variables with a component debugger, and automatically generates an operation event report of the data mining model through a component logger.
Further, a machine learning-based data mining model construction system, the data mining model construction system comprising:
The data acquisition module adopts an application program access interface and a data sensor network to acquire multi-source data in real time;
The data preprocessing module is used for carrying out data cleaning and data conversion processing on the acquired data through the real-time stream processing engine;
The data quality evaluation module comprehensively evaluates the integrity, accuracy, consistency and correlation of the data through the multivariate comprehensive analysis model, and performs secondary processing on residual data through the self-supervised representation learning model;
The data characteristic processing module comprises a characteristic extraction unit, a characteristic dimension reduction unit, a characteristic fusion unit and a characteristic division unit;
The model selection module is used for selecting an optimal initial data mining model by adopting a self-adaptive optimal decision model, and the self-adaptive optimal decision model selects functional attributes and model parameter attributes of the initial data mining model according to data characteristics and target tasks;
The model training module adopts a training set sample to carry out model training;
The model test evaluation module adopts the component dynamic simulation test mechanism and test set samples to perform simulation tests on the constructed model, and adopts the Plotly visualization platform to visually display the simulation test process of the model;
The model optimization module adopts feedback adjustment and verification set samples to adjust and update model parameters;
The output end of the data acquisition module is connected with the input end of the data preprocessing module, the output end of the data preprocessing module is connected with the input end of the data quality evaluation module, the output end of the data quality evaluation module is connected with the input end of the data characteristic processing module, the output end of the data characteristic processing module is connected with the input end of the model selection module, the output end of the data characteristic processing module is connected with the input end of the model training module, the output end of the model selection module is connected with the input end of the model training module, the output end of the model training module is connected with the input end of the model test evaluation module, the output end of the model test evaluation module is connected with the input end of the model optimization module, and the output end of the model training module is in bidirectional connection with the model optimization module.
Further, the Plotly visualization platform obtains the associated data in the model simulation test process based on an associated data model, displays the change trend and association relations of the data in real time with interactive charts, heat maps and dashboards, and verifies the identity of accessing users with a Token-based user authentication mechanism.
The beneficial effects of the invention are as follows:
1. the method and the device acquire rich data from different data sources through the application program access interface and the data sensor network, improve the input dimension of the model, simultaneously, adopt the real-time stream processing engine to instantly clean and convert the acquired data, improve the efficiency and accuracy of data processing, and improve the adaptability of the model to real scenes.
2. According to the invention, the preprocessed data is comprehensively evaluated through the multivariate comprehensive analysis model, problems in the data are effectively identified, and the residual data is subjected to secondary processing through the self-supervised representation learning model, so that automated data quality evaluation and screening are realized, data quality is further improved, and the reliability of the constructed model is ensured.
3. According to the invention, a performance simulation test is carried out on the constructed model through the component dynamic simulation test mechanism and test set samples, the performance of the model under different scenarios is comprehensively evaluated, the simulated operation process is visually displayed using the Plotly visualization platform, and the operation of the model can be intuitively understood and analyzed, so that the adaptability of the constructed model is enhanced.
4. According to the invention, the optimal initial data mining model is selected through the self-adaptive optimal decision model, and the initial model is continuously adjusted according to actual requirements, so that the model performance and generalization capability are improved, and the constructed data mining model is more accurate, robust and reliable.
5. According to the invention, an initial data mining model is selected based on a mining task recognition result, and the fusion characteristics are divided into a training set sample and a test set sample by adopting random sampling, so that the data is effectively utilized for model selection and training. Meanwhile, the new training set samples are updated in real time through the incremental learning window, so that the adaptability of the model to new data is ensured.
6. The method extracts meaningful and representative features from the data through the steps of feature extraction, mining task identification, feature selection and feature fusion, which reduces redundant information, improves the model training effect and allows the method to adapt to different mining tasks.
Drawings
FIG. 1 is a schematic flow chart of the overall method of the present invention;
FIG. 2 is a schematic diagram of a data feature processing module according to the present invention;
FIG. 3 is a schematic workflow diagram of an adaptive optimal decision model according to the present invention;
FIG. 4 is a schematic diagram of an adaptive optimal decision model according to the present invention;
FIG. 5 is a diagram illustrating an overall system architecture according to the present invention.
Detailed Description
The foregoing and other features, aspects and advantages of the present application will become more apparent from the following detailed description of the embodiments, which proceeds with reference to the accompanying FIGS. 1 to 5. The embodiments of the present application and the features in the embodiments may be combined with each other, and the terms used in the specification have the meanings commonly understood by those skilled in the art to which the present application pertains.
The embodiment of the invention discloses a data mining model construction method based on machine learning, which, as shown in FIG. 1, comprises the following steps:
The method comprises the steps that firstly, data acquisition and preprocessing are carried out through a data acquisition module and a data preprocessing module, the data acquisition module adopts an application program access interface and a data sensor network to acquire multi-source data in real time, the data preprocessing module carries out data cleaning and data conversion on the acquired data through a real-time stream processing engine, and the multi-source data at least comprises enterprise file data, database data and service system data;
Step two, data quality assessment is performed on the preprocessed data through the data quality assessment module: the data quality assessment module adopts the multivariate comprehensive analysis model to comprehensively evaluate data integrity, accuracy, consistency and correlation, performs secondary processing on the residual data identified by the assessment through the self-supervised representation learning model, and adopts multithreaded parallel computing to improve data quality assessment efficiency;
Step three, carrying out feature extraction and fusion on the data subjected to secondary processing through a data feature processing module, and dividing the data features into a training set, a verification set and a test set, wherein the data feature processing module comprises a feature extraction unit, a feature dimension reduction unit, a feature fusion unit and a feature division unit;
Selecting a data mining model through a model selection module, training the selected initial data mining model through a model training module, selecting an optimal initial data mining model by the model selection module through an adaptive optimal decision model, selecting functional attributes and model parameter attributes of the initial data mining model according to data characteristics and target tasks by the adaptive optimal decision model, and training the model through a training set sample by the model training module;
Step five, performing a performance simulation test on the trained model through a model test evaluation module, wherein the model test evaluation module adopts a component dynamic simulation test mechanism and test set samples to simulate the operation process of the constructed model, and adopts the Plotly visualization platform to visually display the simulated operation process;
And step six, optimizing the constructed model through a model optimizing module, wherein the model optimizing module adjusts and updates model parameters in a feedback adjustment mode.
The multivariate comprehensive analysis model realizes comprehensive evaluation of data quality by weighting different evaluation indexes. The preprocessed data set is D = {d_1, d_2, …, d_n}, where n is the number of preprocessed data items and d_i is the i-th preprocessed data item, whose integrity, accuracy, consistency and correlation index values are C_i, A_i, S_i and R_i; C denotes the integrity index, A the accuracy index, S the consistency index and R the correlation index. The comprehensive evaluation output for the i-th preprocessed data item is given by formula (1), in which F_i denotes the comprehensive evaluation value of the integrity, accuracy, consistency and correlation indexes of the i-th preprocessed data item; w_C, w_A, w_S and w_R denote the weights of the integrity, accuracy, consistency and correlation indexes; C_i, A_i, S_i and R_i denote the integrity, accuracy, consistency and correlation index values of the i-th preprocessed data item; and the remaining terms eliminate the dimensional differences between the indexes, adjust the sensitivity of the overall evaluation result, normalize the comprehensive evaluation result, and act as an adjusting factor balancing the importance of each index. When the comprehensive evaluation value of a preprocessed data item is smaller than the evaluation threshold, the item is judged to be residual data. The integrity, accuracy, consistency and correlation index weights are computed by formula (2), in which C_max and C_min are the maximum and minimum of the preprocessed data integrity index values, A_max and A_min the maximum and minimum of the accuracy index values, S_max and S_min the maximum and minimum of the consistency index values, and R_max and R_min the maximum and minimum of the correlation index values.
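As an illustration of the weighted multi-index evaluation described above, the following is a minimal Python sketch; the min-max based weighting, the sigmoid-style normalization and the evaluation threshold are assumptions for demonstration rather than the exact formulas (1) and (2) of the patent.

```python
import numpy as np

def quality_scores(C, A, S, R, threshold=0.5):
    """Hypothetical weighted comprehensive evaluation of data quality.

    C, A, S, R: 1-D arrays of integrity, accuracy, consistency and
    correlation index values for the n preprocessed data items.
    Returns the comprehensive evaluation values F and a boolean mask
    marking items judged as residual data (F < threshold)."""
    indexes = np.vstack([C, A, S, R]).astype(float)           # shape (4, n)

    # Assumed min-max based weights: indexes with a wider spread get more weight.
    spreads = indexes.max(axis=1) - indexes.min(axis=1)
    weights = spreads / spreads.sum()                          # w_C, w_A, w_S, w_R

    # Eliminate dimensional differences by rescaling each index to [0, 1].
    denom = np.where(spreads == 0, 1.0, spreads)
    scaled = (indexes - indexes.min(axis=1, keepdims=True)) / denom[:, None]

    # Weighted linear combination followed by an assumed sigmoid normalization.
    combined = weights @ scaled
    F = 1.0 / (1.0 + np.exp(-4.0 * (combined - 0.5)))          # sensitivity factor 4.0 assumed

    return F, F < threshold

# Example: 5 data items with toy index values.
rng = np.random.default_rng(0)
C, A, S, R = rng.random((4, 5))
F, residual_mask = quality_scores(C, A, S, R)
print("evaluation values:", np.round(F, 3))
print("residual data items:", np.where(residual_mask)[0])
```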
Specifically, the multivariate comprehensive analysis model is a method for weighted comprehensive evaluation based on a plurality of indicators. Its core technical essence is a weighted linear combination: each index is given a weight and the weighted index scores are accumulated, yielding a comprehensive evaluation value for the data. In practical application, the multivariate comprehensive analysis model also involves problems such as index selection, weight assignment and method selection, and factors such as the applicability of the indexes, the fairness of the weights and the effectiveness of the evaluation method must be considered so that a relatively accurate data quality evaluation result can be obtained. The hardware working environment of the multivariate comprehensive analysis model includes the following:
Server: multivariate comprehensive analysis methods generally require extensive computation and processing, and therefore require servers with powerful computing power and storage capacity. These servers may be stand-alone or clustered and undertake tasks such as model training, reasoning and data processing.
GPU acceleration card: because tasks such as deep learning have high computational power requirements, using a GPU graphics processor for acceleration is a common option. The GPU accelerator card can be used with a server to provide faster computing speed and higher parallel processing capability.
Mass storage system: the multivariate analysis-by-synthesis model may need to handle large amounts of data and therefore needs to have sufficient storage capacity and high-speed read-write capability. Large-scale storage systems such as hard disk arrays, network attached storage, or object storage may meet this need.
Data sensor network: during the data acquisition process, a sensor network may be involved. The sensors are connected with the computing device through hardware equipment, and collected data are transmitted to the computing device for processing.
Network equipment: in order to ensure the efficiency and the safety of data transmission, a stable and reliable network environment is required to be provided in a multi-element comprehensive analysis model. This includes network devices such as routers, switches, firewalls, etc., as well as network connections that are sufficiently broadband and low latency.
In summary, the hardware working environment of the multivariate comprehensive analysis model comprises a server, a GPU accelerator card, a large-scale storage system, a data sensor network, network equipment and the like. The hardware devices cooperate together to provide a strong computing capacity, efficient data processing capacity and a stable and reliable network environment for the multi-element comprehensive analysis model.
A comparison experiment was carried out using the multivariate comprehensive analysis model (group A) and data quality rules (group B). Data were collected and preprocessed to ensure a certain degree of variability and diversity, the dataset was randomly divided into two groups of 30 data items each, and the data characteristics of the two experimental groups were kept similar. For one group of data, the comprehensive evaluation value of the integrity, accuracy, consistency and correlation indexes of each preprocessed data item was calculated according to formulas (1) and (2) of the multivariate comprehensive analysis model described above, and residual data were determined based on the evaluation threshold. For the other group, threshold ranges for integrity, accuracy, consistency and correlation were defined according to the data quality rules, and whether an item is residual data was judged according to those rules. The experiments were performed during the same time period under the different treatments of groups A and B, and the number of items judged as residual data was recorded for each group. The experiment was repeated ten times with different data, the number of items judged as residual data in each group was counted and compared with the number of true residual data items, and the similarity results are reported in Table 1.
Table 1 results statistics table
According to the table, comparing the experimental results of group A and group B, the similarity between the residual data judged by group A and the true residual data is 80%, while that of group B is 60%, so the result of group A is closer to the true residual data. Group A therefore performs better in the determination of residual data.
The self-supervised representation learning model adopts a deep belief network to perform secondary processing of the residual data: the deep belief network captures data features and rules through layer-by-layer training on the preprocessed data, generates data feature labels based on those features and rules, and fills in and replaces the residual data based on the data feature labels.
Specifically, the self-supervised representation learning model is a machine learning method based on unsupervised learning; its technical essence is to learn representations of the original data in order to fill in and clean the data. The model uses prior knowledge and its own learning ability to discover high-level features and structures in the data, and then recovers and corrects the missing data. In data quality assessment and secondary processing, the self-supervised representation learning model can be used to process the residual data in the assessment results, thereby improving data integrity, accuracy and consistency. Adopting multithreaded parallel computing can effectively improve the speed and efficiency of data quality assessment and secondary processing.
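A minimal sketch of this idea is shown below, assuming scikit-learn is available; it stacks BernoulliRBM layers as a stand-in for a deep belief network and fills residual rows from the most similar complete record in the learned feature space, which is a simplification of the label-based filling described above.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

def fill_residual_with_dbn(X, residual_mask, layer_sizes=(32, 16)):
    """X: (n_samples, n_features) array scaled to [0, 1]; residual_mask marks
    rows judged as residual data. Features of residual rows are replaced by
    the values of the most similar complete row in DBN feature space."""
    complete = X[~residual_mask]

    # Layer-by-layer (greedy) training of stacked RBMs on the complete data.
    rbms, hidden = [], complete
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=20, random_state=0)
        hidden = rbm.fit_transform(hidden)
        rbms.append(rbm)

    def encode(data):
        for rbm in rbms:
            data = rbm.transform(data)
        return data

    codes_complete = encode(complete)
    codes_residual = encode(X[residual_mask])

    # For each residual row, copy the most similar complete row (assumed filling rule).
    filled = X.copy()
    for i, code in zip(np.where(residual_mask)[0], codes_residual):
        nearest = np.argmin(np.linalg.norm(codes_complete - code, axis=1))
        filled[i] = complete[nearest]
    return filled

# Toy example: 20 samples, 6 features, last 3 rows flagged as residual data.
rng = np.random.default_rng(1)
X = MinMaxScaler().fit_transform(rng.normal(size=(20, 6)))
mask = np.zeros(20, dtype=bool); mask[-3:] = True
X_filled = fill_residual_with_dbn(X, mask)
print(X_filled[-3:])
```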
In the data mining model construction process, a comparison experiment was carried out between adding the quality assessment and secondary processing steps (group A) and omitting them (group B). The performance of the models constructed in groups A and B was compared using accuracy, recall and F1 score over five experiments, and the comparison results are recorded in Table 2.
Table 2 results statistics table
From the above table, the model with the quality assessment and secondary processing steps added (group A) performed better in all five experiments, with higher accuracy and recall than the model without these steps (group B), and its F1 score was also significantly improved. This demonstrates that using the quality assessment and secondary processing steps can improve the performance of the model across different experiments. Meanwhile, the results of the five experiments in group A are stable and consistent, whereas the performance of the model in group B is relatively unstable, with certain fluctuations. In conclusion, adding the quality assessment and secondary processing steps can improve the performance of the data mining model, increasing its accuracy and stability.
As shown in FIG. 2, the feature extraction unit extracts high-dimensional feature vectors from the secondarily processed data through a convolutional neural network, the feature dimension reduction unit reduces the extracted high-dimensional feature vectors to low dimensions through principal component analysis, the feature fusion unit fuses the low-dimensional feature sequences of different modalities through stacked autoencoders, and the feature division unit divides the data features into a training set, a verification set and a test set by stratified random sampling; the output end of the feature extraction unit is connected with the input end of the feature dimension reduction unit, the output end of the feature dimension reduction unit is connected with the input end of the feature fusion unit, and the output end of the feature fusion unit is connected with the input end of the feature division unit.
Specifically, in the feature extraction unit, the secondarily processed data is first used as input, filtering operations are then performed through the convolution layers to extract features of different sizes and directions, each convolution layer output is downsampled through a pooling layer to reduce the number of parameters while retaining key information, and finally a high-dimensional feature vector is generated under the action of the activation function. Convolutional neural networks are used because they can automatically learn feature representations and can increase the complexity and expressive power of the model by stacking multiple convolutional layers.
The feature dimension reduction unit converts the high-dimensional feature vectors into low-dimensional feature vectors through the principal component analysis dimension reduction algorithm. Dimension reduction reduces the complexity of the data and the influence of noise and extracts more representative features. The mining task identification unit performs mining task classification and identification based on the reduced feature vectors and a support vector machine: the reduced feature vectors are used as input data to train the support vector machine. The support vector machine is a supervised learning algorithm whose goal is to find an optimal hyperplane separating samples of different classes; during training it finds the best classification boundary based on the features and class information of the input data. After training, the model can be used to classify new unknown samples by mining task: a new sample is represented as a reduced feature vector, classification prediction is performed by the trained support vector machine model, and the mining task category of the new sample is determined from the model output.
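A minimal sketch of this reduction-plus-classification step follows, assuming scikit-learn; the feature dimensionality, number of components and task labels are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Toy high-dimensional feature vectors and mining-task labels
# (0 = classification task, 1 = regression task, 2 = clustering task).
rng = np.random.default_rng(2)
features = rng.normal(size=(120, 64))          # 120 samples, 64-dim CNN features
task_labels = rng.integers(0, 3, size=120)

# PCA reduces the high-dimensional vectors, then an SVM identifies the mining task.
task_identifier = make_pipeline(PCA(n_components=8), SVC(kernel="rbf"))
task_identifier.fit(features, task_labels)

new_sample = rng.normal(size=(1, 64))
print("predicted mining task category:", task_identifier.predict(new_sample)[0])
```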
In the feature selection unit, first, a sample set for which classification recognition of a mining task has been performed is taken as training data, and features related to the mining task are extracted at the same time. And then adopting a priority Lasso regression algorithm to perform feature selection on the training data. Lasso regression is a linear regression method, and by introducing an L1 regularization term, partial characteristic coefficients become 0, so that automatic characteristic selection is realized. In priority Lasso regression, different weights (or constraint conditions) are set according to the priority of the recognition result of the mining task, so as to ensure that the characteristics matched with the recognition result are selected. And according to the Lasso regression result, the characteristic with the non-zero coefficient is regarded as the selected important characteristic. These important features match the recognition results of the mining tasks and have high predictive power. By screening out these features, data dimensionality and noise interference can be reduced, and the effect and efficiency of subsequent mining tasks can be improved.
By adopting a priority Lasso regression mode to perform feature selection, the feature selection unit can automatically select important features matched with the mining task according to the priority of the recognition result of the mining task. This can improve the accuracy and interpretability of the mining task and save computing resources and processing time.
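The following sketch shows Lasso-based feature selection with scikit-learn, assuming plain Lasso as a stand-in for the priority-weighted variant described above; the feature layout and regularization strength are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                       # 10 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

# L1 regularization drives the coefficients of irrelevant features to zero.
lasso = Lasso(alpha=0.1)
lasso.fit(StandardScaler().fit_transform(X), y)

selected = np.flatnonzero(lasso.coef_ != 0)
print("selected important features:", selected)      # expected: features 0 and 4
```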
In the feature fusion unit, the autoencoder is an unsupervised learning algorithm that reconstructs its input by compressing and then decompressing it. Stacking of autoencoders refers to connecting multiple autoencoders in a hierarchical structure to form a deep network. In the autoencoder stack, each autoencoder is responsible for learning and extracting a higher-level feature representation of the input data, and through layer-by-layer training and reconstruction the whole network finally reconstructs the input data. In this process, each layer's autoencoder learns a feature representation that differs from, but complements, those of the other layers.
The fusion of multi-source feature sequences through stacked autoencoders makes full use of information from data of different sources and performs automatic learning and extraction through the network structure. This fusion method can effectively reduce feature dimensionality, reduce noise interference and capture higher-level, more abstract feature representations.
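After fusion, the feature division unit splits the fused features; a minimal sketch of the stratified train/validation/test split follows, assuming scikit-learn and an illustrative 70/15/15 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
fused_features = rng.normal(size=(300, 16))          # fused low-dimensional features
labels = rng.integers(0, 2, size=300)                # class labels used for stratification

# First split off the test set, then split the remainder into train and validation,
# stratifying both times so class proportions are preserved in every subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    fused_features, labels, test_size=0.15, stratify=labels, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))         # roughly 210 / 45 / 45
```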
As shown in fig. 3 and fig. 4, the adaptive optimal decision model includes an input layer, a feature selection layer, a data reconstruction layer, an adaptive parameter adjustment layer, a functional attribute matching layer, a parameter attribute matching layer and an output layer, and the working method of the adaptive optimal decision model includes the following steps:
s1, inputting the data characteristics processed by the data characteristic processing module into the self-adaptive optimal decision model through an input layer;
S2, data features are selected through the feature selection layer and the time sequence of the selected features is reconstructed through the data reconstruction layer: the feature selection layer evaluates the correlation between the input features and the data mining task by calculating correlation coefficients between the input features and the response variables of the data mining task, selects feature variables whose correlation exceeds a threshold as the selected features through a chi-square test, and performs real-time updating of newly added data features and timed deletion of historical data features through an incremental learning window and a sliding window, while the data reconstruction layer reconstructs the time sequence of the selected features through differencing, smoothing, filtering and standardization operations;
S3, creating a model set according to the selected characteristics and the time sequence processed by the data reconstruction layer, and adjusting model parameters through a self-adaptive parameter adjusting layer, wherein the self-adaptive parameter adjusting layer automatically adjusts the parameter configuration of the model based on inertia weight, adjusts the data mining model parameters to be optimal through an iterative search mode, and sets control parameters and target parameters according to a historical optimal solution, and dynamically adjusts a neighborhood search range based on different neighborhood search strategies;
The inertia weight is adaptively updated according to the global historical optimal fitness value and the iteration schedule. The model set is M = {m_1, m_2, …, m_p}, where p is the number of models in the model set and m_j is the j-th model in the set; the parameter set of the j-th model is P_j = {p_j1, p_j2, …, p_jq}, where p_jk is the k-th parameter of the j-th model and q is the number of parameters of the j-th model. The inertia weight output function is given by formula (3), in which h_j is the j-th weighting auxiliary value, g(·) is the weighting auxiliary function, d_max is the maximum value of the disturbed intensity of the digital communication signal, p_jmax and p_jmin are the maximum and minimum parameter values of the j-th model in the model set, ω_j(t) is the inertia weight of the j-th model at the t-th iteration, ω_max is the initial maximum inertia weight, ω_min is the initial minimum inertia weight, t is the current iteration number, and T is the total number of iterations.
S4, matching functional attributes and parameter attributes, wherein the functional attribute matching layer performs functional attribute matching selection on the models in the model set through a regular expression, the regular expression compares and matches the identified data mining task with the model set type to complete functional attribute screening of an initial data mining model, the parameter attribute matching layer performs parameter attribute matching selection on the data mining model according to a characteristic space fitting mode, and the characteristic space fitting mode performs parameter comparison and matching on the selected data features and the screened models to complete parameter attribute screening of the initial data mining model;
And S5, the output layer outputs the selected model as an optimal decision scheme.
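To make S2 and S4 concrete, here is a minimal sketch of chi-square feature selection and regular-expression based functional-attribute matching, assuming scikit-learn; the model-set names and the task-to-pattern mapping are illustrative.

```python
import re
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# --- S2: chi-square test keeps the features most related to the mining task ---
rng = np.random.default_rng(5)
X = rng.integers(0, 10, size=(150, 6)).astype(float)   # chi2 requires non-negative features
y = (X[:, 2] > 5).astype(int)                          # response variable of the mining task
selector = SelectKBest(chi2, k=3).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))

# --- S4: regular expressions match the identified task to candidate model types ---
model_set = ["decision_tree_classifier", "kmeans_cluster", "linear_regressor",
             "random_forest_classifier", "apriori_association"]
task_patterns = {                                       # assumed task-to-pattern mapping
    "classification": r"classifier$",
    "clustering": r"cluster$",
    "regression": r"regressor$",
    "association": r"association$",
}
identified_task = "classification"
candidates = [m for m in model_set if re.search(task_patterns[identified_task], m)]
print("functionally matched models:", candidates)
```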
Specifically, the initial population size is set to 100, the maximum number of iterations is 200, the inertia weight is 0.8, and the neighborhood search parameter is 3. The adaptive parameter adjustment specifically comprises the following steps (a minimal sketch of this loop is given after the steps):
s4.1, initializing parameters including an initial speed and an initial position;
S4.2, adaptively updating the inertia weight according to the global history optimal fitness value and the iteration mode;
S4.3, performing speed update and position update, and performing boundary limitation;
S4.4, calculating and updating self-adaptation factors and social adaptation factors;
S4.5, calculating the adaptability of the current particles according to the adaptability function;
s4.6, updating the historical optimal position and the historical optimal fitness;
s4.7, when a preset termination condition or maximum iteration times are reached, jumping to S5; otherwise, the next iteration is performed.
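The sketch below illustrates S4.1 to S4.7 as a particle-swarm style search with a linearly decreasing inertia weight, assuming a simple sphere fitness function and the population/iteration settings quoted above; it is a schematic of the iterative search, not the exact update rules of formula (3).

```python
import numpy as np

def adaptive_parameter_search(fitness, dim, pop=100, iters=200,
                              w_max=0.8, w_min=0.4, bounds=(-5.0, 5.0)):
    """PSO-style iterative search: inertia weight decreases from w_max to w_min."""
    rng = np.random.default_rng(6)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(pop, dim))            # S4.1 initial positions
    vel = np.zeros((pop, dim))                            # S4.1 initial velocities
    pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_fit)].copy()

    for t in range(iters):
        w = w_max - (w_max - w_min) * t / iters           # S4.2 inertia weight update
        r1, r2 = rng.random((2, pop, dim))
        vel = (w * vel + 1.5 * r1 * (pbest - pos)
               + 1.5 * r2 * (gbest - pos))                # S4.3 velocity update (fixed factors assumed)
        pos = np.clip(pos + vel, lo, hi)                  # S4.3 position update and boundary limit
        fit = np.array([fitness(p) for p in pos])         # S4.5 fitness evaluation
        improved = fit < pbest_fit                        # S4.6 update historical best
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmin(pbest_fit)].copy()
    return gbest, pbest_fit.min()                         # S4.7 stop after max iterations

# Example: tune two hypothetical model parameters by minimizing a sphere function.
best_params, best_fit = adaptive_parameter_search(lambda p: float(np.sum(p ** 2)), dim=2)
print("best parameters:", np.round(best_params, 4), "fitness:", round(best_fit, 6))
```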
The hardware operating environment for the adaptive optimal decision model includes, but is not limited to, the following:
1. Mass memory and high speed storage devices: the adaptive optimal decision model needs to process a large amount of data and model parameters, so that a large-scale memory and high-speed storage equipment are needed to ensure efficient operation and optimization process of the model.
2. High performance computing device: the decision model requires extensive computation and analysis and high performance computing devices to accelerate the model computation and optimization process, such as GPU accelerator cards.
3. High-speed network transmission: the decision model needs to perform network data transmission and communication, and needs to have high-speed, stable and safe network transmission capability so as to ensure efficient operation and optimization process of the model.
4. Multi-CPU/multi-core processor: the adaptive optimal decision model needs to perform a large amount of parallel computation, and needs to have the capability of high-concurrency and multi-CPU/multi-core processors so as to improve the computation efficiency and the optimization process of the model.
5. Operating system and middleware: a stable and mature operating system and middleware, such as Linux, Windows, Java or Python environments, should be selected to ensure the stability and efficiency of model operation.
After the data mining model is built, a selection optimization operation using the adaptive optimal decision model (group A) and a selection optimization operation using a rule matching model (group B) were carried out, and the adaptability of the data mining models constructed in groups A and B was compared. First, the data mining accuracy of the model constructed in group B on five different data sets was recorded as the reference, and then the data mining accuracy of the model constructed in group A on the same five data sets was recorded. The data are reported in Table 3.
Table 3 results statistics table
From the data in table 3, it can be seen that, in all five sets of data, the accuracy of the a set is higher than the reference accuracy of the B set, which indicates that the initial data mining model selection by adopting the adaptive optimal decision model can improve the accuracy of constructing the data mining model and improve the adaptability of the data mining model to the data.
The incremental learning window inputs newly added data points in an incremental calculation mode: the newly added data points are appended to the data set within a time window, and the data set within the time window updates and iterates the initial data mining model in real time to obtain a data mining model trained on the newly added data points.
In particular, when new data points are added to the data set within the time window, the new data points may first be classified or predicted using the original data mining model. And then, comparing the real classification or prediction result of the newly added data point with the original classification or prediction result to obtain an error value of classification or prediction. The error values are then used to adjust the original data mining model so that it can better handle the new data. This process may be iterated multiple times to achieve real-time updating and iteration of the data mining model. The incremental learning method can enable the data mining model to gradually adapt to new data, so that the classification or prediction accuracy of the data mining model is continuously improved. Meanwhile, as the data set in the time window is limited, the method can also control the size of the data set, avoid the situation of over fitting and improve the robustness and generalization capability of the model.
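A minimal sketch of this incremental update follows, assuming scikit-learn's partial_fit interface; the window length and the streaming data are illustrative.

```python
from collections import deque
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)
window = deque(maxlen=200)                      # time window bounding the data set size
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Simulated stream: each batch of newly added data points updates the model in place.
for step in range(10):
    X_new = rng.normal(size=(20, 5))
    y_new = (X_new[:, 0] + 0.1 * rng.normal(size=20) > 0).astype(int)
    window.extend(zip(X_new, y_new))            # append new points, drop the oldest

    X_win = np.array([x for x, _ in window])
    y_win = np.array([y for _, y in window])
    model.partial_fit(X_win, y_win, classes=classes)   # incremental update, no full retrain

    acc = model.score(X_new, y_new)             # accuracy on the newest points
    print(f"step {step}: window={len(window)} new-batch accuracy={acc:.2f}")
```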
The component dynamic simulation test mechanism is used for decomposing the constructed data mining model into a plurality of components through a component object model, simulating the operation and response process of the data mining model by adopting a component simulator, defining an interface interaction mode and a parameter transmission process among the plurality of components by adopting an interface definition language IDL, monitoring the execution process of the data mining model and the change state of variables by adopting a component debugger, and automatically generating an operation event report of the data mining model through a component log recorder.
Specifically, the component dynamic simulation test mechanism decomposes the constructed data mining model into a plurality of components through a component object model and simulates the operation and response processes of the data mining model with a component simulator. A simulation experiment was carried out in an experimental environment whose operating system was Windows 10. The specific implementation is as follows:
Decomposition into multiple components: the constructed data mining model is decomposed according to its functions and tasks into a number of independent components, each of which is responsible for a specific function such as feature processing or a classification algorithm.
Component simulator: a corresponding component simulator is designed and implemented for each component. The simulator is capable of receiving input data and performing calculations and processing according to predetermined rules and algorithms to simulate the operation and response of the data mining model in a real environment.
Interface definition language IDL: an interface definition language IDL is employed to define the interface interactions and parameter passing procedures between the data mining components. IDL specifies the interface methods, input-output parameters, etc. provided by each component, ensuring that the components can communicate and cooperate properly.
Component debugger: component debuggers are used to monitor the state of changes in the execution and variables of the data mining model. Intermediate results and state changes generated by each component during running can be observed through the debugger, so that problems can be found and debugging can be facilitated.
Component log logger: to generate a running event report for the data mining model, a component logger is introduced. The recorder can automatically record the running condition, input and output data and other relevant information of each component, and is convenient for subsequent analysis and report generation.
By adopting a component dynamic simulation test mechanism, the data mining model can be flexibly and accurately tested and debugged. Each component can be independently tested, and the function verification of the whole model is realized through an interface interaction mode. Meanwhile, problems and characteristics in the model executing process can be better understood through the aid of a debugger and a log recorder.
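A minimal Python sketch of this component-based simulation test idea follows; the component interface, the simulator and the logging format are assumptions for illustration, not the COM/IDL tooling named above.

```python
import logging
from typing import Protocol

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")

class Component(Protocol):                       # assumed common interface between components
    def run(self, data: list[float]) -> list[float]: ...

class FeatureProcessor:
    def __init__(self) -> None:
        self.log = logging.getLogger("FeatureProcessor")
    def run(self, data: list[float]) -> list[float]:
        self.log.info("normalizing %d values", len(data))      # component logger
        peak = max(abs(v) for v in data) or 1.0
        return [v / peak for v in data]

class ThresholdClassifier:
    def __init__(self) -> None:
        self.log = logging.getLogger("ThresholdClassifier")
    def run(self, data: list[float]) -> list[float]:
        self.log.info("classifying %d values", len(data))
        return [1.0 if v > 0.5 else 0.0 for v in data]

def simulate(components: list[Component], data: list[float]) -> list[float]:
    """Component simulator: feeds data through each component and reports each step."""
    for comp in components:
        data = comp.run(data)
        logging.getLogger("simulator").info("%s -> %s", type(comp).__name__, data)
    return data

result = simulate([FeatureProcessor(), ThresholdClassifier()], [0.2, 1.8, -0.9, 1.1])
print("simulated model output:", result)
```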
A data mining model construction system based on machine learning, as shown in FIG. 5, comprises:
The data acquisition module adopts an application program access interface and a data sensor network to acquire multi-source data in real time;
The data preprocessing module is used for carrying out data cleaning and data conversion processing on the acquired data through the real-time stream processing engine;
The data quality evaluation module comprehensively evaluates the integrity, accuracy, consistency and correlation of the data through the multivariate comprehensive analysis model, and performs secondary processing on residual data through the self-supervised representation learning model;
The data characteristic processing module comprises a characteristic extraction unit, a characteristic dimension reduction unit, a characteristic fusion unit and a characteristic division unit;
The model selection module is used for selecting an optimal initial data mining model by adopting a self-adaptive optimal decision model, and the self-adaptive optimal decision model selects functional attributes and model parameter attributes of the initial data mining model according to data characteristics and target tasks;
The model training module adopts a training set sample to carry out model training;
The model test evaluation module adopts the component dynamic simulation test mechanism and test set samples to perform simulation tests on the constructed model, and adopts the Plotly visualization platform to visually display the simulation test process of the model;
The model optimization module adopts feedback adjustment and verification set samples to adjust and update model parameters;
The output end of the data acquisition module is connected with the input end of the data preprocessing module, the output end of the data preprocessing module is connected with the input end of the data quality evaluation module, the output end of the data quality evaluation module is connected with the input end of the data characteristic processing module, the output end of the data characteristic processing module is connected with the input end of the model selection module, the output end of the data characteristic processing module is connected with the input end of the model training module, the output end of the model selection module is connected with the input end of the model training module, the output end of the model training module is connected with the input end of the model test evaluation module, the output end of the model test evaluation module is connected with the input end of the model optimization module, and the output end of the model training module is in bidirectional connection with the model optimization module.
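For illustration only, the module chain described above can be sketched as a sequential pipeline in which each module's output feeds the next module's input; the stage functions below are hypothetical placeholders rather than the actual modules.

```python
# Illustrative sketch only: the module chain expressed as a sequential pipeline;
# the stage functions are hypothetical placeholders, not the claimed modules.
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(stages: List[Stage], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pass the payload from each module's output end to the next input end."""
    for stage in stages:
        payload = stage(payload)
    return payload


# hypothetical placeholder stages mirroring the module order in the system
def acquire(p):          p["raw"] = [1.0, 2.0, None, 4.0]; return p
def preprocess(p):       p["clean"] = [x for x in p["raw"] if x is not None]; return p
def assess_quality(p):   p["quality_ok"] = len(p["clean"]) >= 3; return p
def extract_features(p): p["features"] = [x / max(p["clean"]) for x in p["clean"]]; return p


if __name__ == "__main__":
    result = run_pipeline([acquire, preprocess, assess_quality, extract_features], {})
    print(result["features"])  # [0.25, 0.5, 1.0]
```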
The visualization platform Plotly obtains the associated data in the model simulation test process based on an associated data model, displays the change trends and association relationships of the data in real time through interactive charts, heat maps and dashboards, and verifies the identity of accessing users through a Token user identity verification mechanism.
Specifically, the visualization platform Plotly obtains the associated data in the model simulation test process based on the associated data model, and displays the change trends and association relationships of the data in real time through interactive charts, heat maps and dashboards. The platform employs a Token user identity verification mechanism to authenticate accessing users. Token authentication is a common method that verifies a user's identity by means of a token: when using the Plotly visualization platform, a user must provide a valid token for authentication, so that only authorized users can access and operate the related data and functions. In this way, the Plotly visualization platform protects the security and privacy of the data and ensures that only authorized users can view and manipulate it. The authentication mechanism effectively prevents unauthorized access and improves the security and credibility of the system.
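A minimal sketch of this idea is shown below, assuming a pre-issued token store and a simple trend chart; it uses the public Plotly Python API for the chart, while the token check itself is a simplified stand-in for the platform's actual Token verification mechanism.

```python
# Illustrative sketch only: rendering a trend chart with Plotly and gating it
# behind a simple token check; the token store and chart contents are
# hypothetical, not the platform's actual authentication implementation.
import hmac
import plotly.graph_objects as go

VALID_TOKENS = {"demo-user": "s3cr3t-token"}  # assumption: pre-issued tokens


def is_authorized(user: str, token: str) -> bool:
    expected = VALID_TOKENS.get(user, "")
    return hmac.compare_digest(expected, token)


def render_trend(user: str, token: str, steps, scores) -> None:
    if not is_authorized(user, token):
        raise PermissionError("invalid token: access to simulation data denied")
    fig = go.Figure(go.Scatter(x=list(steps), y=list(scores), mode="lines+markers"))
    fig.update_layout(title="Model simulation test: score trend",
                      xaxis_title="test step", yaxis_title="evaluation score")
    fig.write_html("simulation_trend.html")  # interactive chart for the dashboard


if __name__ == "__main__":
    render_trend("demo-user", "s3cr3t-token", range(1, 6), [0.61, 0.68, 0.70, 0.74, 0.78])
```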
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims (9)

1. The data mining model construction method based on machine learning is characterized by comprising the following steps:
The method comprises the steps that data are collected and preprocessed through a data collection module and a data preprocessing module, the data collection module collects multi-source data in real time through an application program access interface and a data sensor network, the data preprocessing module cleans and converts the collected data through a real-time stream processing engine, and the multi-source data at least comprise enterprise file data, database data and business system data;
The data quality assessment module carries out data quality assessment on the preprocessed data, adopts a multi-element comprehensive analysis model to comprehensively evaluate data integrity, accuracy, consistency and correlation, carries out secondary processing on the residual data identified by the assessment through a self-supervised characterization learning model, and adopts a multithreaded parallel computing mode to improve data quality assessment efficiency;
The multi-element comprehensive analysis model realizes the comprehensive evaluation of the data quality by weighting different evaluation indexes; the data set of the preprocessed data is $D=\{x_1,x_2,\dots,x_N\}$, wherein $N$ is the number of the preprocessed data and $x_i$ represents the $i$-th preprocessed data; the integrity, accuracy, consistency and correlation indexes of $x_i$ are respectively $I_i$, $A_i$, $C_i$ and $R_i$, wherein $I_i$ denotes the integrity index, $A_i$ denotes the accuracy index, $C_i$ denotes the consistency index and $R_i$ denotes the correlation index; the comprehensive evaluation output function of the integrity, accuracy, consistency and correlation indexes of the $i$-th preprocessed data is:

$$Q_i=\frac{\eta}{Z}\left(\frac{w_I I_i+w_A A_i+w_C C_i+w_R R_i}{s}\right)^{\beta}\tag{1}$$

In formula (1), $Q_i$ denotes the comprehensive evaluation value of the integrity, accuracy, consistency and correlation indexes of the $i$-th preprocessed data; $w_I$, $w_A$, $w_C$ and $w_R$ denote the weights of the integrity, accuracy, consistency and correlation indexes; $I_i$, $A_i$, $C_i$ and $R_i$ denote the integrity, accuracy, consistency and correlation index values of the $i$-th preprocessed data; $s$ is a scale factor used for eliminating the dimensional differences between the indexes; $\beta$ is used for adjusting the sensitivity of the overall evaluation result; $Z$ is used for normalizing the comprehensive evaluation result; $\eta$ denotes an adjusting factor used for balancing the importance of each index; when the comprehensive evaluation value of the integrity, accuracy, consistency and correlation indexes of a preprocessed datum is smaller than the evaluation threshold value, the datum is judged to be residual data; the weights of the integrity, accuracy, consistency and correlation indexes of the preprocessed data are calculated as:

$$w_I=\frac{I_{\max}-I_{\min}}{\Delta},\quad w_A=\frac{A_{\max}-A_{\min}}{\Delta},\quad w_C=\frac{C_{\max}-C_{\min}}{\Delta},\quad w_R=\frac{R_{\max}-R_{\min}}{\Delta},$$
$$\Delta=(I_{\max}-I_{\min})+(A_{\max}-A_{\min})+(C_{\max}-C_{\min})+(R_{\max}-R_{\min})\tag{2}$$

In formula (2), $I_{\max}$ and $I_{\min}$ denote the maximum and minimum values of the integrity index values of the preprocessed data, $A_{\max}$ and $A_{\min}$ denote the maximum and minimum values of the accuracy index values of the preprocessed data, $C_{\max}$ and $C_{\min}$ denote the maximum and minimum values of the consistency index values of the preprocessed data, and $R_{\max}$ and $R_{\min}$ denote the maximum and minimum values of the correlation index values of the preprocessed data;
The data characteristic processing module is used for carrying out characteristic extraction and fusion on the data subjected to secondary processing and dividing the data characteristics into a training set, a verification set and a test set, and comprises a characteristic extraction unit, a characteristic dimension reduction unit, a characteristic fusion unit and a characteristic division unit;
selecting a data mining model through a model selection module, training the selected initial data mining model through a model training module, selecting an optimal initial data mining model by the model selection module through an adaptive optimal decision model, selecting functional attributes and model parameter attributes of the initial data mining model according to data characteristics and target tasks by the adaptive optimal decision model, and training the model by a training set sample by the model training module;
Performing a performance simulation test on the trained model through a model test evaluation module, wherein the model test evaluation module adopts a component dynamic simulation test mechanism and test set samples to simulate the running process of the constructed model, and adopts the visualization platform Plotly to visually display the simulated running process;
And optimizing the constructed model by a model optimization module, wherein the model optimization module adjusts and updates model parameters in a feedback adjustment mode.
2. The machine learning-based data mining model construction method of claim 1, wherein the self-supervised characterization learning model performs secondary processing of residual data using a deep belief network that captures data features and rules by layer-by-layer training of the preprocessed data and generates data characteristic labels based on the data features and rules, the residual data being filled and replaced based on the data characteristic labels.
3. The machine learning-based data mining model construction method according to claim 1, wherein the feature extraction unit extracts high-dimensional feature vectors of the data after secondary processing through a convolutional neural network, the feature dimension reduction unit reduces the extracted high-dimensional feature vectors to a low dimension through a principal component analysis method, the feature fusion unit realizes fusion of low-dimensional feature sequences of different modes through self-encoder stacking, the feature division unit divides the data features into a training set, a verification set and a test set by adopting hierarchical random sampling, an output end of the feature extraction unit is connected with an input end of the feature dimension reduction unit, an output end of the feature dimension reduction unit is connected with an input end of the feature fusion unit, and an output end of the feature fusion unit is connected with an input end of the feature division unit.
4. The method for constructing a data mining model based on machine learning according to claim 1, wherein the adaptive optimal decision model comprises an input layer, a feature selection layer, a data reconstruction layer, an adaptive parameter adjustment layer, a functional attribute matching layer, a parameter attribute matching layer and an output layer, and the working method of the adaptive optimal decision model comprises the following steps:
inputting the data characteristics processed by the data characteristic processing module into the self-adaptive optimal decision model through an input layer;
The method comprises the steps that data feature selection is conducted through a feature selection layer and a time sequence of the selected features is reconstructed through a data reconstruction layer, wherein the feature selection layer evaluates the correlation between the input features and the data mining task by calculating correlation coefficients between the input features and the response variables of the data mining task, selects feature variables with correlation larger than a threshold value as selected features through a chi-square test, and performs real-time updating of newly added data features and periodic deletion of historical data features through an incremental learning window and a sliding window, and the data reconstruction layer performs time sequence reconstruction of the selected features through differencing, smoothing, filtering and standardization operations;
Creating a model set according to the selected characteristics and the time sequence processed by the data reconstruction layer, and adjusting model parameters through a self-adaptive parameter adjusting layer, wherein the self-adaptive parameter adjusting layer automatically adjusts the parameter configuration of the model based on inertia weight, adjusts the data mining model parameters to be optimal through an iterative search mode, sets control parameters and target parameters according to a historical optimal solution, and dynamically adjusts a neighborhood search range based on different neighborhood search strategies;
the functional attribute matching layer performs functional attribute matching selection on the models in the model set through a regular expression, the regular expression compares and matches the identified data mining task with the model set type to complete functional attribute screening of the initial data mining model, the parameter attribute matching layer performs parameter attribute matching selection on the data mining model according to a characteristic space fitting mode, and the characteristic space fitting mode performs parameter comparison and matching on the selected data features and the screened model to complete parameter attribute screening of the initial data mining model;
and the output layer outputs the selected model as an optimal decision scheme.
5. The machine learning-based data mining model construction method of claim 4, wherein the incremental learning window inputs new data points in an incremental calculation mode, and the new input data points are added to a data set in a time window, and the data set in the time window acquires the data mining model trained by the new data points by updating and iterating the initial data mining model in real time.
6. The method for constructing a machine learning based data mining model according to claim 4, wherein the inertia weight is adaptively updated in an iterative manner according to the global historical optimal fitness value; the model set is $M=\{m_1,m_2,\dots,m_J\}$, wherein $J$ is the number of models in the model set and $m_j$ is the $j$-th model in the model set; the parameter set of the $j$-th model in the model set is $P_j=\{p_{j,1},p_{j,2},\dots,p_{j,K_j}\}$, wherein $p_{j,k}$ is the $k$-th parameter of the $j$-th model in the model set and $K_j$ is the number of parameters of the $j$-th model in the model set; the inertia weight output function is:

$$W_j(t)=w_{\max}-\left(w_{\max}-w_{\min}\right)\cdot\frac{t}{T}\cdot g\!\left(a_j\right),\qquad a_j=\frac{p_{j,\max}-p_{j,\min}}{p_{j,\max}+p_{j,\min}}\tag{3}$$

In formula (3), $a_j$ is the weighting auxiliary value of the $j$-th model, $g(\cdot)$ is the weighting auxiliary function, $p_{j,\max}$ is the maximum parameter value of the $j$-th model in the model set, $p_{j,\min}$ is the minimum parameter value of the $j$-th model in the model set, $W_j(t)$ is the inertia weight of the $j$-th model at the $t$-th iteration, $w_{\max}$ is the initial maximum inertia weight, $w_{\min}$ is the initial minimum inertia weight, $t$ is the current iteration number, and $T$ is the total iteration number.
7. The machine learning-based data mining model construction method according to claim 1, wherein the component dynamic simulation test mechanism is configured to decompose the constructed data mining model into a plurality of components through a component object model, simulate the operation and response processes of the data mining model by using component simulators, define the interface interaction modes and parameter transfer processes between the plurality of components of the data mining model by using an interface definition language IDL, monitor the execution process of the data mining model and the change states of variables by using a component debugger, and automatically generate an operation event report of the data mining model by using a component logger.
8. A data mining model construction system based on machine learning, applied to the data mining model construction method based on machine learning according to any one of claims 1 to 7, characterized in that the data mining model construction system comprises:
The data acquisition module adopts an application program access interface and a data sensor network to acquire multi-source data in real time;
The data preprocessing module is used for carrying out data cleaning and data conversion processing on the acquired data through the real-time stream processing engine;
The data quality evaluation module comprehensively evaluates the integrity, accuracy, consistency and correlation of the data through a multi-element comprehensive analysis model, and carries out secondary processing on residual data through a self-supervised characterization learning model;
The data characteristic processing module comprises a characteristic extraction unit, a characteristic dimension reduction unit, a characteristic fusion unit and a characteristic division unit;
The model selection module is used for selecting an optimal initial data mining model by adopting a self-adaptive optimal decision model, and the self-adaptive optimal decision model selects functional attributes and model parameter attributes of the initial data mining model according to data characteristics and target tasks;
The model training module adopts a training set sample to carry out model training;
The model test evaluation module adopts a component dynamic simulation test mechanism and test set samples to carry out a simulation test on the constructed model, and adopts the visualization platform Plotly to visually display the simulation test process of the model;
The model optimization module adopts feedback adjustment and verification set samples to adjust and update model parameters;
The output end of the data acquisition module is connected with the input end of the data preprocessing module, the output end of the data preprocessing module is connected with the input end of the data quality evaluation module, the output end of the data quality evaluation module is connected with the input end of the data characteristic processing module, the output end of the data characteristic processing module is connected with the input end of the model selection module, the output end of the data characteristic processing module is connected with the input end of the model training module, the output end of the model selection module is connected with the input end of the model training module, the output end of the model training module is connected with the input end of the model test evaluation module, the output end of the model test evaluation module is connected with the input end of the model optimization module, and the output end of the model training module is in bidirectional connection with the model optimization module.
9. The machine learning based data mining model construction system according to claim 8, wherein the visualization platform Plotly obtains the associated data in the model simulation test process based on an associated data model, displays the change trends and association relationships of the data in real time by using interactive charts, heat maps and dashboards, and verifies the identity of accessing users by using a Token user identity verification mechanism.
CN202410775200.8A 2024-06-17 2024-06-17 Data mining model construction method and system based on machine learning Active CN118332034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410775200.8A CN118332034B (en) 2024-06-17 2024-06-17 Data mining model construction method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN118332034A CN118332034A (en) 2024-07-12
CN118332034B true CN118332034B (en) 2024-08-09

Family

ID=91770508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410775200.8A Active CN118332034B (en) 2024-06-17 2024-06-17 Data mining model construction method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN118332034B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955936A (en) * 2023-07-28 2023-10-27 深圳全企通信息技术有限公司 Enterprise big data algorithm attribute data prediction method
CN117235524A (en) * 2023-09-20 2023-12-15 北京红山信息科技研究院有限公司 Learning training platform of automatic valuation model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733530B2 (en) * 2016-12-08 2020-08-04 Resurgo, Llc Machine learning model evaluation in cyber defense
US11568181B2 (en) * 2019-01-29 2023-01-31 EMC IP Holding Company LLC Extraction of anomaly related rules using data mining and machine learning
CN117522329A (en) * 2023-11-17 2024-02-06 中通服和信科技有限公司 Enterprise management comprehensive application method for big data mining analysis modeling
CN117708756A (en) * 2023-12-19 2024-03-15 四川邮电职业技术学院 Data mining modeling platform based on machine learning
CN117931503A (en) * 2024-03-25 2024-04-26 华能澜沧江水电股份有限公司 Centralized control platform fault tolerance analysis system based on similar data pre-query

Also Published As

Publication number Publication date
CN118332034A (en) 2024-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant