CN117056834A - Big data analysis method based on decision tree - Google Patents

Big data analysis method based on decision tree

Info

Publication number
CN117056834A
Authority
CN
China
Prior art keywords
data
decision tree
model
algorithm
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311050733.1A
Other languages
Chinese (zh)
Inventor
索强
于天宇
任舟
汪智鹏
潘彦
郑晓晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuzi Technology Co ltd
Original Assignee
Shanghai Shuzi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuzi Technology Co ltd
Priority to CN202311050733.1A
Publication of CN117056834A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of big data analysis, in particular to a big data analysis method based on a decision tree. By integrating multi-modal data and unstructured data and extracting features with deep learning, the invention makes fuller use of various types of data. The SMOTE sampling method is used to process imbalanced data, improving classification accuracy on minority-class samples. A decision tree algorithm is used for feature selection, improving model accuracy. Fusing a decision tree with a deep learning model combines the advantages of both to build a stronger, more efficient and more robust model. A decision-tree-based abnormality detection algorithm helps prevent and discover problems, and an interpretability tool provides an easy-to-understand interpretation of the model.

Description

Big data analysis method based on decision tree
Technical Field
The invention relates to the technical field of big data analysis methods, in particular to a big data analysis method based on a decision tree.
Background
Big data analysis methods are processes for extracting, processing, analyzing, and understanding large-scale data sets using a variety of techniques and tools, including data preprocessing, data visualization, statistical analysis, machine learning, natural language processing, data mining, and time-series analysis. They play an important role in today's rapidly developing information age. With the spread of the internet, sensor technology, social media, and other large data sources, the volume of large-scale data sets has grown explosively. Such data contains a wealth of information and knowledge and, when analyzed and used correctly, can bring great commercial value to organizations and enterprises. By applying big data analysis methods in combination, hidden information in the data can be revealed, business insights provided, decisions and strategies optimized, and innovation and development promoted.
In practice, traditional data analysis methods process only a single type of data or structured data and handle multi-modal and unstructured data poorly. When processing imbalanced data, conventional methods often neglect minority-class samples, so classification results are biased toward majority-class samples. Feature selection relies primarily on human experience and may overlook important features. Traditional decision tree models may be unable to handle complex, high-dimensional data and are prone to overfitting. Conventional approaches typically provide only model results without any explanation, which may reduce users' understanding of and confidence in the results.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a big data analysis method based on a decision tree.
In order to achieve the above purpose, the present invention adopts the following technical scheme: the big data analysis method based on the decision tree comprises the following steps:
integrating multi-modal data, cleaning and normalizing, and extracting features of the multi-modal data by adopting a deep learning convolutional neural network and a natural language processing technology to obtain a preprocessing data set;
integrating unstructured data, and analyzing the unstructured data by adopting an NLP algorithm and a clustering algorithm to obtain an unstructured analysis result;
using an SMOTE sampling method, identifying and processing unbalanced data based on the preprocessed data set and an unstructured analysis result, and obtaining a balanced data set;
selecting features from the balanced dataset by using a decision tree algorithm comprising information gain and Gini coefficients, and obtaining a selected feature set;
constructing a basic decision tree model based on the selection feature set by using a CART algorithm;
fusing the basic decision tree model with a deep learning model by a fusion learning method that integrates a random forest and a deep neural network, to obtain a fused decision tree model;
in the big data analysis process, an online decision tree algorithm is adopted to analyze newly generated data in real time, and an online analysis result is generated;
based on an abnormality detection algorithm of the fusion decision tree model, performing abnormality detection on the online analysis result to generate an abnormality report;
and using an interpretability tool, specifically SHAP, to visually display the abnormality report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate a final report.
As a further aspect of the present invention, the multimodal data includes image data, audio data, text data;
the step of integrating the multi-modal data, cleaning and normalizing it, and extracting features from it with a deep learning convolutional neural network and natural language processing technology to obtain a preprocessed data set specifically comprises:
collecting the multi-modal data, and aligning each modal data in the multi-modal data with other modal data in time and space in the data integration process;
performing data cleaning on the integrated multi-mode data, including outlier detection, data filling and data denoising;
normalizing each mode data in the multi-mode data to a unified interval;
performing feature extraction on the image data by adopting the convolutional neural network, performing feature extraction on the text data by adopting the natural language processing technology, performing feature extraction on the audio data by using Mel-frequency cepstral coefficients (MFCC), and acquiring feature vectors based on the feature extraction;
and merging the feature vectors of different modes by using a multi-mode fusion technology to acquire the preprocessing data set.
As a further scheme of the invention, the integrated unstructured data is analyzed by adopting an NLP algorithm and a clustering algorithm, and the step of obtaining the unstructured analysis result comprises the following steps:
collecting the unstructured data, and aligning the unstructured data in time and space in the data integration process;
in the analysis process of the NLP algorithm, performing text word segmentation, named entity recognition, sentiment analysis, topic modeling and text classification, classifying texts in the unstructured data into predefined categories, and acquiring word segmentation results, sentiment tendencies, identified topics and classifications;
adopting a clustering algorithm, specifically k-means, to obtain clustering results including text clustering, image clustering and audio clustering;
and integrating the results of the NLP algorithm and the clustering algorithm to obtain a clustering analysis result of the unstructured data, and taking the clustering analysis result as the unstructured analysis result.
As a further scheme of the present invention, the step of identifying and processing unbalanced data based on the preprocessed data set and the unstructured analysis result by using the SMOTE sampling method, and obtaining a balanced data set specifically includes:
counting the number of samples of each category in the preprocessed data set and the unstructured analysis result to obtain a data category statistical result;
setting the categories to be enhanced and an enhancement strategy based on the data category statistical result, and generating enhancement strategy details;
based on the enhancement strategy details, applying the SMOTE algorithm to each category to be enhanced, finding the K nearest neighbors of samples in that category and generating synthetic samples from them as a synthetic sample set;
combining the preprocessed data set and the unstructured analysis result with the synthetic sample set to form a preliminary balanced data set;
based on the preliminary balanced data set, repeating the above steps until the number of samples in each category reaches the balance target, and obtaining a final balanced data set.
As a further aspect of the present invention, the step of selecting features from the balanced dataset by using a decision tree algorithm including information gain and Gini coefficients, and obtaining a selected feature set specifically includes:
loading the balanced data set, calculating statistical summaries of all features, including means and standard deviations, and generating a feature statistical summary;
calculating the information gain of each feature based on the feature statistical summary, and obtaining an information gain result;
calculating the Gini coefficient of each feature based on the feature statistical summary, and obtaining a Gini coefficient result;
and combining the information gain result and the Gini coefficient result to generate a selected feature set.
As a further scheme of the present invention, the step of constructing a basic decision tree model based on the selection feature set using CART algorithm specifically includes:
splitting the balanced data set corresponding to the selected feature set into a training set and a test set, which serve as training and test data;
training on the training set using the CART algorithm to obtain a CART model;
and validating the CART model on the test set to obtain a CART verification result.
As a further scheme of the invention, the step of fusing the basic decision tree model with the deep learning model by the fusion learning method integrating a random forest and a deep neural network, to obtain the fused decision tree model, specifically comprises:
training a model by using a random forest algorithm based on the selected feature set to obtain a random forest model;
constructing and training a deep neural network model based on the selected feature set;
and fusing the CART model, the random forest model and the deep neural network model by adopting a fusion algorithm to obtain a fused decision tree model.
As a further scheme of the invention, in the big data analysis process, the online decision tree algorithm is adopted to analyze the newly generated data in real time, and the step of generating the online analysis result comprises the following steps:
in the big data analysis process, receiving real-time newly generated data as a real-time data stream;
cleaning, normalizing and extracting features of the real-time data stream to obtain a preprocessed data stream;
real-time analysis is carried out on the preprocessed data stream by using an online decision tree algorithm, and an online decision tree analysis result is obtained;
and comparing the analysis result of the online decision tree with a real data label, and evaluating the real-time performance of the model to obtain an online performance evaluation result.
As a further scheme of the present invention, the abnormality detection algorithm based on the fused decision tree model performs abnormality detection on the online analysis result, and the step of generating an abnormality report specifically includes:
loading the pre-trained fusion decision tree model to be used as a pre-loading fusion model;
performing anomaly detection on the analysis result of the online decision tree by using the preloaded fusion model to obtain a preliminary anomaly detection result;
and marking and classifying abnormal data points in the preliminary abnormal detection result, obtaining marked abnormal data, and generating an abnormal report.
As a further scheme of the present invention, the step of using an interpretability tool, specifically SHAP, to visually display the abnormality report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate a final report specifically includes:
loading the SHAP library and its dependent resources as a SHAP resource set;
generating an explanation for the fused decision tree model by using the SHAP resource set, and acquiring a fusion model interpretation;
using the SHAP resource set to visually display the abnormality report, as an abnormal data visualization result;
and integrating the fusion model interpretation and the abnormal data visualization result to obtain a comprehensive analysis report.
Compared with the prior art, the invention has the advantages and positive effects that:
In the present invention, integrating multi-modal data and unstructured data and performing deep learning feature extraction allows various types of data to be used more comprehensively and richer information to be extracted. Processing imbalanced data with the SMOTE sampling method reduces model bias and improves classification accuracy on minority-class samples. Selecting features with the decision tree algorithm identifies the key features affecting the result more accurately and improves model accuracy. Fusing the decision tree with the deep learning model combines the advantages of both to build a more powerful, efficient and robust model. Real-time analysis with the online decision tree algorithm responds quickly to newly generated data and improves the timeliness of analysis. The decision-tree-based abnormality detection algorithm identifies abnormal data accurately, which helps prevent and discover problems. The SHAP interpretability tool provides users with easy-to-understand model interpretation, enhancing their understanding of and trust in the results.
Drawings
FIG. 1 is a schematic diagram showing the main steps of a big data analysis method based on decision tree according to the present invention;
FIG. 2 is a detailed schematic diagram of step 1 of the big data analysis method based on decision tree;
FIG. 3 is a detailed schematic diagram of step 2 of the big data analysis method based on decision tree according to the present invention;
FIG. 4 is a detailed schematic diagram of step 3 of the big data analysis method based on decision tree according to the present invention;
FIG. 5 is a detailed schematic diagram of step 4 of the big data analysis method based on decision tree according to the present invention;
FIG. 6 is a detailed schematic diagram of step 5 of the big data analysis method based on decision tree according to the present invention;
FIG. 7 is a detailed schematic diagram of step 6 of the big data analysis method based on decision tree according to the present invention;
FIG. 8 is a detailed schematic diagram of step 7 of the big data analysis method based on decision tree according to the present invention;
FIG. 9 is a detailed schematic diagram of step 8 of the big data analysis method based on decision tree according to the present invention;
FIG. 10 is a detailed schematic diagram of step 9 of the big data analysis method based on decision tree according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, in the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Example 1
Referring to fig. 1, the present invention provides a technical solution: the big data analysis method based on the decision tree comprises the following steps:
integrating multi-modal data, cleaning and normalizing, and extracting features of the multi-modal data by adopting a deep learning convolutional neural network and a natural language processing technology to obtain a preprocessing data set;
integrating unstructured data, and analyzing the unstructured data by adopting an NLP algorithm and a clustering algorithm to obtain an unstructured analysis result;
using an SMOTE sampling method, identifying and processing unbalanced data based on a preprocessed data set and an unstructured analysis result, and obtaining a balanced data set;
selecting features from the balanced dataset by using a decision tree algorithm comprising information gain and Gini coefficients, and acquiring a selected feature set;
constructing a basic decision tree model based on the selection feature set by using a CART algorithm;
fusing the basic decision tree model with a deep learning model by a fusion learning method that integrates a random forest and a deep neural network, to obtain a fused decision tree model;
in the big data analysis process, an online decision tree algorithm is adopted to analyze newly generated data in real time, and an online analysis result is generated;
based on an abnormality detection algorithm fused with the decision tree model, performing abnormality detection on an online analysis result to generate an abnormality report;
and using an interpretability tool, specifically SHAP, to visually display the abnormality report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate a final report.
The cleaning and normalization steps reduce noise and inconsistency in the multi-modal data and improve data quality and consistency. Feature extraction with the convolutional neural network and natural language processing technology yields rich, representative feature information, so that the information in the multi-modal data is fully used. Analyzing the unstructured data with the NLP algorithm and the clustering algorithm extracts useful information and patterns from data such as text and images, and supplies supplementary features for the subsequent decision tree model. Processing imbalanced data with the SMOTE sampling method balances the class distribution of the samples, improves the model's classification performance on minority classes, and supports the robustness and accuracy of the model. The decision tree algorithm using information gain and Gini coefficients selects the most representative features from the balanced data set, which reduces feature dimensionality, improves model efficiency and helps explain the model's decision process. Fusion learning of the basic decision tree model and the deep learning model draws on the advantages of both and improves the generalization ability and accuracy of the model. In addition, the predictions of the fused decision tree model can be interpreted with an interpretability tool such as SHAP, improving the interpretability and credibility of the model.
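The fusion of the CART, random forest and deep neural network outputs can be illustrated with a minimal sketch. The source does not specify the fusion algorithm, so weighted soft voting (a weighted average of class probabilities) is an assumption here, and the function names, probability values and weights are all hypothetical.

```python
from typing import List, Sequence


def fuse_probabilities(model_probs: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Weighted average of per-class probabilities from several models."""
    total = sum(weights)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused probability."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical class-probability outputs for one sample from the three models.
cart_probs = [1.0, 0.0]     # a single tree often outputs hard 0/1 probabilities
forest_probs = [0.4, 0.6]
dnn_probs = [0.25, 0.75]

# Assumed weights: the two ensemble/deep models count twice as much as the tree.
fused = fuse_probabilities([cart_probs, forest_probs, dnn_probs],
                           weights=[1.0, 2.0, 2.0])
label = predict(fused)
```

In this sketch the single tree votes for class 0, but the weighted average favors class 1. Other fusion schemes, such as stacking with a meta-learner, would fit the description equally well.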
Analyzing newly generated data in real time with the online decision tree algorithm keeps the analysis timely, and the abnormality detection algorithm based on the fused decision tree model allows abnormal situations to be found and reported quickly. The visualization tool displays the abnormality report and the fused decision tree model, helping users intuitively understand the analysis results and the model's decision process, and yields a comprehensive final report that supports decision making.
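To make the interpretation step more concrete, the following sketch computes exact Shapley values by enumerating feature subsets, which is the quantity that SHAP approximates. It does not use the SHAP library itself. The toy model, the instance and the all-zero baseline are hypothetical stand-ins for the fused decision tree model and its inputs.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List, Sequence


def shapley_values(predict: Callable[[List[float]], float],
                   instance: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of each feature for one prediction."""
    n = len(instance)

    def value(subset: FrozenSet[int]) -> float:
        # Features in the subset take the instance's values;
        # all other features stay at the baseline.
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x: List[float]) -> float:
    # Hypothetical stand-in for the fused model: one additive term
    # and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved. Exact enumeration costs on the order of 2^n model evaluations per feature, which is why practical tools rely on approximations or model-specific algorithms.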
Referring to fig. 2, the multimodal data includes image data, audio data, text data;
integrating multi-modal data, cleaning and normalizing, and performing feature extraction on the multi-modal data by adopting a deep learning convolutional neural network and a natural language processing technology, wherein the step of acquiring a preprocessing data set comprises the following steps:
collecting multi-modal data, and aligning each modal data in the multi-modal data with other modal data in time and space in the data integration process;
the integrated multi-mode data is subjected to data cleaning, including abnormal value detection, data filling and data denoising;
normalizing each mode data in the multi-mode data to a unified interval;
performing feature extraction on the image data by adopting a convolutional neural network, performing feature extraction on the text data by adopting a natural language processing technology, performing feature extraction on the audio data by using Mel-frequency cepstral coefficients (MFCC), and acquiring feature vectors based on the feature extraction;
and merging the feature vectors of different modes by using a multi-mode fusion technology to acquire a preprocessing data set.
Referring to fig. 3, the steps of integrating unstructured data and analyzing the unstructured data by using an NLP algorithm and a clustering algorithm to obtain unstructured analysis results are specifically as follows:
collecting unstructured data, and aligning the unstructured data in time and space in the data integration process;
in the analysis process of the NLP algorithm, performing text word segmentation, named entity recognition, sentiment analysis, topic modeling and text classification, classifying texts in the unstructured data into predefined categories, and acquiring word segmentation results, sentiment tendencies, identified topics and classifications;
adopting a clustering algorithm, specifically k-means, to obtain clustering results including text clustering, image clustering and audio clustering;
and integrating the results of the NLP algorithm and the clustering algorithm to obtain a clustering analysis result of the unstructured data, and taking the clustering analysis result as the unstructured analysis result.
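The k-means clustering named in the steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the input vectors are hypothetical two-dimensional features, and initializing the centroids from the first k points is a simplification chosen for determinism (production code typically uses random or k-means++ initialization).

```python
from typing import List, Tuple

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: List[Point]) -> Point:
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points: List[Point], k: int,
            iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Lloyd's algorithm: alternate assignment and centroid update."""
    # Simplified, deterministic initialization: the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical feature vectors, e.g. reduced text or image embeddings.
vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = k_means(vectors, k=2)
```

The same routine applies to text, image or audio features once each item has been converted to a fixed-length numeric vector.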
First, in the data integration process, the multi-modal data are aligned to ensure that they are consistent in time and space. Second, data cleaning is performed, including outlier detection, data filling and denoising, to improve data quality and reduce the interference of outliers. The data of the different modalities are then normalized to a unified interval, which removes differences in scale and range and makes the data comparable. Next, representative feature vectors are extracted from the images, text, and audio by means of a convolutional neural network, natural language processing, and audio feature extraction, respectively. Finally, the feature vectors of the modalities are merged using a multi-modal fusion technique to obtain a comprehensive preprocessed data set. This flow makes the fullest use of the information in the multi-modal data, improves data quality and consistency, and provides a more accurate and comprehensive preprocessed data set for subsequent tasks. Taken together, these steps exploit the richness of the multi-modal data and improve the results of data analysis and model construction.
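The normalization and fusion steps described above could look like the following sketch. It assumes min-max scaling to the interval [0, 1] and early fusion by concatenation, neither of which is specified in the source, and the feature values are hypothetical placeholders for the outputs of the CNN, NLP and MFCC extractors.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(rows: List[List[float]]) -> List[List[float]]:
    """Normalize each column of a sample-by-feature matrix independently."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def fuse_modalities(*modalities: List[List[float]]) -> List[List[float]]:
    """Early fusion: concatenate per-sample feature vectors across modalities."""
    n = len(modalities[0])
    if any(len(m) != n for m in modalities):
        raise ValueError("all modalities must be aligned to the same samples")
    return [[x for m in modalities for x in m[i]] for i in range(n)]


# Hypothetical, already-extracted features for three aligned samples.
image_features = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
text_features = [[2.0], [4.0], [6.0]]
audio_features = [[-1.0], [0.0], [1.0]]

preprocessed = fuse_modalities(
    normalize_columns(image_features),
    normalize_columns(text_features),
    normalize_columns(audio_features),
)
```

Each row of `preprocessed` is one sample whose image, text and audio features now share the same scale, so no single modality dominates distance-based or gradient-based steps later in the pipeline.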
Referring to fig. 4, using the SMOTE sampling method, based on the preprocessed data set and the unstructured analysis result, the steps of identifying and processing unbalanced data, and obtaining the balanced data set are specifically:
counting the number of samples of each category in the preprocessed data set and the unstructured analysis result to obtain a data category statistical result;
setting the categories to be enhanced and an enhancement strategy based on the data category statistical result, and generating enhancement strategy details;
based on the enhancement strategy details, applying the SMOTE algorithm to each category to be enhanced, finding the K nearest neighbors of samples in that category and generating synthetic samples from them as a synthetic sample set;
combining the preprocessed data set and the unstructured analysis result with the synthetic sample set to form a preliminary balanced data set;
based on the preliminary balanced data set, repeating the above steps until the number of samples in each category reaches the balance target, and obtaining a final balanced data set.
First, the preprocessed data set and the unstructured analysis result are tallied to obtain the number of samples in each category. Based on these statistics, the categories to be enhanced and the corresponding enhancement strategy are set. The SMOTE algorithm is then applied to generate synthetic samples for each category to be enhanced, using the K nearest neighbors of samples in that category, and these form the synthetic sample set. Next, the preprocessed data set, the unstructured analysis result, and the synthetic sample set are combined to form a preliminary balanced data set. These steps are repeated until the number of samples in each category reaches the balance target, giving the final balanced data set. In this way, imbalanced data is handled effectively: the sample distribution across categories is balanced, the model learns minority classes better, and the influence of class imbalance on model training and performance evaluation is reduced. The resulting balanced data set can improve the robustness, accuracy and overall predictive ability of the model.
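The core of SMOTE can be sketched as follows: for each synthetic sample, a minority-class sample is picked, one of its K nearest minority neighbors is chosen, and a new point is interpolated on the line segment between them. The sample values, the choice of K = 2 and the fixed random seed are assumptions made for illustration only.

```python
import random
from typing import List

Point = List[float]


def k_nearest(sample: Point, pool: List[Point], k: int) -> List[Point]:
    """Return the k points in pool closest to sample, excluding sample itself."""
    others = [p for p in pool if p is not sample]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(sample, p)))
    return others[:k]


def smote(minority: List[Point], n_synthetic: int,
          k: int = 2, seed: int = 0) -> List[Point]:
    """Generate n_synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbor = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature data: 8 majority samples, 3 minority samples.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8]]

# Balance target assumed here: equal class sizes.
new_samples = smote(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + new_samples
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already occupied by the minority class instead of simply duplicating existing rows.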
Referring to Fig. 5, the step of selecting features from the balanced data set using a decision tree algorithm that includes information gain and Gini coefficients, and obtaining a selected feature set, is specifically:
loading the balanced data set and calculating a statistical summary of every feature, including mean and standard deviation, to generate a feature statistical summary;
calculating the information gain of each feature based on the feature statistical summary, and obtaining an information gain result;
calculating the Gini coefficient of each feature based on the feature statistical summary, and obtaining a Gini coefficient result;
and combining the information gain result and the Gini coefficient result to generate the selected feature set.
Calculating a statistical summary of every feature in the balanced data set, including mean and standard deviation, yields descriptive statistics that provide a basis for subsequent feature selection. Based on the feature statistical summary, the information gain of each feature is calculated to estimate how much that feature contributes to predicting the target variable. Information gain helps identify features with higher predictive power, which are used to construct a decision tree model with better classification ability. Also based on the feature statistical summary, the Gini coefficient of each feature is calculated to measure the impurity of the data after splitting on that feature. Because the Gini coefficient measures how mixed the classes remain after a split, selecting features with a lower Gini coefficient can improve the classification accuracy of the decision tree. The information gain result and the Gini coefficient result are then analyzed together to generate the selected feature set, which determines which features should serve as decision nodes when constructing the decision tree model.
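The two criteria can be sketched as follows for a single numeric feature. This is a minimal sketch under stated assumptions: the feature is split by a threshold, candidate thresholds are midpoints between sorted distinct values, and the helper names (`entropy`, `gini`, `split_scores`, `best_split`) are hypothetical. How the patent combines the two scores into a final feature ranking is not specified, so that step is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(values, labels, threshold):
    """Return (information gain, weighted Gini) of the split value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    children_entropy = w_left * entropy(left) + w_right * entropy(right)
    children_gini = w_left * gini(left) + w_right * gini(right)
    return entropy(labels) - children_entropy, children_gini


def best_split(values, labels):
    """Best (gain, weighted Gini, threshold) over midpoints of distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(*split_scores(values, labels, t), t) for t in candidates]
    # Prefer high information gain; break ties by low Gini impurity.
    return max(scored, key=lambda s: (s[0], -s[1]), default=(0.0, gini(labels), None))
```

A feature whose best split has high gain and low weighted Gini separates the classes well; a feature with a constant value has no candidate threshold and scores a gain of zero.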
Referring to Fig. 6, the step of constructing the basic decision tree model based on the selected feature set using the CART algorithm is specifically:
splitting the balanced data set corresponding to the selected feature set into a training set and a test set, which together serve as training test data;
training on the training set in the training test data using the CART algorithm to obtain a CART model;
and verifying the CART model on the test set in the training test data to obtain a CART verification result.
First, the balanced data set corresponding to the selected feature set is split into a training set and a test set for building and verifying the model. The CART algorithm is then trained on the training set; by recursively partitioning on the features, it generates a decision tree model with internal nodes and leaf nodes. At each node, the CART algorithm selects the optimal partition according to the data characteristics and establishes a decision rule so that the tree can classify the data. The trained CART model is then verified on the test set: the model makes predictions, and the predictions are compared with the real labels to obtain the CART verification result. The verification result is used to evaluate the model's performance on unseen data, that is, its generalization ability and classification accuracy. These steps construct the basic decision tree model, and the verification and evaluation guide improvements to its classification ability and accuracy. Using the selected feature set to choose optimal partitions, CART serves as a simple and effective classification and regression method with wide applicability.
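The split, train, and verify sequence can be sketched with a small CART-style classifier in plain Python. This is an illustrative sketch, not the patented model: it assumes numeric features, binary threshold splits scored by weighted Gini impurity, a fixed `max_depth`, and a 75/25 split. All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow a binary tree; a leaf is the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None  # (weighted Gini, feature index, threshold)
    for j in range(len(rows[0])):
        distinct = sorted({r[j] for r in rows})
        for a, b in zip(distinct, distinct[1:]):
            t = (a + b) / 2
            left = [lab for r, lab in zip(rows, labels) if r[j] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # every feature is constant, so nothing can be split
        return majority
    _, j, t = best
    left_idx = [i for i, r in enumerate(rows) if r[j] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow decision rules from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle indices and cut them into training and test portions."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def accuracy(tree, rows, labels):
    """Fraction of rows whose predicted label equals the real label."""
    return sum(predict(tree, r) == lab for r, lab in zip(rows, labels)) / len(labels)
```

Calling `train_test_split`, then `build_tree` on the training portion, then `accuracy` on the test portion mirrors the three listed steps. The accuracy value plays the role of the CART verification result.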
Referring to Fig. 7, the step of fusing the basic decision tree model with the deep learning model, using a fusion learning method that integrates a random forest and a deep neural network, to obtain the fused decision tree model is specifically:
training a model using a random forest algorithm based on the selected feature set to obtain a random forest model;
constructing and training a deep neural network model based on the selected feature set;
and fusing the CART model, the random forest model, and the deep neural network model using a fusion algorithm to obtain the fused decision tree model.
Based on the selected feature set, a model is first trained on the training data using a random forest algorithm. Random forest is an ensemble learning method: it constructs multiple decision trees from randomly selected subsets of the features and data, and combines them through strategies such as voting or averaging. The trained random forest model aggregates the predictions of many decision trees, which improves the classification accuracy and robustness of the model. Next, a deep neural network model is constructed and trained based on the selected feature set. Deep neural networks are powerful learning models that can learn higher-level abstract features from data and perform complex classification or regression tasks. With a suitable network structure and optimization algorithm, the trained deep neural network model has strong pattern recognition and generalization ability. Finally, the basic decision tree model, the random forest model, and the deep neural network model are fused using a fusion algorithm. Following the idea of ensemble learning, the fusion algorithm combines the predictions of the individual models to obtain the fused decision tree model. Common fusion approaches include voting, weighted averaging, and stacking.
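Two of the fusion approaches named above, weighted averaging and voting, can be sketched as below. The patent does not say which fusion algorithm or which weights it uses, so the function names, the per-model probability dictionaries, and the weights in the usage note are all hypothetical.

```python
from collections import Counter


def fuse_soft_vote(prob_rows, weights):
    """Weighted average of per-model class probabilities.

    prob_rows: one dict per model mapping class label -> predicted probability.
    weights:   one non-negative weight per model (assumed, e.g. from validation).
    Returns (fused label, fused probability dict).
    """
    total = sum(weights)
    classes = sorted({c for probs in prob_rows for c in probs})
    fused = {
        c: sum(w * probs.get(c, 0.0) for w, probs in zip(weights, prob_rows)) / total
        for c in classes
    }
    return max(fused, key=fused.get), fused


def fuse_majority_vote(labels):
    """Hard voting: the label predicted by the most models wins."""
    return Counter(labels).most_common(1)[0][0]
```

For example, given CART, random forest, and deep neural network outputs for one sample, `fuse_soft_vote([cart_probs, rf_probs, dnn_probs], weights=[1, 2, 2])` lets the two stronger models outvote a confident but isolated CART prediction. Stacking, the third approach mentioned, would instead train a further model on these outputs.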
Referring to Fig. 8, the step of analyzing newly generated data in real time during big data analysis, using an online decision tree algorithm, and generating an online analysis result is specifically:
during big data analysis, receiving newly generated data in real time as a real-time data stream;
cleaning, standardizing, and extracting features from the real-time data stream to obtain a preprocessed data stream;
analyzing the preprocessed data stream in real time using the online decision tree algorithm to obtain an online decision tree analysis result;
and comparing the online decision tree analysis result with the real data labels to evaluate the real-time performance of the model, obtaining an online performance evaluation result.
In big data analysis, newly generated data must first be received in real time, which may be implemented with a data stream processing framework or a streaming data processing system. The received data contains the latest information and can be analyzed and acted on in real time. The real-time data stream then undergoes preprocessing steps such as cleaning, standardization, and feature extraction. Cleaning removes noise and abnormal values, standardization converts the data to a common specification, and feature extraction derives meaningful features from the raw data, which serve as input to the subsequent online decision tree algorithm. The preprocessed data stream is analyzed in real time using the online decision tree algorithm. An online decision tree algorithm adapts to the data stream: it dynamically updates and adjusts the decision tree as new data arrives, and it processes large-scale data and real-time data streams efficiently. The online decision tree analysis result is then compared with the real data labels to evaluate the real-time performance of the model. This helps verify the accuracy and reliability of the model in a real-time environment, as well as its ability to adapt quickly to new data. The real-time performance evaluation result allows problems with the model to be found promptly so that the model can be adjusted and improved.
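The evaluation step can be sketched as a test-then-train (prequential) loop, in which each record is predicted before its label is used for learning. The patent does not name a specific online decision tree. Hoeffding trees are one well-known family, and `hoeffding_bound` shows the statistic they use to decide when enough records have arrived to commit to a split. `MajorityClassLearner` is only a hypothetical stand-in with the same `predict`/`learn` interface, not a decision tree.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class MajorityClassLearner:
    """Stand-in for an online decision tree (assumption for illustration)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # Predict the most frequent label seen so far; None before any data.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, features, label):
        self.counts[label] += 1


def prequential_accuracy(model, stream):
    """Predict each record first, then learn from it; return running accuracy."""
    correct = seen = 0
    history = []
    for features, label in stream:
        correct += model.predict(features) == label
        seen += 1
        model.learn(features, label)
        history.append(correct / seen)
    return history
```

The returned list is one possible form of the online performance evaluation result: a falling tail suggests the model is no longer adapting to the stream. An incremental learner such as a Hoeffding tree could be dropped in wherever it offers the same two methods.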
Referring to Fig. 9, the step of performing anomaly detection on the online analysis result, using an anomaly detection algorithm based on the fused decision tree model, and generating an anomaly report is specifically:
loading the pre-trained fused decision tree model as a preloaded fusion model;
performing anomaly detection on the online decision tree analysis result using the preloaded fusion model to obtain a preliminary anomaly detection result;
and labeling and classifying the abnormal data points in the preliminary anomaly detection result to obtain labeled abnormal data, and generating the anomaly report.
First, the pre-trained fused decision tree model is loaded as the preloaded fusion model so that it is available for the subsequent anomaly detection. The preloaded fused decision tree model is then used to perform anomaly detection on the online analysis result: the online analysis result is input into the fusion model, and each data point is judged to be abnormal or not according to the model's prediction. By comparing the online analysis result with the predicted result, this process detects potentially abnormal data points. The abnormal data points in the preliminary anomaly detection result are then labeled and classified. Depending on the specific need, this step may label, sort, and group the abnormal data points for subsequent report generation and further processing. Labeling and classification may be based on the characteristics of the abnormal points and on context information such as anomaly type and severity. Finally, the anomaly report is generated from the labeled abnormal data points. The anomaly report may include detailed information about each abnormal data point, such as data values, timestamps, and anomaly types, together with associated statistics and analysis results. The generated anomaly report helps the user quickly understand the abnormal conditions and take corresponding countermeasures.
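One possible reading of this comparison can be sketched as follows. The patent does not define the decision rule, so this sketch assumes a data point is abnormal when the fused model's label disagrees with the online analysis label, and that severity depends on the fused model's confidence. The record layout, the `fused_predict` callable, and the `high_conf` threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AnomalyRecord:
    timestamp: str
    features: tuple
    online_label: str
    fused_label: str
    severity: str


def detect_anomalies(records, fused_predict, high_conf=0.9):
    """Flag records where the fused model disagrees with the online result.

    records:       iterable of (timestamp, features, online_label).
    fused_predict: callable mapping features -> (label, confidence); it stands
                   in for the preloaded fusion model (assumption).
    """
    report = []
    for timestamp, features, online_label in records:
        fused_label, confidence = fused_predict(features)
        if fused_label != online_label:
            severity = "high" if confidence >= high_conf else "low"
            report.append(
                AnomalyRecord(timestamp, features, online_label, fused_label, severity)
            )
    return report


def severity_counts(report):
    """Summary statistics for the report header: anomalies per severity."""
    return Counter(record.severity for record in report)
```

Each `AnomalyRecord` carries the data values, timestamp, and classification that the description lists as report contents. `severity_counts` is one example of the accompanying statistics.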
Referring to Fig. 10, the step of using an interpretability tool, specifically SHAP, to visually display the anomaly report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate the final report, is specifically:
loading the SHAP library and its dependent resources as a SHAP resource set;
generating an interpretation of the fused decision tree model using the SHAP resource set to obtain a fusion model interpretation;
visually displaying the anomaly report using the SHAP resource set to obtain an abnormal data visualization result;
and integrating the fusion model interpretation and the abnormal data visualization result to obtain a comprehensive analysis report.
First, the SHAP library and its dependent resources are loaded so that SHAP's functions and tools can be used. This includes installing the SHAP library, loading the TreeExplainer interpreter, and loading any other dependent resources required for processing. The prepared SHAP resource set is then used to generate an interpretation of the fused decision tree model. SHAP explains a model by calculating the importance of each feature and its contribution to the model's predictions. This helps clarify the importance and influence of each feature in the fusion model and the reasons behind the model's predictions. The SHAP resource set is also used to visually display the anomaly report. Visualization tools and techniques present the abnormal data points, feature values, and other related information graphically, so that the user can understand and analyze the anomalies intuitively. Finally, the fusion model interpretation and the abnormal data visualization result are integrated. Combining the interpretation with the visualization provides a more complete and accurate comprehensive analysis report and helps the user understand and interpret abnormal conditions in depth.
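The quantity that SHAP attributes to each feature is a Shapley value. The sketch below computes Shapley values exactly by brute force for a small model, without the SHAP library, to show what the interpretation step produces. It assumes that features absent from a coalition are replaced by a fixed `baseline` value. The function name and the toy model in the usage note are hypothetical. Exact enumeration is only feasible for a handful of features, which is why the SHAP library relies on faster model-specific or approximate explainers.

```python
import math
from itertools import combinations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    predict:  callable taking a list of feature values and returning a number.
    x:        the sample being explained.
    baseline: reference values used for features outside the coalition.
    """
    n = len(x)

    def value(coalition):
        # Model output when only the coalition's features take their real values.
        return predict([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding feature i.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

For a toy model `f(v) = 2*v[0] + 3*v[1] + v[0]*v[2]` explained at `x = [1, 1, 1]` against a zero baseline, the interaction term is shared equally between features 0 and 2. The contributions sum to `f(x) - f(baseline)`, and this additivity is what makes per-feature bar or waterfall plots meaningful in a report.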
Working principle: data integration and preprocessing form one of the key stages of the data analysis. The goal of this stage is to collect multi-modal data, such as image data, audio data, and text data, and to ensure that the data of the various modalities are aligned in time and space during integration. The integrated multi-modal data is then cleaned and normalized to improve data quality and consistency. Feature extraction is another important preprocessing step, in which convolutional neural networks and natural language processing techniques (e.g., word embedding and text feature extraction) are used to extract useful features from the multi-modal data, yielding the preprocessed data set. At the same time, unstructured data is collected, integrated, and analyzed to obtain the unstructured analysis result. To handle unbalanced data, the SMOTE sampling method is applied to the preprocessed data set and the unstructured analysis result to identify and process the imbalance and generate a balanced data set. Based on the balanced data set, feature selection is performed using the information gain and Gini coefficient criteria of the decision tree algorithm, yielding the selected feature set. The basic decision tree model is constructed from the selected feature set using the CART algorithm: the data is split into a training set and a test set, the model is trained on the training set with the CART algorithm, and its performance is verified on the test set. In the construction stage of the fused decision tree model, a random forest model and a deep neural network model are trained. The basic decision tree model, the random forest model, and the deep neural network model are then fused to obtain the fused decision tree model.
During online analysis and anomaly detection, data generated in real time is processed as a data stream. The data stream undergoes preprocessing steps such as cleaning, standardization, and feature extraction to obtain the preprocessed data stream. The preprocessed data stream is analyzed in real time using the online decision tree algorithm to generate the online analysis result. The anomaly detection algorithm based on the fused decision tree model then performs anomaly detection on the online analysis result and generates the anomaly report. Finally, visual display and interpretation are carried out using a dedicated interpretability tool (e.g., SHAP). The SHAP library and its dependent resources are loaded as the SHAP resource set, which is used to generate an interpretation of the fused decision tree model and to visually display the anomaly report, producing the abnormal data visualization result. The fusion model interpretation and the abnormal data visualization result are integrated to generate the comprehensive analysis report.
The present invention is not limited to the above embodiments. The technical content disclosed above may be changed or modified into equivalent embodiments and applied to other fields; however, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the present invention still falls within the scope of the technical disclosure.

Claims (10)

1. A big data analysis method based on a decision tree, characterized by comprising the following steps:
integrating multi-modal data, cleaning and normalizing it, and extracting features from the multi-modal data using a deep learning convolutional neural network and natural language processing technology to obtain a preprocessed data set;
integrating unstructured data, and analyzing the unstructured data using an NLP algorithm and a clustering algorithm to obtain an unstructured analysis result;
using a SMOTE sampling method, identifying and processing unbalanced data based on the preprocessed data set and the unstructured analysis result to obtain a balanced data set;
selecting features from the balanced data set using a decision tree algorithm comprising information gain and Gini coefficients to obtain a selected feature set;
constructing a basic decision tree model based on the selected feature set using a CART algorithm;
fusing the basic decision tree model with a deep learning model, using a fusion learning method that integrates a random forest and a deep neural network, to obtain a fused decision tree model;
during big data analysis, analyzing newly generated data in real time using an online decision tree algorithm to generate an online analysis result;
performing anomaly detection on the online analysis result using an anomaly detection algorithm based on the fused decision tree model to generate an anomaly report;
and using an interpretability tool, specifically SHAP, to visually display the anomaly report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate a final report.
2. The big data analysis method based on a decision tree according to claim 1, wherein the multi-modal data comprises image data, audio data, and text data;
the step of integrating multi-modal data, cleaning and normalizing it, and extracting features from the multi-modal data using a deep learning convolutional neural network and natural language processing technology to obtain a preprocessed data set is specifically:
collecting the multi-modal data and, during data integration, aligning the data of each modality with the data of the other modalities in time and space;
performing data cleaning on the integrated multi-modal data, including outlier detection, data filling, and data denoising;
normalizing the data of each modality in the multi-modal data to a unified interval;
extracting features from the image data using the convolutional neural network, from the text data using the natural language processing technology, and from the audio data using MFCC, and obtaining feature vectors from the extracted features;
and merging the feature vectors of the different modalities using a multi-modal fusion technique to obtain the preprocessed data set.
3. The big data analysis method based on a decision tree according to claim 1, wherein the step of integrating unstructured data and analyzing the unstructured data using an NLP algorithm and a clustering algorithm to obtain an unstructured analysis result is specifically:
collecting the unstructured data, and aligning the unstructured data in time and space during data integration;
in the NLP analysis, classifying the texts in the unstructured data into predefined categories based on word segmentation, named entity recognition, sentiment analysis, topic modeling, and text classification operations, and obtaining word segmentation results, sentiment tendencies, identified topics, and classifications;
using a clustering algorithm, specifically k-means, to obtain clustering results including text clustering, image clustering, and audio clustering;
and integrating the results of the NLP algorithm and the clustering algorithm to obtain a clustering analysis result of the unstructured data, which is taken as the unstructured analysis result.
4. The big data analysis method based on a decision tree according to claim 1, wherein the step of using a SMOTE sampling method to identify and process unbalanced data based on the preprocessed data set and the unstructured analysis result to obtain a balanced data set is specifically:
tallying the preprocessed data set and the unstructured analysis result, counting the number of samples in each category to obtain a data category statistical result;
determining, based on the data category statistical result, the categories to be enhanced and the enhancement strategy, and generating enhancement strategy details;
based on the enhancement strategy details, applying the SMOTE algorithm to the categories to be enhanced, finding the K nearest neighbors of the samples in those categories and generating synthetic samples from them as a synthetic sample set;
combining the preprocessed data set and the unstructured analysis result with the synthetic sample set to form a preliminary balanced data set;
based on the preliminary balanced data set, repeating the above steps until the number of samples in each category reaches the balance target, obtaining a final balanced data set.
5. The big data analysis method based on a decision tree according to claim 1, wherein the step of selecting features from the balanced data set using a decision tree algorithm comprising information gain and Gini coefficients to obtain a selected feature set is specifically:
loading the balanced data set and calculating a statistical summary of every feature, including mean and standard deviation, to generate a feature statistical summary;
calculating the information gain of each feature based on the feature statistical summary, and obtaining an information gain result;
calculating the Gini coefficient of each feature based on the feature statistical summary, and obtaining a Gini coefficient result;
and combining the information gain result and the Gini coefficient result to generate the selected feature set.
6. The big data analysis method based on a decision tree according to claim 1, wherein the step of constructing a basic decision tree model based on the selected feature set using a CART algorithm is specifically:
splitting the balanced data set corresponding to the selected feature set into a training set and a test set, which together serve as training test data;
training on the training set in the training test data using the CART algorithm to obtain a CART model;
and verifying the CART model on the test set in the training test data to obtain a CART verification result.
7. The big data analysis method based on a decision tree according to claim 1, wherein the step of fusing the basic decision tree model with the deep learning model, using a fusion learning method that integrates a random forest and a deep neural network, to obtain a fused decision tree model is specifically:
training a model using a random forest algorithm based on the selected feature set to obtain a random forest model;
constructing and training a deep neural network model based on the selected feature set;
and fusing the CART model, the random forest model, and the deep neural network model using a fusion algorithm to obtain the fused decision tree model.
8. The big data analysis method based on a decision tree according to claim 1, wherein the step of analyzing newly generated data in real time during big data analysis, using an online decision tree algorithm, to generate an online analysis result is specifically:
during big data analysis, receiving newly generated data in real time as a real-time data stream;
cleaning, standardizing, and extracting features from the real-time data stream to obtain a preprocessed data stream;
analyzing the preprocessed data stream in real time using the online decision tree algorithm to obtain an online decision tree analysis result;
and comparing the online decision tree analysis result with the real data labels to evaluate the real-time performance of the model, obtaining an online performance evaluation result.
9. The big data analysis method based on a decision tree according to claim 1, wherein the step of performing anomaly detection on the online analysis result, using an anomaly detection algorithm based on the fused decision tree model, to generate an anomaly report is specifically:
loading the pre-trained fused decision tree model as a preloaded fusion model;
performing anomaly detection on the online decision tree analysis result using the preloaded fusion model to obtain a preliminary anomaly detection result;
and labeling and classifying the abnormal data points in the preliminary anomaly detection result to obtain labeled abnormal data, and generating the anomaly report.
10. The big data analysis method based on a decision tree according to claim 1, wherein the step of using an interpretability tool, specifically SHAP, to visually display the anomaly report and the fused decision tree model while providing an interpretation of the fused decision tree model, and integrating these to generate a final report, is specifically:
loading the SHAP library and its dependent resources as a SHAP resource set;
generating an interpretation of the fused decision tree model using the SHAP resource set to obtain a fusion model interpretation;
visually displaying the anomaly report using the SHAP resource set to obtain an abnormal data visualization result;
and integrating the fusion model interpretation and the abnormal data visualization result to obtain a comprehensive analysis report.
CN202311050733.1A 2023-08-18 2023-08-18 Big data analysis method based on decision tree Withdrawn CN117056834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311050733.1A CN117056834A (en) 2023-08-18 2023-08-18 Big data analysis method based on decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311050733.1A CN117056834A (en) 2023-08-18 2023-08-18 Big data analysis method based on decision tree

Publications (1)

Publication Number Publication Date
CN117056834A true CN117056834A (en) 2023-11-14

Family

ID=88662283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311050733.1A Withdrawn CN117056834A (en) 2023-08-18 2023-08-18 Big data analysis method based on decision tree

Country Status (1)

Country Link
CN (1) CN117056834A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273670A (en) * 2023-11-23 2023-12-22 深圳市云图华祥科技有限公司 Engineering data management system with learning function
CN117349782A (en) * 2023-12-06 2024-01-05 湖南嘉创信息科技发展有限公司 Intelligent data early warning decision tree analysis method and system
CN117873837A (en) * 2024-03-11 2024-04-12 国网四川省电力公司信息通信公司 Analysis method for capacity depletion trend of storage device
CN118314379A (en) * 2024-03-29 2024-07-09 深圳市心研医疗科技有限公司 Scatter diagram classification device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273670A (en) * 2023-11-23 2023-12-22 深圳市云图华祥科技有限公司 Engineering data management system with learning function
CN117273670B (en) * 2023-11-23 2024-03-12 深圳市云图华祥科技有限公司 Engineering data management system with learning function
CN117349782A (en) * 2023-12-06 2024-01-05 湖南嘉创信息科技发展有限公司 Intelligent data early warning decision tree analysis method and system
CN117349782B (en) * 2023-12-06 2024-02-20 湖南嘉创信息科技发展有限公司 Intelligent data early warning decision tree analysis method and system
CN117873837A (en) * 2024-03-11 2024-04-12 国网四川省电力公司信息通信公司 Analysis method for capacity depletion trend of storage device
CN118314379A (en) * 2024-03-29 2024-07-09 深圳市心研医疗科技有限公司 Scatter diagram classification device

Similar Documents

Publication Publication Date Title
US11816078B2 (en) Automatic entity resolution with rules detection and generation system
CN117056834A (en) Big data analysis method based on decision tree
CN112756759B (en) Spot welding robot workstation fault judgment method
CN110019074A (en) Analysis method, device, equipment and the medium of access path
CN110780965B (en) Vision-based process automation method, equipment and readable storage medium
CN110442523B (en) Cross-project software defect prediction method
CN112069069A (en) Defect automatic positioning analysis method, device and readable storage medium
CN116662817B (en) Asset identification method and system of Internet of things equipment
CN114218998A (en) Power system abnormal behavior analysis method based on hidden Markov model
CN110717090A (en) Network public praise evaluation method and system for scenic spots and electronic equipment
CN109002810A (en) Model evaluation method, Radar Signal Recognition method and corresponding intrument
CN107016416A (en) The data classification Forecasting Methodology merged based on neighborhood rough set and PCA
CN107908807B (en) Small subsample reliability evaluation method based on Bayesian theory
CN113722719A (en) Information generation method and artificial intelligence system for security interception big data analysis
Soukup et al. Towards evaluating quality of datasets for network traffic domain
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN118296164A (en) Automatic agricultural product information acquisition and updating method and system based on knowledge graph
CN110956543A (en) Method for detecting abnormal transaction
CN111967501B (en) Method and system for judging load state driven by telemetering original data
CN102103502A (en) Method and system for analyzing a legacy system based on trails through the legacy system
CN110879821A (en) Method, device, equipment and storage medium for generating rating card model derivative label
CN111896609A (en) Method for analyzing mass spectrum data based on artificial intelligence
CN115455407A (en) Machine learning-based GitHub sensitive information leakage monitoring method
CN113722230B (en) Integrated evaluation method and device for vulnerability mining capability of fuzzy test tool
CN111382191A (en) Machine learning identification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20231114