CN117216490A - Intelligent big data acquisition system - Google Patents
- Publication number
- CN117216490A CN117216490A CN202311474709.0A CN202311474709A CN117216490A CN 117216490 A CN117216490 A CN 117216490A CN 202311474709 A CN202311474709 A CN 202311474709A CN 117216490 A CN117216490 A CN 117216490A
- Authority
- CN
- China
- Prior art keywords
- data
- abnormal
- railway
- feature
- value
- Prior art date
- Legal status (the legal status is an assumption and is not a legal conclusion)
- Granted
Abstract
The invention discloses an intelligent big data acquisition system. The system acquires railway data and preprocesses it; identifies the preprocessed railway data to obtain abnormal data and contrast data; constructs a data restoration model and optimizes it; extracts the abnormal features of the abnormal data; inputs the abnormal data and the contrast data into the data restoration model; fills the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data; and outputs the filled data as the result. The method not only improves the precision of big data acquisition but also offers better interpretability, and can be applied directly in an intelligent big data acquisition system.
Description
Technical Field
The invention relates to the field of big data, in particular to an intelligent big data acquisition system.
Background
At present, the planning and statistics informatization of the China State Railway Group remains at a stage of scattered, specialty-by-specialty development; the systematicness and standardization of decision-making still need to be perfected, leaving evident data-quality problems. Each railway business information system has been built independently and operates on its own, so information resources between systems, and even within a single system, cannot be effectively integrated. Each application system forms its own network, with isolated databases and dedicated application software; data formats, technical specifications, interface standards, log files and the like lack consistency, compatibility is poor, information exchange between systems is difficult, the degree of information sharing is low, information is hard to utilize comprehensively, and system construction and maintenance are difficult.
Acquisition technology is widely applied in the big data field and can help the managers of a big data acquisition system analyze data in a timely and efficient manner, thereby realizing the analysis and management of the data. At present, because railway data are characterized by huge information volume, varied types, high information density and the like, big data acquisition methods involve many uncertain factors and therefore carry considerable uncertainty. Although some intelligent big data acquisition methods and systems have been invented, they cannot effectively solve the uncertainty problem of big data acquisition.
Disclosure of Invention
The invention aims to provide an intelligent big data acquisition system.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
the invention comprises the following steps:
a, acquiring railway data and preprocessing the railway data;
b, identifying the preprocessed railway data to obtain abnormal data and contrast data;
c, constructing a data restoration model, and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data into a data set, taking Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise together as the generalization error of the random decision forest:

E = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² + ε²

wherein E is the generalization error, ŷᵢ is the predicted value of the i-th railway datum, yᵢ is the standard value of the railway datum, ε is the noise of the railway data, and n is the number of railway data;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
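As a numerical illustration of the generalization-error construction above, a minimal sketch assuming the error is the mean squared deviation between predicted and standard values plus a squared noise term (the function name and the assumed form are illustrative, not the patent's):

```python
def generalization_error(y_pred, y_true, noise):
    # Mean squared deviation between the predicted values and the
    # standard values, plus a squared noise term (assumed form of
    # the bias/variance/noise decomposition).
    n = len(y_pred)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n
    return mse + noise ** 2

# two railway data points, each off by 0.5, with noise 0.1
E = generalization_error([2.0, 3.0], [2.5, 3.5], 0.1)  # 0.25 + 0.01 = 0.26
```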
and D, inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.
Further, the preprocessing in the step A comprises removing repeated data, removing abnormal data, integrating data, converting data and normalizing data.
Further, the method for identifying the railway data after preprocessing comprises the following steps:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within each subsequence, wherein the m-th subsequence is x_m and its reconstruction is x̂_m;
calculating an objective function between subsequences, wherein the local dependency set is S_l and the global dependency set is S_g;
calculating an anomaly score for each subsequence, wherein the degrees to which the intra-sequence and inter-sequence reconstruction errors influence the overall error are proportional to the coefficients α, β and γ, and the anomaly score of subsequence x_m is score(x_m);
given the set of subsequences whose anomaly scores have been calculated, constructing an edge anomaly candidate set, wherein the edge anomaly candidate set is C, the set of subsequences whose anomaly scores have been calculated is D, the anomaly-degree threshold is δ, the number of manually marked samples is M, the local dependence factor is k, the center point of the edge anomaly candidate set is c, the s-th subsequence is x_s, and the number of subsequences is j;
calculating the weighted error produced by interaction, wherein the error influencing factor is λ, the weight of the subsequence produced by round t−1 of interaction is w_{t−1}, and the weighted error produced by the t-th interaction is e_t; iterating continuously until the increment of the weighted error is smaller than the given threshold, then outputting the normal data as contrast data and the anomalous data as abnormal data.
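The stopping rule above, iterating until the weighted-error increment falls below a given threshold, follows a generic fixed-point loop; a sketch, where the update function is a hypothetical stand-in for one interaction round:

```python
def iterate_weighted_error(update, e0, tol=1e-4, max_rounds=1000):
    # Repeat the weighted-error update until the increment between
    # successive rounds is smaller than the given threshold.
    prev = e0
    for _ in range(max_rounds):
        cur = update(prev)
        if abs(cur - prev) < tol:
            return cur
        prev = cur
    return prev

# toy update with fixed point 2.0: e <- 0.5 * e + 1
result = iterate_weighted_error(lambda e: 0.5 * e + 1.0, 0.0)
```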
Further, the method for optimizing the generalization capability of the data restoration model by adopting the first optimization algorithm comprises the following steps:
generating a node from the training-set data; if the railway data in the training set all belong to the same class, marking the node as a leaf node of that class; if the railway data in the training set take identical values on the attribute set, marking the node as a leaf node and labelling it with the class having the largest number of samples among the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion among the samples at the node as the node's tendency class;
obtaining the number Q of verification-set samples classified as the tendency class, initializing the count, and, for each value of the optimal splitting attribute, selecting the subset of railway data taking that value on the attribute;
if the subset is empty, taking the class with the highest proportion among the samples at the node as the tendency class of the branch node; otherwise, obtaining the number of samples in the subset classified as the tendency class; if Q is greater than or equal to that count, marking the branch node as a leaf node and labelling it with the majority class among the railway data.
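Read as a pre-pruning rule, the comparison against Q amounts to keeping a node as a leaf when splitting does not improve the verification-set classification count; a minimal sketch under that reading (function and variable names are illustrative):

```python
def keep_as_leaf(val_labels, pred_without_split, pred_with_split):
    # q: verification samples classified correctly without the split;
    # count: verification samples classified correctly with the split.
    q = sum(y == p for y, p in zip(val_labels, pred_without_split))
    count = sum(y == p for y, p in zip(val_labels, pred_with_split))
    return q >= count  # True -> mark the branch node as a leaf

leaf = keep_as_leaf(['a', 'a', 'b'], ['a', 'a', 'a'], ['a', 'b', 'a'])
```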
Further, the method for optimizing the precision of the data restoration model by adopting the second optimization algorithm comprises the following steps:
calculating the empirical entropy of the training data set:

H(w) = − Σₛ₌₁^Q (|Cₛ| / |w|) log₂(|Cₛ| / |w|)

wherein H(w) is the empirical entropy of the training data set w, Cₛ is the s-th class, and Q is the number of classes; calculating the empirical conditional entropy of a feature with respect to the training set:

H(w | v) = Σᵢ₌₁^m (|wᵢ| / |w|) H(wᵢ)

wherein wᵢ is the training-set data taking the i-th value of feature v, m is the number of values of the feature, and H(w | v) is the empirical conditional entropy of feature v on the training data set w; calculating the information gain of the training data set:

g(w, v) = H(w) − H(w | v)

wherein g(w, v) is the information gain of feature v on the training data set w; calculating the information gain ratio of the training data set:

g_R(w, v) = g(w, v) / H_v(w)

wherein g_R(w, v) is the information gain ratio of feature v on the training data set w and H_v(w) is the entropy of the training data set with respect to the values of feature v; sorting the information gain ratios of the features in descending order and selecting the feature with the largest information gain ratio as the optimal segmentation feature.
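The empirical entropy, conditional entropy, information gain and gain ratio described above are the standard C4.5 quantities; a compact sketch:

```python
import math
from collections import Counter

def entropy(labels):
    # empirical entropy H(w) of a list of class labels
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(feature_values, labels):
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # empirical conditional entropy H(w|v)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    gain = entropy(labels) - cond          # information gain g(w, v)
    split_info = entropy(feature_values)   # entropy of the feature's values
    return gain / split_info if split_info > 0 else 0.0

# a feature that splits the classes perfectly has gain ratio 1
ratio = info_gain_ratio(['x', 'x', 'y', 'y'], ['a', 'a', 'b', 'b'])
```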
Further, the method for extracting the abnormal characteristics of the abnormal data comprises the following steps:
setting the population size, initializing the population and the solution space, and calculating the fitness function, wherein c is the number of classified feature subsets, the subset mean vector of the i-th class is mᵢ, the mean vector of the whole feature set is m, the vector of feature j in class i is x_{ij}, the number of features j is nⱼ, and the feature set of the i-th class is Fᵢ; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data, wherein the degree of difference of any two abnormal data in the j-th feature dimension is dⱼ, the maximum value of the weight of feature j is w_max, and the minimum value is w_min; calculating the selection probability of the abnormal features of the abnormal data, wherein the weight value of feature j is wⱼ and the fitness function value of the t-th-generation individual is fₜ; selecting two individuals from the updated population according to the selection probability for cross recombination to obtain the next-generation population, and extracting the feature with the maximum fitness as an important feature;
comparing the fitness value of the important feature with a fitness threshold; if it is larger than the threshold, terminating the iteration and outputting the top 3 features by fitness rank as the abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature exceeds the threshold;
and deleting the remaining abnormal data that cannot be filled.
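The weight adjustment and selection-probability steps can be sketched as min-max scaling of feature weights followed by roulette-wheel selection (both function names are illustrative; the patent's exact fitness formula is not reproduced here):

```python
def minmax_scale(weights):
    # rescale raw feature weights into [0, 1] using the minimum and
    # maximum weight values of the feature
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]

def selection_probabilities(fitness_values):
    # roulette-wheel selection: each individual's probability is
    # proportional to its fitness function value
    total = sum(fitness_values)
    return [f / total for f in fitness_values]

scaled = minmax_scale([2.0, 4.0, 6.0])
probs = selection_probabilities([1.0, 3.0])
```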
Further, the method for filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data comprises the following steps:
randomly selecting N records of abnormal data to form a training set T with d features; each time, k (k < d) features are made available to the decision trees;
for each decision tree, randomly drawing n records with replacement from the training set T as the samples at its root node;
each sample has M attributes; when a node of the decision tree needs to split, randomly selecting m attributes from the M attributes (with m << M) and choosing 1 of the m attributes as the splitting attribute of the node by information gain;
repeating the attribute selection until no further split is possible; combining the constructed decision trees into a random forest and obtaining the classification result from the trees' split results;
a. Filling a feature with many missing values
taking the non-missing values of feature M as the training labels Y_train and the corresponding n−1 other features as the training features X_train, and building a random forest regression tree for training; taking the n−1 features corresponding to the missing values of feature M as the test set X_test and predicting with the trained model, finally obtaining the predicted values for the missing entries of feature M;
b. Filling data with multiple missing features
traversing all the features and filling from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after each regression prediction is completed, putting the predicted values back into the original feature matrix and continuing to fill the next feature.
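The multi-feature filling loop (traverse features from fewest missing values, predict, write back into the feature matrix) can be sketched as follows; for self-containment the column mean stands in for the random-forest regressor:

```python
def fill_missing(rows):
    # rows: list of records; None marks a missing value.
    n_feats = len(rows[0])
    # traverse features starting from the one with the fewest missing values
    order = sorted(range(n_feats),
                   key=lambda j: sum(r[j] is None for r in rows))
    for j in order:
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            continue
        prediction = sum(observed) / len(observed)  # stand-in regressor
        for r in rows:
            if r[j] is None:
                r[j] = prediction  # write back into the feature matrix
    return rows

filled = fill_missing([[1.0, None], [3.0, 4.0], [None, 6.0]])
```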
In a second aspect, an intelligent big data acquisition system comprises:
and a data acquisition module: the method comprises the steps of acquiring railway data and preprocessing the railway data;
and a data analysis module: the method comprises the steps of identifying the preprocessed railway data to obtain abnormal data and contrast data;
modeling optimization module: the method comprises the steps of constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data into a data set, taking Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise together as the generalization error of the random decision forest:

E = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² + ε²

wherein E is the generalization error, ŷᵢ is the predicted value of the i-th railway datum, yᵢ is the standard value of the railway datum, ε is the noise of the railway data, and n is the number of railway data;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
and the data filling module is used for: the method comprises the steps of inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.
The beneficial effects of the invention are as follows:
Compared with the prior art, the intelligent big data acquisition method of the invention has the following technical effects:
through preprocessing, modeling, model optimization, data processing, feature extraction and data filling, the method improves the accuracy of intelligent big data acquisition management, realizes automatic analysis and management of big data, and can perform feature extraction and data filling on railway data in real time, which is of great significance for intelligent big data acquisition management; it adapts to intelligent big data acquisition management under different standards and in different systems, and therefore has a certain universality.
Drawings
FIG. 1 is a flow chart of steps of an intelligent big data acquisition method of the present invention.
Detailed Description
The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.
The invention discloses an intelligent big data acquisition method which comprises the following steps:
as shown in fig. 1, in this embodiment, the steps include:
a, acquiring railway data and preprocessing the railway data;
in the actual evaluation, railroad mileage data in units of years is given:
631 km in 2008, missing in 2009, 1828 km in 2010, 9999 km in 2011, 3084 km in 2012, 3559.1234 km in 2013, 10000 km in 2014, missing in 2015, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 4285.5678 km in 2019, 2520 km in 2020, 2149 km in 2021;
b, identifying the preprocessed railway data to obtain abnormal data and contrast data;
in actual evaluation, the anomaly data is: 9999 km in 2011, 10000 km in 2014, 4285.5678 km in 2019, 2009 and 2015;
the comparative data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021;
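The identification of the example years as abnormal versus contrast data can be approximated with a simple rule: flag missing years and values above three times the median (an assumed threshold). This reproduces the missing years and the 9999/10000 sentinel values, though not the decimal-noise case of 2019:

```python
import statistics

mileage = {2008: 631, 2009: None, 2010: 1828, 2011: 9999, 2012: 3084,
           2013: 3559.1234, 2014: 10000, 2015: None, 2016: 2337,
           2017: 1856, 2018: 4050, 2019: 4285.5678, 2020: 2520, 2021: 2149}

observed = [v for v in mileage.values() if v is not None]
cutoff = 3 * statistics.median(observed)  # assumed outlier threshold
abnormal = sorted(y for y, v in mileage.items() if v is None or v > cutoff)
contrast = sorted(y for y in mileage if y not in abnormal)
```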
c, constructing a data restoration model, and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data into a data set, taking Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise together as the generalization error of the random decision forest:

E = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² + ε²

wherein E is the generalization error, ŷᵢ is the predicted value of the i-th railway datum, yᵢ is the standard value of the railway datum, ε is the noise of the railway data, and n is the number of railway data;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
inputting the abnormal data and the contrast data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the contrast data, and outputting the filled data as a result;
in the actual evaluation, the data filled in 2009 is 2345 km, and the data filled in 2015 is 4748 km.
In this embodiment, the preprocessing in step a includes removing duplicate data, removing anomalous data, data integration, data conversion, and data normalization.
In this embodiment, the method for identifying the railway data after preprocessing includes:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within each subsequence, wherein the m-th subsequence is x_m and its reconstruction is x̂_m;
calculating an objective function between subsequences, wherein the local dependency set is S_l and the global dependency set is S_g;
calculating an anomaly score for each subsequence, wherein the degrees to which the intra-sequence and inter-sequence reconstruction errors influence the overall error are proportional to the coefficients α, β and γ, and the anomaly score of subsequence x_m is score(x_m);
given the set of subsequences whose anomaly scores have been calculated, constructing an edge anomaly candidate set, wherein the edge anomaly candidate set is C, the set of subsequences whose anomaly scores have been calculated is D, the anomaly-degree threshold is δ, the number of manually marked samples is M, the local dependence factor is k, the center point of the edge anomaly candidate set is c, the s-th subsequence is x_s, and the number of subsequences is j;
calculating the weighted error produced by interaction, wherein the error influencing factor is λ, the weight of the subsequence produced by round t−1 of interaction is w_{t−1}, and the weighted error produced by the t-th interaction is e_t; iterating continuously until the increment of the weighted error is smaller than the given threshold, then stopping and outputting the normal data as contrast data and the anomalous data as abnormal data;
in actual evaluation, the anomaly data is: 9999 km in 2011, 10000 km in 2014, 4285.5678 km in 2019, 2009 and 2015;
the comparative data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021.
In this embodiment, a method for optimizing generalization capability of a data repair model by using a first optimization algorithm includes:
generating a node from the training-set data; if the railway data in the training set all belong to the same class, marking the node as a leaf node of that class; if the railway data in the training set take identical values on the attribute set, marking the node as a leaf node and labelling it with the class having the largest number of samples among the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion among the samples at the node as the node's tendency class;
obtaining the number Q of verification-set samples classified as the tendency class, initializing the count, and, for each value of the optimal splitting attribute, selecting the subset of railway data taking that value on the attribute;
if the subset is empty, taking the class with the highest proportion among the samples at the node as the tendency class of the branch node; otherwise, obtaining the number of samples in the subset classified as the tendency class; if Q is greater than or equal to that count, marking the branch node as a leaf node and labelling it with the majority class among the railway data.
In this embodiment, the method for optimizing the accuracy of the data repair model by using the second optimization algorithm includes:
calculating the empirical entropy of the training data set:

H(w) = − Σₛ₌₁^Q (|Cₛ| / |w|) log₂(|Cₛ| / |w|)

wherein H(w) is the empirical entropy of the training data set w, Cₛ is the s-th class, and Q is the number of classes; calculating the empirical conditional entropy of a feature with respect to the training set:

H(w | v) = Σᵢ₌₁^m (|wᵢ| / |w|) H(wᵢ)

wherein wᵢ is the training-set data taking the i-th value of feature v, m is the number of values of the feature, and H(w | v) is the empirical conditional entropy of feature v on the training data set w; calculating the information gain of the training data set:

g(w, v) = H(w) − H(w | v)

wherein g(w, v) is the information gain of feature v on the training data set w; calculating the information gain ratio of the training data set:

g_R(w, v) = g(w, v) / H_v(w)

wherein g_R(w, v) is the information gain ratio of feature v on the training data set w and H_v(w) is the entropy of the training data set with respect to the values of feature v; sorting the information gain ratios of the features in descending order and selecting the feature with the largest information gain ratio as the optimal segmentation feature.
In this embodiment, the method for extracting the abnormal feature of the abnormal data includes:
setting the population size, initializing the population and the solution space, and calculating the fitness function, wherein c is the number of classified feature subsets, the subset mean vector of the i-th class is mᵢ, the mean vector of the whole feature set is m, the vector of feature j in class i is x_{ij}, the number of features j is nⱼ, and the feature set of the i-th class is Fᵢ; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data, wherein the degree of difference of any two abnormal data in the j-th feature dimension is dⱼ, the maximum value of the weight of feature j is w_max, and the minimum value is w_min; calculating the selection probability of the abnormal features of the abnormal data, wherein the weight value of feature j is wⱼ and the fitness function value of the t-th-generation individual is fₜ; selecting two individuals from the updated population according to the selection probability for cross recombination to obtain the next-generation population, and extracting the feature with the maximum fitness as an important feature;
comparing the fitness value of the important feature with a fitness threshold; if it is larger than the threshold, terminating the iteration and outputting the top 3 features by fitness rank as the abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature exceeds the threshold;
deleting the remaining abnormal data that cannot be filled;
in the actual evaluation, the extracted features are:
outliers: data in 2011 is 9999 km, which is far higher than data in other years, and is regarded as an abnormal value;
outliers: data in 2014 is 10000 ten thousand km, which is far higher than data in other years, and is regarded as an outlier;
noise data: the data in 2019 contains values after decimal points, and noise exists compared with the data in other years;
missing values: data were missing in 2009 and 2015, with no values available;
the processed output data are: missing values in 2009 and 2015; 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021.
In this embodiment, the method for filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data comprises:
randomly selecting N records of abnormal data to form a training set T with d features; each time, k (k < d) features are made available to the decision trees;
for each decision tree, randomly drawing n records with replacement from the training set T as the samples at its root node;
each sample has M attributes; when a node of the decision tree needs to split, randomly selecting m attributes from the M attributes (with m << M) and choosing 1 of the m attributes as the splitting attribute of the node by information gain;
repeating the attribute selection until no further split is possible; combining the constructed decision trees into a random forest and obtaining the classification result from the trees' split results;
a. Filling a feature with many missing values
taking the non-missing values of feature M as the training labels Y_train and the corresponding n−1 other features as the training features X_train, and building a random forest regression tree for training; taking the n−1 features corresponding to the missing values of feature M as the test set X_test and predicting with the trained model, finally obtaining the predicted values for the missing entries of feature M;
b. Filling data with multiple missing features
traversing all the features and filling from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after each regression prediction is completed, putting the predicted values back into the original feature matrix and continuing to fill the next feature;
in the actual evaluation, the data filled in 2009 is 2345 km, and the data filled in 2015 is 4748 km.
In a second aspect, an intelligent big data acquisition system comprises:
and a data acquisition module: the method comprises the steps of acquiring railway data and preprocessing the railway data;
and a data analysis module: the method comprises the steps of identifying the preprocessed railway data to obtain abnormal data and contrast data;
modeling optimization module: the method comprises the steps of constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data into a data set, taking Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise together as the generalization error of the random decision forest:

E = (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² + ε²

wherein E is the generalization error, ŷᵢ is the predicted value of the i-th railway datum, yᵢ is the standard value of the railway datum, ε is the noise of the railway data, and n is the number of railway data;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
and the data filling module is used for: the method comprises the steps of inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise forms disclosed; any modifications, equivalent substitutions and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
Claims (8)
1. The intelligent big data acquisition method is characterized by comprising the following steps of:
a, acquiring railway data and preprocessing the railway data;
b, identifying the preprocessed railway data to obtain abnormal data and comparison data;
c, constructing a data restoration model, and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data into a data set, taking Boosting as the integration strategy of the random decision forest, and taking bias, variance and noise as the components of the generalization error of the random decision forest:
E = (1/n)·∑_{i=1}^{n}(ŷ_i − y_i)² + ε²
wherein the generalization error is E, the predicted value of the i-th railway data is ŷ_i, the standard value of the railway data is y_i, the noise of the railway data is ε, and the number of railway data is n;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
and D, inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.
2. The intelligent big data collection method according to claim 1, wherein the preprocessing in the step a includes removing duplicate data, removing abnormal data, integrating data, converting data, and normalizing data.
3. The intelligent big data acquisition method according to claim 1, wherein the method for identifying the preprocessed railway data comprises the following steps:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within each subsequence:
wherein the m-th subsequence is X_m and the reconstruction of the subsequence X_m is X̂_m; calculating an objective function between subsequences:
wherein the local dependency set is L and the global dependency set is G; calculating an anomaly score for each subsequence:
wherein the degrees of influence of the intra-sequence and inter-sequence abnormal reconstruction errors on the overall error are given by proportional weighting coefficients, and the anomaly score of subsequence X_m is A(X_m); given the set of subsequences for which anomaly scores have been calculated, an edge anomaly candidate set is obtained:
wherein the edge anomaly candidate set is C, the set of subsequences for which the anomaly score has been calculated is S, the anomaly score of a subsequence X_s in S is A(X_s), the anomaly-degree threshold is δ, the number of manually marked samples is M, the local dependence factor is k, the centre point of the edge anomaly candidate set is c, the s-th subsequence is X_s, and the number of subsequences is j; calculating the weighted error generated by the interaction:
wherein the error influence factor is η, the weight of the subsequence generated by the (t−1)-th round of interaction is w_{t−1}, and the weighted error produced by the t-th interaction is e_t; iterating continuously until the weighted-error increment is smaller than a given threshold, and outputting the normal data as comparison data and the remaining data as abnormal data.
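A minimal sketch of the subsequence scoring step of claim 3, with hypothetical names and a stand-in reconstruction (each subsequence is "reconstructed" by its own mean, in place of the learned dependency model of the claim):

```python
def split_subsequences(series, w):
    """Divide the railway series into non-overlapping subsequences of length w."""
    return [series[i:i + w] for i in range(0, len(series) - w + 1, w)]

def reconstruction_error(sub, recon):
    """Mean squared error between a subsequence and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(sub, recon)) / len(sub)

def anomaly_scores(series, w=3):
    """Score each subsequence by its reconstruction error; the mean-based
    reconstruction here is only a placeholder for the dependency model."""
    scores = []
    for sub in split_subsequences(series, w):
        mean = sum(sub) / len(sub)
        scores.append(reconstruction_error(sub, [mean] * len(sub)))
    return scores

scores = anomaly_scores([1, 1, 1, 1, 9, 1, 1, 1, 1], w=3)
```

On this toy series the middle subsequence contains the spike and receives the largest score, so it would fall above the anomaly-degree threshold while the flat subsequences become comparison data.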
4. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the generalization ability of the data restoration model by adopting the first optimization algorithm comprises the following steps:
generating a branch node from the training-set data; if the railway data in the training set all belong to the same class, marking the node as a leaf node of that class; if the values of the railway data on the attribute set are identical, marking the node as a leaf node and labelling it with the class that has the largest number of samples in the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion among the samples at the node as the preferred class of the node;
obtaining the number of verification-set samples classified as the preferred class as Q, initialising a counter, and for each value of the optimal splitting attribute, selecting the subset of railway data taking that value on the attribute;
if the subset is empty, taking the class with the highest proportion among the samples at the node as the preferred class of the node; otherwise obtaining the number of subset samples classified as the preferred class; if Q is greater than or equal to the counted number, marking the branch node as a leaf node and labelling it with the class that has the most samples in the railway data.
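The leaf-versus-split decision of claim 4 can be sketched as follows; the criterion (comparing the verification count Q for the majority class against the count obtained after splitting) is a simplification, and all names are hypothetical:

```python
from collections import Counter

def prune_check(node_labels, val_labels, correct_after_split):
    """Validation-based pre-pruning sketch: Q counts verification samples
    matching the node's majority class; keep the node as a leaf when Q is
    at least the number classified correctly after splitting."""
    majority = Counter(node_labels).most_common(1)[0][0]
    q = sum(1 for y in val_labels if y == majority)
    return "leaf" if q >= correct_after_split else "split"
```

With node samples ['a', 'a', 'b'] and verification labels ['a', 'b', 'a', 'a'], Q = 3, so a split that classifies only 2 verification samples correctly is rejected and the node stays a leaf.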
5. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the accuracy of the data restoration model by using the second optimization algorithm comprises the following steps:
calculating the empirical entropy of the training data set:
H(w) = −∑_{s=1}^{Q} (|C_s|/|w|)·log₂(|C_s|/|w|)
wherein the empirical entropy of the training data set w is H(w), the s-th class is C_s, and the number of classes is Q; calculating the empirical conditional entropy of a feature on the training set:
H(w|v) = ∑_{i=1}^{m} (|w_i|/|w|)·H(w_i) = −∑_{i=1}^{m} (|w_i|/|w|) ∑_{s=1}^{Q} (|w_{is}|/|w_i|)·log₂(|w_{is}|/|w_i|)
wherein w_i is the subset of the training set on which feature v takes its i-th value, the number of values of feature v is m, the samples of w_i belonging to the s-th class are w_{is}, and the empirical conditional entropy of feature v on the training data set w is H(w|v); calculating the information gain of the training data set:
g(w, v) = H(w) − H(w|v)
wherein the information gain of feature v on the training data set w is g(w, v); calculating the information gain ratio of the training data set:
g_R(w, v) = g(w, v) / H_v(w),  H_v(w) = −∑_{i=1}^{m} (|w_i|/|w|)·log₂(|w_i|/|w|)
wherein the information gain ratio of feature v on the training data set w is g_R(w, v) and H_v(w) is the entropy of the training data set with respect to the values of feature v; the information gain ratios of the features are sorted in descending order, and the feature with the largest information gain ratio is selected as the optimal splitting feature.
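The entropy, conditional entropy, information gain, and gain-ratio quantities of claim 5 can be sketched in the standard way (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(w) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(w|v) of the labels given a feature column."""
    n = len(labels)
    h = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        h += len(subset) / n * entropy(subset)
    return h

def info_gain(feature, labels):
    """Information gain g(w, v) = H(w) - H(w|v)."""
    return entropy(labels) - conditional_entropy(feature, labels)

def gain_ratio(feature, labels):
    """Gain ratio: information gain divided by the entropy of the feature values."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info else 0.0
```

A feature that separates the classes perfectly, e.g. feature column ['a', 'a', 'b', 'b'] against labels ['y', 'y', 'n', 'n'], has information gain 1.0 and gain ratio 1.0, so it would be ranked first as the optimal splitting feature.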
6. The intelligent big data acquisition method according to claim 1, wherein the method for extracting the abnormal characteristics of the abnormal data comprises the steps of:
setting the population size, initialising the population and the solution space, and calculating the fitness function according to the following formula:
wherein the number of classified feature subsets is c, the subset mean vector of the i-th class is m_i, the mean vector of the feature set is m̄, the vector of feature j in class i is x_{ij}, the number of features j is n_j, and the feature set of the i-th class is S_i; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data:
wherein the degree of difference of any two abnormal data in the j-th feature dimension is d_j, the difference values of the individual abnormal data in the j-th feature dimension are given by their feature values, the maximum value of the feature weight is w_max, and the minimum value of the feature weight is w_min; calculating the selection probability of the abnormal features of the abnormal data:
wherein the weight value of the feature is w_j and the fitness function value of the t-th generation individual is f_t; selecting two individuals from the updated population according to the selection probability for crossover and recombination to obtain a new generation of the population, and extracting the feature with the maximum fitness as an important feature;
comparing the fitness value of the important feature with a fitness threshold; if the fitness value of the important feature is larger than the fitness threshold, terminating the iteration and outputting the top 3 features by fitness rank as the abnormal features; otherwise recalculating the fitness value until the fitness value of the important feature exceeds the fitness threshold;
and deleting the remaining abnormal data that cannot be filled.
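A simplified stand-in for the fitness-driven feature extraction of claim 6: instead of a full genetic search, each feature is scored by the squared distances of its class means from the overall mean (a class-separability measure consistent with the subset-mean and feature-set-mean vectors above), and the top-ranked features are returned. The function name and the reduction to a direct ranking are assumptions:

```python
def rank_features(X, y, top=3):
    """Rank features by class separability: sum over classes of the squared
    distance between the class mean and the overall mean of that feature.
    X: list of samples (each a list of feature values); y: class labels."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        col = [row[j] for row in X]
        overall = sum(col) / len(col)
        score = 0.0
        for c in set(y):
            members = [row[j] for row, lab in zip(X, y) if lab == c]
            score += (sum(members) / len(members) - overall) ** 2
        scores.append((score, j))
    scores.sort(reverse=True)          # most separable features first
    return [j for _, j in scores[:top]]
```

For samples [[0, 5], [0, 5], [10, 5], [10, 5]] with labels [0, 0, 1, 1], feature 0 separates the classes while feature 1 is constant, so feature 0 ranks first.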
7. The intelligent big data acquisition method according to claim 1, wherein the method for filling the abnormal data by a random forest algorithm according to the abnormal features and the comparison data comprises the following steps:
randomly selecting N pieces of abnormal data to form a training set T, wherein the training set has d features and k (k < d) features are selected each time;
randomly sampling N pieces of abnormal data with replacement from the training set T to train each decision tree, these samples forming the root node of the decision tree;
when each sample has M attributes and a node of the decision tree needs to be split, randomly selecting m attributes from the M attributes, where m << M, and selecting 1 of the m attributes as the splitting attribute of the node by information gain;
repeating the selection of splitting attributes until no further split is possible, combining the constructed decision trees into a random forest, and obtaining the classification result from the combined trees;
a. filling a feature with a large number of missing values:
taking the non-missing values of feature M as the training labels Y_train and the corresponding other n−1 features as the training features X_train, and building a random forest regression tree for training; taking the n−1 features corresponding to the missing values of feature M as the test set X_test and predicting with the trained model, finally obtaining the predicted values for the missing values of feature M;
b. filling multiple features with missing data:
traversing all features and filling from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after one regression prediction is completed, putting the predicted values back into the original feature matrix and continuing to fill the next feature.
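The traversal order of step b can be sketched as follows; the column mean stands in for the random forest regression predictor, so the temporary 0/mode replacement of the other features is omitted (an assumed simplification, with hypothetical names):

```python
def fill_missing(rows):
    """Fill missing values (None) feature by feature, starting from the
    feature with the fewest gaps; each predicted value is written back into
    the matrix before the next feature is filled. The column mean is a
    stand-in for the random forest regression of the claim."""
    n_feat = len(rows[0])
    order = sorted(range(n_feat),
                   key=lambda j: sum(1 for r in rows if r[j] is None))
    for j in order:
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            continue                      # nothing to learn from; leave gaps
        fill = sum(observed) / len(observed)
        for r in rows:
            if r[j] is None:
                r[j] = fill               # write prediction back into matrix
    return rows

filled = fill_missing([[1.0, None], [3.0, 4.0], [None, 6.0]])
```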
8. An intelligent big data acquisition system, comprising:
a data acquisition module: configured to acquire railway data and preprocess the railway data;
a data analysis module: configured to identify the preprocessed railway data to obtain abnormal data and comparison data;
a modeling optimization module: configured to construct a data restoration model and optimize the data restoration model; the method for constructing the data restoration model comprises:
constructing the preprocessed railway data into a data set, taking Boosting as the integration strategy of the random decision forest, and taking bias, variance and noise as the components of the generalization error of the random decision forest:
E = (1/n)·∑_{i=1}^{n}(ŷ_i − y_i)² + ε²
wherein the generalization error is E, the predicted value of the i-th railway data is ŷ_i, the standard value of the railway data is y_i, the noise of the railway data is ε, and the number of railway data is n;
inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;
a data filling module: configured to input the abnormal data and the comparison data into the data restoration model, extract the abnormal features of the abnormal data, fill the abnormal data by a random decision forest algorithm according to the abnormal features and the comparison data, and output the filled data as the result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311474709.0A CN117216490B (en) | 2023-11-08 | 2023-11-08 | Intelligent big data acquisition system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117216490A true CN117216490A (en) | 2023-12-12 |
CN117216490B CN117216490B (en) | 2024-01-19 |
Family
ID=89035674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311474709.0A Active CN117216490B (en) | 2023-11-08 | 2023-11-08 | Intelligent big data acquisition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117216490B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113744869A (en) * | 2021-09-07 | 2021-12-03 | 中国医科大学附属盛京医院 | Method for establishing early screening of light chain amyloidosis based on machine learning and application thereof |
CN114169631A (en) * | 2021-12-15 | 2022-03-11 | 中国石油大学胜利学院 | Oil field power load management and control system based on data analysis |
CN114238293A (en) * | 2021-12-01 | 2022-03-25 | 国网福建省电力有限公司莆田供电公司 | Transformer oil paper insulation FDS data restoration method based on random forest |
US20220292239A1 (en) * | 2021-03-15 | 2022-09-15 | KuantSol Inc. | Smart time series and machine learning end-to-end (e2e) model development enhancement and analytic software |
CN115420690A (en) * | 2022-04-29 | 2022-12-02 | 中遥环境(西安)股份有限公司 | Near-surface trace gas concentration inversion model and inversion method |
CN116316599A (en) * | 2023-03-28 | 2023-06-23 | 广东电网有限责任公司东莞供电局 | Intelligent electricity load prediction method |
Non-Patent Citations (1)
Title |
---|
TANG Hongtao et al., "Dynamic scheduling of flexible job shops based on industrial big data", Computer Integrated Manufacturing Systems, vol. 26, no. 9, pp. 2497–2510 *
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||