CN117216490A - Intelligent big data acquisition system - Google Patents

Info

Publication number
CN117216490A
Authority
CN
China
Prior art keywords
data
abnormal
railway
feature
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311474709.0A
Other languages
Chinese (zh)
Other versions
CN117216490B (en)
Inventor
阎胜勇
郑慧亚
甄津
常灿
田珊
凡凯乐
李生杰
王少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd
Original Assignee
Institute of Computing Technologies of CARS
Beijing Jingwei Information Technology Co Ltd

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an intelligent big data acquisition system whose method comprises: acquiring railway data and preprocessing the railway data; identifying the preprocessed railway data to obtain abnormal data and contrast data; constructing a data restoration model and optimizing the data restoration model; inputting the abnormal data and the contrast data into the data restoration model; extracting the abnormal features of the abnormal data; filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data; and outputting the filled data as the result. The method improves the precision of big data acquisition, offers better interpretability, and can be applied directly in an intelligent big data acquisition system.

Description

Intelligent big data acquisition system
Technical Field
The invention relates to the field of big data, in particular to an intelligent big data acquisition system.
Background
At present, planning and statistics informatization at China State Railway Group is still fragmented along professional divisions, and decision-making is not yet systematic or standardized. Data quality problems are evident. Each railway business information system has been built and organized independently, so information resources cannot be effectively integrated between systems or even within a single system. Each application system forms its own network, with an isolated database and special-purpose application software; data formats, technical specifications, interface standards, log files and the like lack consistency and are poorly compatible. As a result, information exchange between systems is difficult, the degree of information sharing is low, information is hard to use comprehensively, and system construction and maintenance are difficult.
Acquisition technology is widely applied in the field of big data and helps the managers of a big data acquisition system analyze data in a timely and efficient manner, thereby realizing analysis and management of the data. Because railway data are huge in volume, varied in type and high in information density, existing big data acquisition methods are subject to many uncertain factors and therefore carry considerable uncertainty. Although some intelligent big data acquisition methods and systems have been invented, they do not effectively solve this uncertainty problem.
Disclosure of Invention
The invention aims to provide an intelligent big data acquisition system.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
the invention comprises the following steps:
A, acquiring railway data and preprocessing the railway data;
B, identifying the preprocessed railway data to obtain abnormal data and contrast data;
C, constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing a data set from the preprocessed railway data, using Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein the generalization error is E, the number of railway data is n, and the formula further involves the predicted value of the i-th railway data, the standard value of the railway data and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
D, inputting the abnormal data and the contrast data into the data restoration model, extracting the abnormal features of the abnormal data, filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data, and outputting the filled data as the result.
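The generalization error formula itself is not reproduced in this text. For orientation only, the textbook decomposition of squared error into bias, variance and noise, which matches the three terms listed in step C, can be written as below. The symbols $\hat{y}_i$ (predicted value), $\bar{y}_i$ (expected prediction), $y_i$ (standard value) and $\varepsilon_i$ (noise) are assumed notation and are not taken from the patent.

```latex
E \;=\; \frac{1}{n}\sum_{i=1}^{n}
\Big[
\underbrace{(\bar{y}_i - y_i)^2}_{\text{bias}^2}
\;+\;
\underbrace{\mathbb{E}\big[(\hat{y}_i - \bar{y}_i)^2\big]}_{\text{variance}}
\;+\;
\underbrace{\varepsilon_i^{2}}_{\text{noise}}
\Big],
\qquad
\bar{y}_i = \mathbb{E}[\hat{y}_i]
```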
Further, the preprocessing in the step A comprises removing repeated data, removing abnormal data, integrating data, converting data and normalizing data.
Further, the method for identifying the railway data after preprocessing comprises the following steps:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within each subsequence:
wherein the formula involves the m-th subsequence and the reconstruction of that subsequence; calculating an objective function between subsequences:
wherein the formula involves the local dependency set and the global dependency set; calculating an anomaly score for each subsequence:
wherein the formula involves three proportions describing the degree of influence of the intra-sequence and inter-sequence abnormal reconstruction errors on the overall error, and the anomaly score of the subsequence; given the set of subsequences for which anomaly scores have been calculated, determining an edge anomaly candidate set:
wherein the formula involves the edge anomaly candidate set, the set of subsequences for which anomaly scores have been calculated, the anomaly score of each subsequence in that set, the abnormality degree threshold, the number of manually labeled samples M, the local dependence factor k, the center point of the edge anomaly candidate set, the s-th subsequence, and the number of subsequences j; calculating the weighted error produced by interaction:
wherein the formula involves the error influencing factor, the weight of the subsequence produced by the (t-1)-th round of interaction, and the weighted error produced by the t-th round of interaction; iterating until the increment of the weighted error is smaller than a given threshold, and then outputting the data identified as anomalous as abnormal data and the remaining data as contrast data.
Further, the method for optimizing the generalization capability of the data restoration model by adopting the first optimization algorithm comprises the following steps:
generating a node from the data of the training set; if all railway data in the training set belong to the same class, marking the node as a leaf node of that class; if all railway data in the training set take the same values on the attribute set, marking the node as a leaf node whose class is the class with the largest number of samples in the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion in the node as the tendency class of the node;
obtaining, as Q, the number of samples in the validation set that are classified as the tendency class, initializing a count, and, for each value of the optimal splitting attribute, selecting the subset of the training railway data that takes this value;
selecting the subset of the validation set that takes this value of the optimal splitting attribute; if the training subset is empty, taking the class with the highest proportion in the node as the tendency class of the branch; otherwise, obtaining the number of samples in the validation subset that are classified as the tendency class of the branch and accumulating it in the count; if Q is greater than or equal to the count, marking the branch node as a leaf node whose class is the most frequent class in the railway data.
Further, the method for optimizing the precision of the data restoration model by adopting the second optimization algorithm comprises the following steps:
calculating the empirical entropy of the training data set:
wherein the formula involves the empirical entropy of the training data set w, the s-th class, and the number of classes Q; calculating the empirical conditional entropy of a feature with respect to the training set:
wherein the formula involves the training set data of the i-th feature v, the number of features m, the s-th class of the i-th feature, and the empirical conditional entropy of the feature v on the training data set w; calculating the information gain on the training data set:
wherein the formula gives the information gain of the feature v on the training data set w; calculating the information gain ratio on the training data set:
wherein the formula gives the information gain ratio of the feature v on the training data set w; the information gain ratios of the features are sorted in descending order, and the feature with the largest information gain ratio is selected as the optimal splitting feature.
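The four formulas of the second optimization algorithm are not reproduced in this text. The quantities named (empirical entropy, empirical conditional entropy, information gain, information gain ratio) have standard definitions, and a form consistent with the symbol descriptions above would read as follows. The notation $C_s$, $w_i$, $w_{is}$, $g$, $g_R$ and $H_v$ is assumed here and may differ from the patent's own symbols.

```latex
\begin{aligned}
H(w) &= -\sum_{s=1}^{Q} \frac{|C_s|}{|w|}\,\log_2 \frac{|C_s|}{|w|} \\
H(w \mid v) &= \sum_{i=1}^{m} \frac{|w_i|}{|w|}\, H(w_i)
  = -\sum_{i=1}^{m} \frac{|w_i|}{|w|} \sum_{s=1}^{Q} \frac{|w_{is}|}{|w_i|}\,\log_2 \frac{|w_{is}|}{|w_i|} \\
g(w, v) &= H(w) - H(w \mid v) \\
g_R(w, v) &= \frac{g(w, v)}{H_v(w)}, \qquad
H_v(w) = -\sum_{i=1}^{m} \frac{|w_i|}{|w|}\,\log_2 \frac{|w_i|}{|w|}
\end{aligned}
```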
Further, the method for extracting the abnormal characteristics of the abnormal data comprises the following steps:
setting a population size, initializing the population and the understanding space, and calculating a fitness function according to the following formula:
wherein the number of classified feature subsets is c, and the formula further involves the subset mean vector of the i-th class, the mean vector of the feature set, the vector of feature j of class i, the number of features j, and the feature set of the i-th class; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data:
wherein the formula involves the degree of difference between any two abnormal data in the j-th feature dimension, the difference values of the individual abnormal data in the dimension of feature j, and the maximum and minimum values of the feature weight; calculating the selection probability of the abnormal features of the abnormal data:
wherein the formula involves the weight value of the feature and the fitness function value of the t-th generation individual; selecting two individuals from the updated population according to the selection probability and performing crossover to obtain a new-generation population, and extracting the feature with the maximum fitness as the important feature;
comparing the fitness value of the important feature with a fitness threshold; if the fitness value of the important feature is larger than the fitness threshold, terminating the iteration and outputting the top 3 features ranked by fitness as the abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature is larger than the fitness threshold;
and deleting the remaining abnormal data that cannot be filled.
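The selection, crossover and termination steps above can be sketched as a small genetic loop. This is a minimal illustration only: the patent's fitness and weight formulas are not reproduced in this text, so the fitness function is supplied by the caller, and `evolve`, `toy_fitness` and `TARGET` are hypothetical names. Selection is plain roulette-wheel selection and recombination is one-point crossover, which are assumptions about how "selection probability" and "cross recombination" are realized.

```python
import random


def evolve(n_features, fitness, threshold, pop_size=20, max_gen=50, seed=0):
    """Return the fittest feature mask found (a list of 0/1 flags).

    Assumes fitness(mask) >= 0 and n_features >= 2.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(max_gen):
        scores = [fitness(ind) for ind in pop]
        if max(scores) > threshold:
            break  # fitness threshold exceeded: stop iterating
        # Roulette-wheel selection; the small offset keeps all weights positive.
        weights = [s + 1e-9 for s in scores]
        children = []
        while len(children) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, n_features)  # one-point crossover position
            children.append(a[:cut] + b[cut:])
        pop = children
    return max(pop, key=fitness)


# Hypothetical fitness: fraction of flags agreeing with a made-up target mask.
TARGET = [1, 0, 1, 1, 0, 0]


def toy_fitness(mask):
    return sum(m == t for m, t in zip(mask, TARGET)) / len(TARGET)


best = evolve(len(TARGET), toy_fitness, threshold=0.99)
```

A fixed seed keeps the run reproducible; the threshold plays the role of the fitness threshold in the termination test.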
Further, the method for filling the abnormal data with a random forest algorithm according to the abnormal features and the contrast data comprises the following steps:
randomly selecting N pieces of abnormal data to form a training set T, wherein the training set has d features, and k (k < d) features are selected each time for a decision tree;
randomly drawing N pieces of abnormal data from the training set T, using them to train one decision tree, and taking them as the samples at the root node of that decision tree;
when each sample has M attributes and a node of the decision tree needs to be split, randomly selecting m attributes from the M attributes, where m << M, and selecting 1 attribute from the m attributes as the splitting attribute of the node by information gain;
repeatedly selecting splitting attributes until no further split is possible, combining the constructed decision trees to form a random forest, and obtaining a classification result according to a slicing result;
a. filling a feature with a large number of missing values
Taking the non-missing values of feature M as the training labels Y_train, taking the other n-1 features corresponding to those values as the training features X_train, and building a random forest regression tree for training; taking the n-1 features corresponding to the missing values of feature M as the test set X_test, and predicting with the trained model to obtain the predicted values for the missing values of feature M;
b. filling data with multiple missing features
Traversing all the features, starting from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after each regression prediction, writing the predicted values into the original feature matrix and continuing with the next feature.
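The filling order described in a. and b. can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: `fill_missing` and `knn_predict` are hypothetical names, missing values are represented as `None`, other features are temporarily filled with 0 (the text also allows the mode), and a nearest-neighbour mean is used as a stand-in regressor. It is not a random forest; a random forest regressor could be passed in through the `fit_predict` parameter instead.

```python
def knn_predict(X_train, y_train, X_test, k=2):
    """Stand-in regressor: mean label of the k nearest training rows."""
    preds = []
    for x in X_test:
        ranked = sorted(
            range(len(X_train)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
        )
        nearest = ranked[:k]
        preds.append(sum(y_train[i] for i in nearest) / len(nearest))
    return preds


def fill_missing(matrix, fit_predict=knn_predict):
    """Fill None entries feature by feature, fewest-missing feature first."""
    filled = [row[:] for row in matrix]  # leave the input untouched

    def n_missing(col):
        return sum(row[col] is None for row in filled)

    columns = [c for c in range(len(filled[0])) if n_missing(c) > 0]
    for target in sorted(columns, key=n_missing):
        def others(row):
            # Remaining features; still-missing entries are replaced by 0.
            return [0.0 if v is None else v
                    for c, v in enumerate(row) if c != target]

        known = [r for r in filled if r[target] is not None]
        unknown = [r for r in filled if r[target] is None]
        if not known:
            continue  # nothing to learn from for this feature
        preds = fit_predict(
            [others(r) for r in known],    # X_train
            [r[target] for r in known],    # Y_train
            [others(r) for r in unknown],  # X_test
        )
        for row, pred in zip(unknown, preds):
            row[target] = pred  # write the prediction back into the matrix
    return filled


matrix = [
    [1.0, 10.0, 100.0],
    [2.0, None, 200.0],
    [3.0, 30.0, None],
    [4.0, 40.0, 400.0],
    [5.0, None, 500.0],
]
filled = fill_missing(matrix)
```

Because predictions are written back before the next feature is processed, later features are trained on the values filled earlier, as the text describes.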
In a second aspect, an intelligent big data acquisition system comprises:
a data analysis module, configured to identify the preprocessed railway data to obtain abnormal data and contrast data;
a modeling optimization module, configured to construct a data restoration model and optimize the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing a data set from the preprocessed railway data, using Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein the generalization error is E, the number of railway data is n, and the formula further involves the predicted value of the i-th railway data, the standard value of the railway data and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
a data filling module, configured to input the abnormal data and the contrast data into the data restoration model, extract the abnormal features of the abnormal data, fill the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data, and output the filled data as the result.
The beneficial effects of the invention are as follows:
compared with the prior art, the intelligent big data acquisition method has the following technical effects:
through preprocessing, modeling, model optimization, data processing, feature extraction and data filling, the method improves the accuracy of intelligent big data acquisition management, realizes automatic analysis and management of big data, and performs feature extraction and data filling on railway data in real time; it is of significance for intelligent big data acquisition management, adapts to acquisition management under different standards and in different systems, and has a certain universality.
Drawings
FIG. 1 is a flow chart of steps of an intelligent big data acquisition method of the present invention.
Detailed Description
The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.
The invention discloses an intelligent big data acquisition method which comprises the following steps:
as shown in fig. 1, in this embodiment, the steps include:
A, acquiring railway data and preprocessing the railway data;
in the actual evaluation, railway mileage data by year are given:
631 km in 2008, missing in 2009, 1828 km in 2010, 9999 km in 2011, 3084 km in 2012, 3559.1234 km in 2013, 10000 km in 2014, missing in 2015, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 4285.5678 km in 2019, 2520 km in 2020, 2149 km in 2021;
B, identifying the preprocessed railway data to obtain abnormal data and contrast data;
in the actual evaluation, the abnormal data are: 9999 km in 2011, 10000 km in 2014, 4285.5678 km in 2019, and the missing values for 2009 and 2015;
the contrast data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021;
C, constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing a data set from the preprocessed railway data, using Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein the generalization error is E, the number of railway data is n, and the formula further involves the predicted value of the i-th railway data, the standard value of the railway data and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
D, inputting the abnormal data and the contrast data into the data restoration model, extracting the abnormal features of the abnormal data, filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data, and outputting the filled data as the result;
in the actual evaluation, the data filled in 2009 is 2345 km, and the data filled in 2015 is 4748 km.
In this embodiment, the preprocessing in step a includes removing duplicate data, removing anomalous data, data integration, data conversion, and data normalization.
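Three of the listed preprocessing operations (data integration, removal of duplicate data, and data conversion followed by normalization) can be sketched as below. This is a minimal illustration under assumptions: `preprocess` is a hypothetical name, records are (year, raw value) pairs, duplicates are resolved by keeping the first record per year, and normalization is min-max scaling to [0, 1]. The text does not specify any of these choices, and removal of abnormal data is left out of the sketch.

```python
def preprocess(*sources):
    """Merge sources of (year, raw_value) pairs, deduplicate, convert, normalize."""
    merged = {}
    for source in sources:                 # data integration
        for year, raw in source:
            if year in merged:             # duplicate removal: keep first record
                continue
            try:
                merged[year] = float(raw)  # data conversion to a numeric type
            except (TypeError, ValueError):
                merged[year] = None        # unparseable entries become missing
    present = [v for v in merged.values() if v is not None]
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1.0                # avoid division by zero
    # Min-max normalization; missing entries stay missing.
    return {
        year: (None if value is None else (value - lo) / span)
        for year, value in sorted(merged.items())
    }


source_a = [(2008, "631"), (2010, 1828), (2009, None)]
source_b = [(2010, "1828"), (2012, "3084")]
result = preprocess(source_a, source_b)
```

The missing entry is kept as `None` so that a later identification step can still see it.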
In this embodiment, the method for identifying the railway data after preprocessing includes:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within each subsequence:
wherein the formula involves the m-th subsequence and the reconstruction of that subsequence; calculating an objective function between subsequences:
wherein the formula involves the local dependency set and the global dependency set; calculating an anomaly score for each subsequence:
wherein the formula involves three proportions describing the degree of influence of the intra-sequence and inter-sequence abnormal reconstruction errors on the overall error, and the anomaly score of the subsequence; given the set of subsequences for which anomaly scores have been calculated, determining an edge anomaly candidate set:
wherein the formula involves the edge anomaly candidate set, the set of subsequences for which anomaly scores have been calculated, the anomaly score of each subsequence in that set, the abnormality degree threshold, the number of manually labeled samples M, the local dependence factor k, the center point of the edge anomaly candidate set, the s-th subsequence, and the number of subsequences j; calculating the weighted error produced by interaction:
wherein the formula involves the error influencing factor, the weight of the subsequence produced by the (t-1)-th round of interaction, and the weighted error produced by the t-th round of interaction; iterating until the increment of the weighted error is smaller than a given threshold, and then outputting the data identified as anomalous as abnormal data and the remaining data as contrast data;
in the actual evaluation, the abnormal data are: 9999 km in 2011, 10000 km in 2014, 4285.5678 km in 2019, and the missing values for 2009 and 2015;
the comparative data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021.
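To make the split into abnormal data and contrast data concrete, the sketch below partitions the mileage example with a much simpler rule than the subsequence method described above, whose formulas are not reproduced in this text. It is a swapped-in technique for illustration only: missing values are treated as abnormal, and present values are flagged by a median absolute deviation (MAD) score with an assumed cutoff of 3.5. `split_records` and `MILEAGE` are hypothetical names.

```python
from statistics import median


def split_records(records, threshold=3.5):
    """Split {year: value or None} into (abnormal, contrast) dictionaries."""
    present = [v for v in records.values() if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    abnormal, contrast = {}, {}
    for year, value in records.items():
        if value is None:
            abnormal[year] = value  # missing value
        elif mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
            abnormal[year] = value  # far from the median: outlier
        else:
            contrast[year] = value
    return abnormal, contrast


# Railway mileage by year (km) from the evaluation example; None marks missing.
MILEAGE = {
    2008: 631, 2009: None, 2010: 1828, 2011: 9999, 2012: 3084,
    2013: 3559.1234, 2014: 10000, 2015: None, 2016: 2337, 2017: 1856,
    2018: 4050, 2019: 4285.5678, 2020: 2520, 2021: 2149,
}
abnormal, contrast = split_records(MILEAGE)
```

On this data the rule flags the two missing years and the values for 2011 and 2014. It does not flag 2019, which the evaluation treats as noise, so it reproduces only part of the patent's result.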
In this embodiment, a method for optimizing generalization capability of a data repair model by using a first optimization algorithm includes:
generating a node from the data of the training set; if all railway data in the training set belong to the same class, marking the node as a leaf node of that class; if all railway data in the training set take the same values on the attribute set, marking the node as a leaf node whose class is the class with the largest number of samples in the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion in the node as the tendency class of the node;
obtaining, as Q, the number of samples in the validation set that are classified as the tendency class, initializing a count, and, for each value of the optimal splitting attribute, selecting the subset of the training railway data that takes this value;
selecting the subset of the validation set that takes this value of the optimal splitting attribute; if the training subset is empty, taking the class with the highest proportion in the node as the tendency class of the branch; otherwise, obtaining the number of samples in the validation subset that are classified as the tendency class of the branch and accumulating it in the count; if Q is greater than or equal to the count, marking the branch node as a leaf node whose class is the most frequent class in the railway data.
In this embodiment, the method for optimizing the accuracy of the data repair model by using the second optimization algorithm includes:
calculating the empirical entropy of the training data set:
wherein the formula involves the empirical entropy of the training data set w, the s-th class, and the number of classes Q; calculating the empirical conditional entropy of a feature with respect to the training set:
wherein the formula involves the training set data of the i-th feature v, the number of features m, the s-th class of the i-th feature, and the empirical conditional entropy of the feature v on the training data set w; calculating the information gain on the training data set:
wherein the formula gives the information gain of the feature v on the training data set w; calculating the information gain ratio on the training data set:
wherein the formula gives the information gain ratio of the feature v on the training data set w; the information gain ratios of the features are sorted in descending order, and the feature with the largest information gain ratio is selected as the optimal splitting feature.
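The ranking of features by information gain ratio can be sketched in a few lines. This is a minimal illustration using the standard definitions of entropy, conditional entropy, gain and gain ratio for discrete features; the function names and the toy rows are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain ratio of one discrete feature with respect to labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Empirical conditional entropy of the labels given the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)  # entropy of the feature itself
    return gain / split_info if split_info > 0 else 0.0


def rank_features(rows, labels):
    """Return (ratio, feature) pairs sorted by ratio in descending order."""
    names = list(rows[0])
    return sorted(
        ((gain_ratio([r[f] for r in rows], labels), f) for f in names),
        reverse=True,
    )


# Toy data: "sensor" separates the classes perfectly, "shift" not at all.
rows = [
    {"sensor": "a", "shift": "day"},
    {"sensor": "a", "shift": "night"},
    {"sensor": "b", "shift": "day"},
    {"sensor": "b", "shift": "night"},
]
labels = ["normal", "normal", "abnormal", "abnormal"]
ranking = rank_features(rows, labels)
best_feature = ranking[0][1]
```

The first entry of the ranking corresponds to the optimal splitting feature selected in the last step of the text.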
In this embodiment, the method for extracting the abnormal feature of the abnormal data includes:
setting a population size, initializing the population and the understanding space, and calculating a fitness function according to the following formula:
wherein the number of classified feature subsets is c, and the formula further involves the subset mean vector of the i-th class, the mean vector of the feature set, the vector of feature j of class i, the number of features j, and the feature set of the i-th class; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data:
wherein the formula involves the degree of difference between any two abnormal data in the j-th feature dimension, the difference values of the individual abnormal data in the dimension of feature j, and the maximum and minimum values of the feature weight; calculating the selection probability of the abnormal features of the abnormal data:
wherein the formula involves the weight value of the feature and the fitness function value of the t-th generation individual; selecting two individuals from the updated population according to the selection probability and performing crossover to obtain a new-generation population, and extracting the feature with the maximum fitness as the important feature;
comparing the fitness value of the important feature with a fitness threshold; if the fitness value of the important feature is larger than the fitness threshold, terminating the iteration and outputting the top 3 features ranked by fitness as the abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature is larger than the fitness threshold;
deleting the remaining abnormal data that cannot be filled;
in the actual evaluation, the extracted features are:
outlier: the value for 2011 is 9999 km, far higher than the values for other years, and is regarded as an outlier;
outlier: the value for 2014 is 10000 km, far higher than the values for other years, and is regarded as an outlier;
noise data: the value for 2019 contains digits after the decimal point and is noisy compared with the values for other years;
missing values: data for 2009 and 2015 are missing, with no values available;
the processed output data are: missing in 2009, missing in 2015, 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021.
In this embodiment, the method for filling the abnormal data with a random forest algorithm according to the abnormal features and the contrast data includes:
randomly selecting N pieces of abnormal data to form a training set T, wherein the training set has d features, and k (k < d) features are selected each time for a decision tree;
randomly drawing N pieces of abnormal data from the training set T, using them to train one decision tree, and taking them as the samples at the root node of that decision tree;
when each sample has M attributes and a node of the decision tree needs to be split, randomly selecting m attributes from the M attributes, where m << M, and selecting 1 attribute from the m attributes as the splitting attribute of the node by information gain;
repeatedly selecting splitting attributes until no further split is possible, combining the constructed decision trees to form a random forest, and obtaining a classification result according to a slicing result;
a. filling a feature with a large number of missing values
Taking the non-missing values of feature M as the training labels Y_train, taking the other n-1 features corresponding to those values as the training features X_train, and building a random forest regression tree for training; taking the n-1 features corresponding to the missing values of feature M as the test set X_test, and predicting with the trained model to obtain the predicted values for the missing values of feature M;
b. filling data with multiple missing features
Traversing all the features, starting from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after each regression prediction, writing the predicted values into the original feature matrix and continuing with the next feature;
in the actual evaluation, the data filled in 2009 is 2345 km, and the data filled in 2015 is 4748 km.
In a second aspect, an intelligent big data acquisition system comprises:
a data acquisition module, configured to acquire railway data and preprocess the railway data;
a data analysis module, configured to identify the preprocessed railway data to obtain abnormal data and contrast data;
a modeling optimization module, configured to construct a data restoration model and optimize the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing a data set from the preprocessed railway data, using Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein the generalization error is E, the number of railway data is n, and the formula further involves the predicted value of the i-th railway data, the standard value of the railway data and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
a data filling module, configured to input the abnormal data and the contrast data into the data restoration model, extract the abnormal features of the abnormal data, fill the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data, and output the filled data as the result.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (8)

1. The intelligent big data acquisition method is characterized by comprising the following steps of:
A, acquiring railway data and preprocessing the railway data;
B, identifying the preprocessed railway data to obtain abnormal data and contrast data;
C, constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing a data set from the preprocessed railway data, using Boosting as the ensemble model of the random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein the generalization error is E, the number of railway data is n, and the formula further involves the predicted value of the i-th railway data, the standard value of the railway data and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
D, inputting the abnormal data and the contrast data into the data restoration model, extracting the abnormal features of the abnormal data, filling the abnormal data with a random decision forest algorithm according to the abnormal features and the contrast data, and outputting the filled data as the result.
2. The intelligent big data acquisition method according to claim 1, wherein the preprocessing in step a includes removing duplicate data, removing abnormal data, integrating data, converting data, and normalizing data.
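Two of the preprocessing operations named in claim 2, removing duplicate data and normalizing data, can be illustrated with a minimal sketch. The record layout, the field names `speed` and `load`, and the choice of min-max scaling are assumptions made for illustration; the claim does not specify a normalization method.

```python
from typing import Dict, List

Record = Dict[str, float]


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def min_max_normalize(records: List[Record], field: str) -> List[Record]:
    """Scale one field to [0, 1]; min-max scaling is an assumed choice."""
    if not records:
        return []
    values = [rec[field] for rec in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    out = []
    for rec in records:
        scaled = dict(rec)  # copy so the input records are not modified
        scaled[field] = 0.0 if span == 0 else (rec[field] - lo) / span
        out.append(scaled)
    return out


# Hypothetical railway records; the field names are illustrative only.
records = [
    {"speed": 80.0, "load": 10.0},
    {"speed": 80.0, "load": 10.0},
    {"speed": 120.0, "load": 30.0},
    {"speed": 100.0, "load": 20.0},
]
unique = deduplicate(records)
normalized = min_max_normalize(unique, "speed")
```

Removing abnormal data, integrating data, and converting data depend on the data sources and are left out of this sketch.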
3. The intelligent big data acquisition method according to claim 1, wherein the method for identifying the preprocessed railway data comprises the following steps:
dividing the railway data into subsequences, capturing the dependency relationships of the subsequences, and calculating an objective function within the subsequences:
wherein the quantities in the formula are the m-th subsequence and the reconstruction of that subsequence; calculating an objective function between the subsequences:
wherein the quantities in the formula are the local dependency set and the global dependency set; calculating an anomaly score for each subsequence:
wherein the quantities in the formula are the proportions in which the intra-sequence and inter-sequence abnormal reconstruction errors influence the overall error, and the anomaly score of the subsequence; given the set of subsequences for which anomaly scores have been calculated, obtaining an edge anomaly candidate set:
wherein the quantities in the formula are the edge anomaly candidate set, the set of subsequences for which anomaly scores have been calculated, the subsequences making up that set and their anomaly scores, the anomaly-degree threshold, the center point of the edge anomaly candidate set, and the s-th subsequence; M is the number of manually labeled samples, k is the local dependence factor, and j is the number of subsequences; calculating the weighted error produced by interaction:
wherein the quantities in the formula are the error influence factor, the weight of the subsequence produced by the (t-1)-th round of interaction, and the weighted error produced by the t-th round of interaction; iterating until the increment of the weighted error is smaller than a given threshold, and outputting the data identified as abnormal as the abnormal data and the remaining data as the comparison data.
4. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the generalization capability of the data restoration model with the first optimization algorithm comprises the following steps:
generating a node from the training set data; if all the railway data in the training set belong to a single class, marking the node as a leaf node of that class; if the railway data in the training set take the same values on the attribute set, marking the node as a leaf node and marking the class of the leaf node as the class with the largest number of samples in the railway data;
selecting the optimal splitting attribute from the attribute set, and taking the class with the highest proportion in the node as the tendency class of the node;
obtaining the number of data in the verification set classified as the tendency class as Q, initializing a count, and, for each value of the optimal splitting attribute, selecting the railway data that take that value as a first subset;
selecting the data in the verification set that take that value of the optimal splitting attribute as a second subset; if the first subset is empty, taking the class with the highest proportion in the node as the tendency class of the node; otherwise, obtaining the number of data in the second subset classified as the tendency class; if Q is greater than or equal to the initialized count, marking the branch node as a leaf node whose class is the most frequent class in the railway data.
5. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the precision of the data restoration model with the second optimization algorithm comprises the following steps:
calculating the empirical entropy of the training data set:
wherein the quantities in the formula are the empirical entropy of the training data set w and the s-th class, and Q is the number of classes; calculating the empirical conditional entropy of a feature with respect to the training set:
wherein the quantities in the formula are the training set data of the i-th feature, the s-th class of the i-th feature, and the empirical conditional entropy of the feature v on the training data set w, and m is the number of features; calculating the information gain of the training data set:
wherein the quantity in the formula is the information gain of the feature v on the training data set w; calculating the information gain ratio of the training data set:
wherein the quantity in the formula is the information gain ratio of the feature v on the training data set w; the information gain ratios of the features are sorted in descending order, and the feature with the largest information gain ratio is selected as the optimal splitting feature.
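The chain of quantities in claim 5 (empirical entropy, empirical conditional entropy, information gain, information gain ratio, then a descending sort) can be sketched as follows. This is a minimal illustration that assumes the standard textbook definitions with base-2 logarithms, and assumes that the gain ratio divides the information gain by the entropy of the feature's own value distribution, as in C4.5. The function names and the toy columns `sensor` and `shift` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Empirical entropy of a label sequence, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Empirical conditional entropy of the labels given one feature."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain divided by the entropy of the feature values (assumed C4.5 form)."""
    gain = entropy(labels) - conditional_entropy(values, labels)
    split_info = entropy(values)
    return 0.0 if split_info == 0 else gain / split_info


def rank_features(columns: Dict[str, Sequence[Hashable]],
                  labels: Sequence[Hashable]) -> Tuple[str, List[str]]:
    """Sort features by gain ratio in descending order and return the best one."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels), reverse=True)
    return ranked[0], ranked


# Toy data: "sensor" separates the classes perfectly, "shift" carries no information.
labels = ["ok", "ok", "bad", "bad"]
columns = {"sensor": ["a", "a", "b", "b"], "shift": ["d", "n", "d", "n"]}
best, ranking = rank_features(columns, labels)
```

In this toy data the `sensor` column has a gain ratio of 1 and the `shift` column has a gain ratio of 0, so `sensor` is chosen as the splitting feature.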
6. The intelligent big data acquisition method according to claim 1, wherein the method for extracting the abnormal features of the abnormal data comprises the following steps:
setting a population size, initializing the population and the understanding space, and calculating a fitness function according to the following formula:
wherein c is the number of classified feature subsets, and the other quantities in the formula are the subset mean vector of the i-th class, the mean vector of the feature set, the vector of feature j of class i, the number of features j, and the feature set of the i-th class; calculating the fitness function value of the abnormal data according to the fitness function, and adjusting the feature weights of the abnormal data:
wherein the quantities in the formula are the degree of difference between any two abnormal data in the j-th feature dimension, the difference values of the individual abnormal data in the dimension of feature j, the maximum value of the feature weight, and the minimum value of the feature weight; calculating the selection probability of the abnormal features of the abnormal data:
wherein the quantities in the formula are the weight value of the feature and the fitness function value of the t-th generation individual; selecting two individuals from the updated population according to the selection probability for crossover and recombination to obtain a new generation of the population, and extracting the feature with the greatest fitness as the important feature;
comparing the fitness value of the important feature with a fitness threshold; if the fitness value of the important feature is greater than the fitness threshold, terminating the iteration and outputting the three features with the highest fitness as the abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature is greater than the fitness threshold;
and deleting the remaining abnormal data that cannot be filled.
7. The intelligent big data acquisition method according to claim 1, wherein the method for filling the abnormal data with a random forest algorithm according to the abnormal features and the comparison data comprises the following steps:
randomly selecting N abnormal data to form a training set T, wherein the training set has d features, and k (k < d) features are selected each time to build a decision tree;
randomly drawing N abnormal data from the training set T and using them to train a decision tree, as the samples at the root node of the decision tree;
when each sample has M attributes and a node of the decision tree needs to be split, randomly selecting m attributes from the M attributes, where m << M, and selecting one of the m attributes as the splitting attribute of the node using information gain;
repeatedly selecting splitting attributes until no further split is possible, combining the constructed decision trees to form a random forest, and obtaining a classification result according to the splitting result;
a. filling a feature with a large number of missing values
Taking the non-missing values of feature M as the training labels Y_train, and the corresponding values of the other n-1 features as the training features X_train, and building a random forest regression tree for training; taking the values of the n-1 features corresponding to the missing values of feature M as the test set X_test, and predicting with the trained model to obtain the predicted values for the missing values of feature M;
b. filling data with multiple missing features
Traversing all the features and filling from the feature with the fewest missing values; when filling one feature, replacing the missing values of the other features with 0 or the mode; after each regression prediction, putting the predicted values into the original feature matrix and continuing to fill the next feature.
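The filling order in item b of claim 7 can be sketched in plain Python. To keep the sketch self-contained, a hypothetical nearest-neighbour regressor stands in for the random forest regression tree; it is not the patent's model, and any regressor with the same call shape (training features, training labels, test features) could be substituted. The function names and the toy matrix are assumptions, and missing values in the other features are temporarily replaced with 0, one of the two options named in the claim.

```python
from typing import Callable, List, Optional

Row = List[Optional[float]]
Regressor = Callable[[List[List[float]], List[float], List[List[float]]], List[float]]


def nearest_neighbour_regressor(x_train: List[List[float]], y_train: List[float],
                                x_test: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for a random forest regressor: copy the closest training label."""
    preds = []
    for x in x_test:
        dists = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in x_train]
        preds.append(y_train[dists.index(min(dists))])
    return preds


def fill_missing(rows: List[Row], regressor: Regressor = nearest_neighbour_regressor) -> List[Row]:
    """Fill None entries feature by feature, starting with the feature that has the fewest gaps."""
    data = [list(r) for r in rows]  # work on a copy of the feature matrix
    n_features = len(data[0])
    missing = [sum(r[j] is None for r in data) for j in range(n_features)]
    order = sorted((j for j in range(n_features) if missing[j] > 0), key=lambda j: missing[j])
    for j in order:
        others = [c for c in range(n_features) if c != j]

        def features(r: Row) -> List[float]:
            # Other features' gaps are temporarily replaced with 0.
            return [0.0 if r[c] is None else r[c] for c in others]

        train = [r for r in data if r[j] is not None]  # rows where feature j is present
        test = [r for r in data if r[j] is None]       # rows where feature j is missing
        if not train:
            continue
        preds = regressor([features(r) for r in train], [r[j] for r in train],
                          [features(r) for r in test])
        for r, p in zip(test, preds):
            r[j] = p  # write the prediction back before filling the next feature
    return data


# Toy matrix: feature 0 has one gap and feature 1 has two, so feature 0 is filled first.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [1.1, None], [None, 19.0]]
filled = fill_missing(rows)
```

Because predictions are written back into the matrix after each feature, a value filled in an earlier step serves as an input when later features are filled, which matches the order described in item b.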
8. An intelligent big data acquisition system, comprising:
a data acquisition module, used for acquiring railway data and preprocessing the railway data;
a data analysis module, used for identifying the preprocessed railway data to obtain abnormal data and comparison data;
a modeling optimization module, used for constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:
constructing the preprocessed railway data as a data set, taking Boosting as the ensemble model of a random decision forest, and taking bias, variance and noise as the generalization error of the random decision forest:
wherein E is the generalization error, n is the number of railway data, and the remaining quantities are the predicted value of the i-th railway data, the standard value of the railway data, and the noise of the railway data;
inputting the railway data set into the data restoration model for training, optimizing the generalization capability of the data restoration model with a first optimization algorithm, and optimizing the precision of the data restoration model with a second optimization algorithm;
and a data filling module, used for inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal features of the abnormal data, filling the abnormal data with a random decision forest algorithm according to the abnormal features and the comparison data, and outputting the filled data as the result.
CN202311474709.0A 2023-11-08 2023-11-08 Intelligent big data acquisition system Active CN117216490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311474709.0A CN117216490B (en) 2023-11-08 2023-11-08 Intelligent big data acquisition system

Publications (2)

Publication Number Publication Date
CN117216490A (en) 2023-12-12
CN117216490B (en) 2024-01-19

Family

ID=89035674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311474709.0A Active CN117216490B (en) 2023-11-08 2023-11-08 Intelligent big data acquisition system

Country Status (1)

Country Link
CN (1) CN117216490B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant