CN117216490A

CN117216490A - Intelligent big data acquisition system

Info

Publication number: CN117216490A
Application number: CN202311474709.0A
Authority: CN
Inventors: 阎胜勇; 郑慧亚; 甄津; 常灿; 田珊; 凡凯乐; 李生杰; 王少华
Original assignee: Institute of Computing Technologies of CARS; Beijing Jingwei Information Technology Co Ltd
Current assignee: Institute of Computing Technologies of CARS; Beijing Jingwei Information Technology Co Ltd
Priority date: 2023-11-08
Filing date: 2023-11-08
Publication date: 2023-12-12
Anticipated expiration: 2043-11-08
Also published as: CN117216490B

Abstract

The invention discloses an intelligent big data acquisition system, which comprises the steps of acquiring railway data, preprocessing the railway data, identifying the preprocessed railway data, acquiring abnormal data and contrast data, constructing a data restoration model, optimizing the data restoration model, extracting abnormal characteristics of the abnormal data, inputting the abnormal data and the contrast data into the data restoration model, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the contrast data, and outputting the filled data as a result. The method not only can improve the precision of big data acquisition, but also has better interpretability, and can be directly applied to an intelligent big data acquisition system.

Description

Intelligent big data acquisition system

Technical Field

The invention relates to the field of big data, in particular to an intelligent big data acquisition system.

Background

At present, the planning and statistical informatization level of the national iron group is in the scattered stage of professional division, decision systematicness and standardization are to be perfected, the problems of obvious data quality exist, each railway business information system is independently built and self-organized, information resources among systems and even in the systems cannot be effectively integrated, each application system is independent to form a network, a database is isolated, application software is special, data formats, technical specifications, interface standards, log files and the like are lack of consistency, compatibility is poor, information exchange among systems is difficult, information sharing degree is low, information is difficult to comprehensively utilize, and difficulty in system construction and dimension is large.

The acquisition technology is widely applied in the field of big data, and can help managers of a big data acquisition system to analyze data timely and efficiently, so that analysis and management of the data are realized. At present, the large data acquisition method has more uncertain factors due to the characteristics of huge information quantity, various types, large information density and the like of railway data, so that the large data acquisition method has larger uncertainty. Although some intelligent big data acquisition methods and systems have been invented, the problem of uncertainty of the big data acquisition method cannot be effectively solved.

Disclosure of Invention

The invention aims to provide an intelligent big data acquisition system.

In order to achieve the above purpose, the invention is implemented according to the following technical scheme:

the invention comprises the following steps:

a, acquiring railway data and preprocessing the railway data;

b, recognizing the preprocessed railway data to obtain abnormal data and contrast data;

c, constructing a data restoration model, and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:

constructing the preprocessed railway data as a data set, taking Boosting as an integrated model of a random decision forest, and taking deviation, variance and noise as generalization errors of the random decision forest:

wherein the generalization error is E, and the predicted value of the ith railway data isThe standard value of the railway data is ∈ ->Noise of railway data is +.>The number of railway data is n;

inputting the railway data set into a data restoration model for training, optimizing the generalization capability of the data restoration model by adopting a first optimization algorithm, and optimizing the precision of the data restoration model by adopting a second optimization algorithm;

and D, inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.

Further, the preprocessing in the step A comprises removing repeated data, removing abnormal data, integrating data, converting data and normalizing data.

Further, the method for identifying the railway data after preprocessing comprises the following steps:

dividing railway data into subsequences, capturing the dependency relationship of the subsequences, and calculating an objective function in the subsequences:

wherein the mth subsequence isSub-sequence->Is reconstructed as->Calculating an objective function between subsequences:

wherein the local dependency set isThe global dependency set is +.>Calculating an anomaly score for the subsequence:

wherein the degree of influence of the intra-sequence and inter-sequence abnormal reconstruction errors on the overall error is proportional to、/>、/>SubsequencesIs +.>Given a set of subsequences for which anomaly scores have been calculated, a set of edge anomaly candidates is given:

wherein the edge anomaly candidate set isThe set of subsequences for which the abnormality score was calculated is +.>Subsequence->Assembly of components->Is +.>The threshold value of abnormality degree is->The number of manually-marked samples is M, the local dependence factor is k, and the center point of the edge anomaly candidate set is +.>The s-th subsequence is->The number of the subsequences is j, and a weighted error generated by interaction is calculated:

wherein the error influencing factor isThe sub-sequence generated by the t-1 round of interaction +.>The weight of (2) is +.>The t-th interaction produces a weighted error of +.>And continuously iterating until the weighted error increment is smaller than a given threshold value, and outputting the rest data as contrast data and the rest data as abnormal data.

Further, the method for optimizing the generalization capability of the data restoration model by adopting the first optimization algorithm comprises the following steps:

generating iron nodes according to the data of the training set, if the railway data in the training set belong to a class forest, marking the iron nodes as leaf nodes of the class forest, if the values of the railway data in the training set on the attribute set are the same, marking rabbit nodes as leaf nodes, and marking class waves of the leaf nodes as classes with the largest sample number in the railway data;

selecting the optimized attribute loss from the attribute set, and taking the category with the highest proportion of categories in the nodes as the tendency category preference of the iron nodes;

acquiring the number classified as happiness in the verification set as Q, initializing the count, repeating each value of the optimizing attribute loss, and selecting a data set with the value of the railway data as the optimizing attribute loss as fire;

selecting data which is concentrated in the attribute and takes the value as the optimal attribute loss as the rice, if the fire is empty, selecting the category with the highest proportion of categories in the nodes in the attribute as the tendency category smoke of the iron node, otherwise, acquiring the number of the category smoke classified as the tendency category smoke in the rice, if Q is more than or equal to the initialized count, marking the branch node as the leaf node, and marking the category as the most category in the railway data.

Further, the method for optimizing the precision of the data restoration model by adopting the second optimization algorithm comprises the following steps:

calculating the empirical entropy of the training data set:

wherein the empirical entropy of the training data set w isThe s-th class is->The number of categories is Q, and the empirical conditional entropy of the feature pair training set is calculated:

wherein the training set data of the ith feature is vThe number of features is m, the s-th class of the i-th feature is +.>The empirical conditional entropy of the feature v on the training dataset w is +.>Calculating the information gain of the training data set:

wherein the information gain of the w characteristic v of the training data set isCalculating the information gain ratio of the training data set:

wherein the information gain ratio of the w characteristic v of the training data set isThe information gain ratios of the features are ordered in descending order, and the feature with the largest information gain ratio is selected as the optimal segmentation feature.

Further, the method for extracting the abnormal characteristics of the abnormal data comprises the following steps:

setting a population scale, initializing the population and understanding space, and calculating an fitness function according to the following formula:

wherein the number of feature subsets of the classification is c, and the subset mean vector of the ith class isThe average vector of the feature set is +.>The vector of the feature j of class i is +.>The number of features j is +.>The feature set of the ith class is +.>Calculating an fitness function value of the abnormal data according to the fitness function, and adjusting the characteristic weight of the abnormal data:

wherein the difference degree of any two abnormal data in the j-th characteristic dimension isThe difference value of the individual anomaly data in the dimension of feature j is +.>、/>Characteristics->The maximum value of the weight is +.>Characteristics->The minimum value of the weight is +.>Calculating the abnormal data abnormal feature selection probability:

wherein the features areThe weight value of (2) is +.>Of individuals of the t th generationThe fitness function value is->Selecting two individuals from the updated population according to the selection probability to perform cross recombination to obtain a new generation population, and extracting the characteristic with the maximum fitness as an important characteristic;

comparing the fitness value of the important features with a fitness threshold value, terminating iteration if the fitness value of the important features is larger than the fitness threshold value, and outputting the first 3 features of the fitness rank as abnormal features; otherwise, recalculating the fitness value until the fitness value of the important feature is larger than the fitness threshold;

and deleting the rest abnormal data which cannot be filled.

Further, the method for filling the abnormal data by adopting a random forest algorithm according to the abnormal characteristics and the comparison number comprises the following steps:

randomly selecting N pieces of abnormal data to form a training set T, wherein the training set has d characteristics, and k (k < d) pieces of decision trees are selected each time;

randomly selecting N pieces of abnormal data from the training set T, selecting N pieces of abnormal data to train a decision tree, and taking the decision tree as a sample at a root node of the decision tree;

when each sample has M attributes, and each node of the decision tree needs to be split, randomly selecting M attributes from the M attributes, meeting the condition M < < M, and selecting 1 attribute from the M attributes as the splitting attribute of the node by adopting information gain;

repeatedly selecting splitting attributes until the splitting cannot be performed again, combining the constructed decision trees to form a random forest, and obtaining a classification result according to a slicing result;

a. filling in a large number of missing features

Taking the value without missing the feature M as a training label Y_train, taking the other n-1 features corresponding to the value as training features X_train, and building a random forest regression tree for training; taking n-1 features corresponding to the feature M missing values as a test set X_test, and carrying out prediction by a trainer to finally obtain predicted values of the M feature missing values;

b. filling of multiple feature missing data

Traversing all the features, filling from the feature with the least missing, replacing the missing value of other features with 0 or mode when filling one feature, putting the predicted value into the original feature matrix after finishing one regression prediction, and continuously filling the next feature.

In a second aspect, an intelligent big data acquisition system comprises:

and a data analysis module: the method comprises the steps of identifying the railway data after pretreatment to obtain abnormal data and contrast data;

modeling optimization module: the method comprises the steps of constructing a data restoration model and optimizing the data restoration model; the method for constructing the data restoration model comprises the following steps:

and the data filling module is used for: the method comprises the steps of inputting the abnormal data and the comparison data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the comparison data, and outputting the filled data as a result.

The beneficial effects of the invention are as follows:

compared with the prior art, the intelligent big data acquisition method has the following technical effects:

the method can improve the accuracy of intelligent big data acquisition management by preprocessing, modeling, optimizing a model, processing data, extracting features and filling data, thereby improving the accuracy of intelligent big data acquisition management, realizing the automatic analysis and management of big data, carrying out feature extraction and data filling on railway data in real time, having important significance on intelligent big data acquisition management, adapting to intelligent big data acquisition management of different standards and intelligent big data acquisition management of different systems, and having certain universality.

Drawings

FIG. 1 is a flow chart of steps of an intelligent big data acquisition method of the present invention.

Detailed Description

The invention is further described by the following specific examples, which are presented to illustrate, but not to limit, the invention.

The invention discloses an intelligent big data acquisition method which comprises the following steps:

as shown in fig. 1, in this embodiment, the steps include:

a, acquiring railway data and preprocessing the railway data;

in the actual evaluation, railroad mileage data in units of years is given:

631 km in 2008, 2009 loss, 1828 km in 2010, 9999 km in 2011, 3084 km in 2012, 3559.1234 km in 2013, 10000 km in 2014, 2015 loss, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 4285.5678 km in 579, 2520 km in 2020, 2149 km in 2021;

in actual evaluation, the anomaly data is: 9999 km in 2011, 10000 km in 2014, 4285.5678 km in 2019, 2009 and 2015;

the comparative data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021;

inputting the abnormal data and the contrast data into the data restoration model, extracting the abnormal characteristics of the abnormal data, filling the abnormal data by adopting a random decision forest algorithm according to the abnormal characteristics and the contrast data, and outputting the filled data as a result;

in the actual evaluation, the data filled in 2009 is 2345 km, and the data filled in 2015 is 4748 km.

In this embodiment, the preprocessing in step a includes removing duplicate data, removing anomalous data, data integration, data conversion, and data normalization.

In this embodiment, the method for identifying the railway data after preprocessing includes:

wherein the error influencing factor isThe sub-sequence generated by the t-1 round of interaction +.>The weight of (2) is +.>The t-th interaction produces a weighted error of +.>Continuously iterating until the weighted error increment is smaller than a given threshold value, and stopping to lead out the rest data as contrast data and the rest data as abnormal data;

the comparative data are: 631 km in 2008, 1828 km in 2010, 3084 km in 2012, 3559.1234 km in 2013, 2337 km in 2016, 1856 km in 2017, 4050 km in 2018, 2520 km in 2020, 2149 km in 2021.

In this embodiment, a method for optimizing generalization capability of a data repair model by using a first optimization algorithm includes:

In this embodiment, the method for optimizing the accuracy of the data repair model by using the second optimization algorithm includes:

calculating the empirical entropy of the training data set:

In this embodiment, the method for extracting the abnormal feature of the abnormal data includes:

wherein the features areThe weight value of (2) is +.>The fitness function value of the t-th generation individual is +.>Selecting two individuals from the updated population according to the selection probability to perform cross recombination to obtain a new generation population, and extracting the characteristic with the maximum fitness as an important characteristic;

deleting the rest abnormal data which cannot be filled;

in the actual evaluation, the extracted features are:

outliers: data in 2011 is 9999 km, which is far higher than data in other years, and is regarded as an abnormal value;

outliers: data in 2014 is 10000 ten thousand km, which is far higher than data in other years, and is regarded as an outlier;

noise data: the data in 2019 contains values after decimal points, and noise exists compared with the data in other years;

missing values: data were missing in 2009 and 2015, with no values available;

the processed output data are 2009, 2015, 2008, 631, 1828, 2012, 3084, 2013, 3559.1234, 2016, 2337, 1856, 4050, 2020, 2520, 2021, 2149.

In this embodiment, the method for filling the abnormal data with a random forest algorithm according to the abnormal feature and the comparison number includes:

a. filling in a large number of missing features

b. filling of multiple feature missing data

Traversing all the features, filling from the feature with the least missing, replacing the missing value of other features with 0 or mode when filling one feature, putting the predicted value into the original feature matrix after finishing one regression prediction, and continuously filling the next feature;

In a second aspect, an intelligent big data acquisition system comprises:

and a data acquisition module: the method comprises the steps of acquiring railway data and preprocessing the railway data;

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. The intelligent big data acquisition method is characterized by comprising the following steps of:

a, acquiring railway data and preprocessing the railway data;

constructing the preprocessed railway data as a data set, taking Boosting as an integrated model taking Boosting as a random decision forest, and taking deviation, variance and noise as generalization errors of the random decision forest:

2. The intelligent big data collection method according to claim 1, wherein the preprocessing in the step a includes removing duplicate data, removing abnormal data, integrating data, converting data, and normalizing data.

3. The intelligent big data acquisition method according to claim 1, wherein the method for identifying the preprocessed railway data comprises the following steps:

wherein the degree of influence of the intra-sequence and inter-sequence abnormal reconstruction errors on the overall error is proportional to、/>、/>Subsequence->Is +.>Given a set of subsequences for which anomaly scores have been calculated, a set of edge anomaly candidates is given:

wherein the edge anomaly candidate set isThe set of subsequences for which the abnormality score was calculated is +.>Subsequence->Assembly of components->Is abnormal of (a)Score->The threshold value of abnormality degree is->The number of manually-marked samples is M, the local dependence factor is k, and the center point of the edge anomaly candidate set is +.>The s-th subsequence is->The number of the subsequences is j, and a weighted error generated by interaction is calculated:

4. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the generalization ability of the data restoration model by adopting the first optimization algorithm comprises the following steps:

5. The intelligent big data acquisition method according to claim 1, wherein the method for optimizing the accuracy of the data restoration model by using the second optimization algorithm comprises the following steps:

calculating the empirical entropy of the training data set:

6. The intelligent big data acquisition method according to claim 1, wherein the method for extracting the abnormal characteristics of the abnormal data comprises the steps of:

wherein the difference degree of any two abnormal data in the j-th characteristic dimension isThe difference value of the individual anomaly data in the dimension of feature j is +.>、/>Special (special)Syndrome of->The maximum value of the weight is +.>Characteristics->The minimum value of the weight is +.>Calculating the abnormal data abnormal feature selection probability:

and deleting the rest abnormal data which cannot be filled.

7. The intelligent big data acquisition method according to claim 1, wherein the method for filling the abnormal data by adopting a random forest algorithm according to the abnormal characteristics and the comparison number comprises the following steps:

a. filling in a large number of missing features

b. filling of multiple feature missing data

8. An intelligent big data acquisition system, comprising: