CN117393149A - Time sequence data prediction method for lung nodule pathogenesis

Info

Publication number: CN117393149A
Application number: CN202311409606.6A
Authority: CN
Inventors: 黄勃; 周禹宣; 虞益军; 粟波; 臧振森
Original assignee: Shanghai Zhongyu Industrial Internet Research Institute; Shanghai University of Engineering Science; Shanghai Pulmonary Hospital
Current assignee: Shanghai Zhongyu Industrial Internet Research Institute; Shanghai University of Engineering Science; Shanghai Pulmonary Hospital
Priority date: 2023-10-27
Filing date: 2023-10-27
Publication date: 2024-01-12

Abstract

The invention relates to the technical field of artificial intelligence and discloses a time series data prediction method for lung nodule onset. The method comprises: acquiring a time series data set containing a plurality of features of lung nodule onset, and performing translation interaction processing on the time series data of each feature to obtain an initial time series data set; performing data noise reduction and data enhancement on the time series data of each feature in the initial time series data set to obtain a final time series data set; and finally performing regression on the final time series data set using extremely randomized trees to obtain a prediction result. On a public tuberculosis data set, the prediction method achieves high accuracy, both in handling ultra-long sequences and in parallelism.

Description

Time sequence data prediction method for lung nodule pathogenesis

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a time series data prediction method for lung nodule onset.

Background

A time series is a type of data that records how something changes over time, and time series prediction uses the data of a past period to predict the data of a future period. Time series analysis is generally considered to originate from the autoregressive model proposed by the British statistician G. U. Yule in 1927. Since then, and especially after computer software and hardware became widespread, more and more researchers have recognized the importance of time series prediction, its range of application has kept expanding in industry and academia, and a new wave of research has followed. The main current time series prediction methods and their technical problems are as follows:

1) Traditional time series prediction models, such as smoothing, trend fitting, combination methods, AR, MA, ARMA and ARIMA. These algorithms require the time series data to be stationary and are insensitive to nonlinear relationships.

2) Intelligent algorithms, such as the Prophet model proposed by Facebook in 2017. Prophet uses the idea of time series decomposition to build regression models for the components along the time axis and optimizes their parameters within a Bayesian framework. These algorithms often need to be supplied with sufficiently correct information and with prior distributions of the parameters.

3) Methods based on recurrent neural networks, which introduce the concept of time sequence by borrowing the long short-term memory (LSTM) network used for natural language context prediction in the NLP field. However, the LSTM algorithm performs poorly on very long sequences.

Disclosure of Invention

The invention provides a time series data prediction method for lung nodule onset. It applies a time series data translation interaction mechanism to the time series data set, which solves the technical problem of insufficient connection between earlier and later data in a time series.

The invention can be realized by the following technical scheme:

A time series data prediction method for lung nodule onset comprises the following steps:

acquiring a time series data set containing a plurality of features of lung nodule onset, and performing translation interaction processing on the time series data of each feature to obtain an initial time series data set;

performing data noise reduction and data enhancement on the time series data of each feature in the initial time series data set to obtain a final time series data set;

and finally performing regression on the final time series data set using extremely randomized trees to obtain a prediction result.

Further, the time series data set is arranged as a matrix in which the time series data of each feature forms a column vector, and the translation interaction processing of the time series data of each feature comprises the following steps:

1) Deep copy the time series data set D to obtain D_shift;

2) If the number of translations n_diff is not 0, translate each feature column vector in the time series data set D by one feature unit, and at the same time subtract 1 from n_diff;

3) Fill the empty values produced by the translation with NaN, and add the new features produced by the translation to D_shift according to their corresponding dimensions;

4) Repeat steps 2) and 3) until n_diff is 0;

5) In D_shift, perform a linear operation between all the new features produced by data translation at each moment and the features at the original moment;

6) Obtain the final data set and take it as the initial data set.

Further, each row of data is checked one by one; if a feature value is NaN, the data segment at that moment is marked, and finally all marked data segments are deleted.

Further, the initial time series data set is denoised using ensemble empirical mode decomposition (EEMD), and the TimeGAN generative adversarial network is then used to perform multi-dimensional data enhancement on the time series data of each feature in the denoised initial time series data set, yielding the final time series data set.

A time series data prediction device based on the above time series data prediction method for lung nodule onset comprises:

a time series data set, which adopts a matrix structure in which the time series data of each feature forms a column vector;

a time series feature translation interaction module, used to perform data translation and interaction processing on each feature column vector in the time series data set to obtain an initial time series data set;

a data noise reduction module, used to denoise the initial time series data set;

a data enhancement module, used to perform data enhancement on the denoised initial time series data set to obtain a final time series data set;

a prediction module, used to perform prediction on the final time series data set.

Furthermore, the data noise reduction module denoises the initial time series data set using ensemble empirical mode decomposition (EEMD), the data enhancement module enhances the denoised initial time series data set using the TimeGAN generative adversarial network, and the prediction module performs prediction on the final time series data set using extremely randomized trees.

The beneficial technical effects of the invention are as follows:

1) On a public tuberculosis data set, the prediction method of the invention achieves higher accuracy than the models described above, both in handling ultra-long sequences and in parallelism.

2) To address the insufficient connection between earlier and later data in a sequence, time series feature translation interaction is used, which yields time series features that benefit prediction. Noise is then removed with the EEMD decomposition method, the data are enhanced with the TimeGAN generative adversarial network, and the result is fed into the prediction model, which predicts using the regression mode of the extremely randomized trees (ET) ensemble model, giving high accuracy and a marked effect.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a schematic diagram of results before and after a translation interaction by the time series data translation interaction mechanism of the present invention;

FIG. 3 is a diagram showing a process of data noise reduction by EEMD method;

FIG. 4 is a schematic structural diagram of the time series data prediction device for the time series data prediction method for lung nodule onset according to the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and the specific examples.

As shown in fig. 1, the invention provides a time series data prediction method for lung nodule pathogenesis, which mainly comprises the following steps:

In step S101, a time series feature translation interaction mechanism is used to remedy the extremely randomized trees model's incomplete use of time series features and to add prior knowledge of the time series.

Specifically, time series data on each feature of lung nodule onset are collected to construct a time series data set, denoted D. The data are divided into N dimensions in time order, where N is the number of time series records. In FIG. 2, each square in a column represents a feature, and each row represents a feature over the whole time period. The time series data set is first deep-copied and translated N times in units of column vectors; the newly obtained features, that is, the decomposed original features, are then translated over N steps to generate new features, which are added to the deep-copied data set in the order of their corresponding dimensions, expanding the total number of records in the deep-copied data set to N². As shown in FIG. 2, D'_N is the result of translating D_N by one feature unit. Finally, in the expanded D_shift, a linear operation is performed between all the new features produced by data translation at each moment and the features at the original moment to obtain the final data set.

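The data translation interaction is performed as follows:

1) Deep copy the time series data set D to obtain D_shift;

2) If the number of translations n_diff is not 0, translate each feature column vector in the time series data set D by one feature unit, and at the same time subtract 1 from n_diff;

3) Fill the empty values produced by the translation with NaN, and add the new features produced by the translation to D_shift according to their corresponding dimensions;

4) Repeat steps 2) and 3) until n_diff is 0;

5) In D_shift, perform a linear operation between all the new features produced by data translation at each moment and the features at the original moment;

6) Obtain the final data set, which takes part in the subsequent calculation as the initial data set.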
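The translation interaction steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the data set is represented as a dictionary of feature columns, the helper name `shift_interact` and the generated column names are hypothetical, and the linear operation of step 5), which the text does not specify, is assumed here to be the difference between the original feature and each shifted copy.

```python
import copy
import math


def shift_interact(data, n_diff):
    """Shift every feature column n_diff times and combine with the originals.

    data maps a feature name to a list of floats, one value per time step.
    """
    d_shift = copy.deepcopy(data)              # step 1: deep copy D into D_shift
    current = copy.deepcopy(data)              # working copy that is shifted repeatedly
    for k in range(1, n_diff + 1):             # steps 2-4: one shift per iteration
        for name in data:
            # Move the column down by one time step; the gap is filled with NaN.
            current[name] = [math.nan] + current[name][:-1]
            d_shift[f"{name}_shift{k}"] = list(current[name])
    for name in data:                          # step 5: assumed linear operation
        for k in range(1, n_diff + 1):
            lagged = d_shift[f"{name}_shift{k}"]
            d_shift[f"{name}_diff{k}"] = [a - b for a, b in zip(data[name], lagged)]
    return d_shift                             # step 6: used as the initial data set
```

For a single column `[1.0, 2.0, 4.0, 7.0]` and `n_diff = 2`, the first shifted copy is `[nan, 1.0, 2.0, 4.0]`. The leading rows that contain NaN are the ones the discriminator step removes afterwards.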

In step S102, a time series data discriminator is used to determine whether null values exist in the data of the current feature over all time periods; any null values found are deleted.

Specifically, each row of data is checked one by one; if a feature value is NaN, the data segment at that moment is marked, and finally all marked data segments are deleted.
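The mark-then-delete check can be illustrated with a short sketch. It assumes the data set is a list of rows (one row per moment, one float per feature); the function name `drop_nan_rows` is hypothetical.

```python
import math


def drop_nan_rows(rows):
    """Mark every row that contains a NaN value, then delete all marked rows."""
    # First pass: mark the rows (moments) in which any feature value is NaN.
    marked = [
        any(isinstance(value, float) and math.isnan(value) for value in row)
        for row in rows
    ]
    # Second pass: keep only the unmarked rows.
    return [row for row, is_marked in zip(rows, marked) if not is_marked]
```

Marking first and deleting afterwards keeps the row indices stable while the data are being scanned.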

In step S103, the initial time series data is noise-reduced.

Because the dimensions of the input time series are inconsistent, when the dimension of the input series is small and noise is present, ensemble empirical mode decomposition (EEMD) is chosen to further remove the noise and strengthen the features.

Specifically, in this embodiment, the basic flow of the EEMD method is as follows:

First, the number of noise additions T is determined, and white noise $w_i$ is added to each input feature dimension that number of times; the white noise is given by formula (1). For each feature dimension $s_T(t)$, the empirical mode decomposition algorithm is applied to obtain the P-th order component $A[s_T(t)]$, and the average residual component $r_P$ of order P is then obtained with a mean formula; the specific calculation is given by formulas (2) and (3). The average P-th order IMF (intrinsic mode function) component $C_P$ is obtained from formula (4). Finally, $s(t) = r_P$ is set and the procedure is repeated, so that the IMF components of orders 1 to P-1 are obtained in turn until $r_P$ can no longer be decomposed, that is, $r_P$ satisfies the termination condition and is a monotonic signal.

The result of the decomposition is C, given by formula (5), where C is the set of IMF components of orders 1 to P-1; C is passed as input to the data enhancement module.

$$w_i = (-1)^q \, \varepsilon \, v^i(t) \tag{1}$$

$$q \in [1, 2], \quad i \in [1, T]$$

where $w_i$, $i \in [1, T]$, is the added random white noise, $\varepsilon$ is the standard white noise, and $v^i(t)$ is white noise that follows a normal distribution.

$$s_T(t) = s(t) + \beta_{P-1} E_P(w_T) \tag{2}$$

where $E_P$ is the P-th order empirical mode decomposition component produced by the empirical mode decomposition algorithm, and $\beta_i$, $i \in [1, P-1]$, are constant coefficients.

In formula (3), $A$ denotes the mapping that yields the k-th order residual value from the k-th order IMF component of the empirical mode decomposition.

$$C_P(t) = s(t) - r_P \tag{4}$$

$$C = C_{j,\; j \in [1, P-1]} \tag{5}$$
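The noise construction of formula (1) and the ensemble-mean idea can be sketched as below. This is only an illustration under assumptions: $q \in [1, 2]$ is read as adding each Gaussian draw once with a negative and once with a positive sign, the empirical mode decomposition itself is not implemented, and `decompose` is a hypothetical caller-supplied function standing in for it. The names `paired_white_noise` and `ensemble_average` are likewise hypothetical.

```python
import random


def paired_white_noise(length, trials, epsilon, seed=0):
    """Return 2 * trials noise series w = (-1)**q * epsilon * v(t), q in {1, 2}."""
    rng = random.Random(seed)
    noises = []
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(length)]   # normally distributed v^i(t)
        for q in (1, 2):
            sign = (-1) ** q                               # -1 for q = 1, +1 for q = 2
            noises.append([sign * epsilon * x for x in v])
    return noises


def ensemble_average(signal, noises, decompose):
    """Apply decompose to every noisy copy of signal and average point-wise."""
    total = [0.0] * len(signal)
    for w in noises:
        noisy = [s + n for s, n in zip(signal, w)]
        component = decompose(noisy)                       # stand-in for one EMD output
        total = [acc + c for acc, c in zip(total, component)]
    return [acc / len(noises) for acc in total]
```

Because every draw appears with both signs, the added noise cancels in the ensemble mean; with an identity `decompose` the average reproduces the input signal up to rounding.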

In step S104, the TimeGAN generative adversarial network is used to perform data enhancement on the denoised initial time series data set, enriching the data and yielding the final time series data set.

Data enhancement uses the TimeGAN generative adversarial network algorithm. TimeGAN uses the original data as supervision through a stepwise supervised loss, which lets the model capture the conditional distributions in the time series data. It also introduces an embedding network that provides a reversible mapping between features and latent representations, which reduces the high dimensionality of the adversarial learning space. Finally, the supervised loss and the jointly trained embedding network are used to generate realistic time series data.

Specifically, in this embodiment, the parameters supplied to the TimeGAN generative adversarial network are defined according to requirements, as shown in Table 1 below.

TABLE 1

In step S105, prediction is performed using the regression mode of extremely randomized trees.

Specifically, in this embodiment, extremely randomized trees are chosen because they are sensitive to high-dimensional features, support parallel computation, and can fully select useful features; because of random sampling, the trained model has small variance and good generalization ability, and the implementation in code is simple. Before the regression algorithm of the extremely randomized trees is executed, the data enhanced and cleaned in the preceding steps are divided into a training set and a test set and then standardized. During generation, extremely randomized trees are strongly affected by the number of random attributes P at a single tree node, the minimum number of samples N for splitting a tree node, and the model size S. Grid search is therefore used for parameter optimization, comparing the mean squared error scores obtained with different parameter values, where a larger score is considered better. The core of the implementation consists of four steps: sample selection, feature selection, decision tree construction, and extremely randomized trees prediction. The first three steps build the forest; the test samples are then input into the forest, each tree makes its prediction in turn, and the final prediction is obtained from the results of the individual trees. The results are shown in Table 2.

TABLE 2
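The regression mode of extremely randomized trees can be illustrated with a small standard-library sketch. It is not the embodiment's code: the mapping of the parameters P, N and S to `n_features`, `min_split` and `n_trees` is an assumption, every tree is grown on the full training set (also an assumption), and all function names are hypothetical.

```python
import random
from statistics import fmean


def _sse(values):
    """Sum of squared deviations from the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(X, y, n_features, min_split, rng):
    """Grow one tree whose split thresholds are drawn at random."""
    if len(y) < min_split or max(y) == min(y):
        return {"value": fmean(y)}
    best = None
    for j in rng.sample(range(len(X[0])), n_features):     # random attribute subset
        lo, hi = min(row[j] for row in X), max(row[j] for row in X)
        if lo == hi:
            continue                                       # constant feature in this node
        cut = rng.uniform(lo, hi)                          # random, not optimised, threshold
        left = [i for i, row in enumerate(X) if row[j] < cut]
        right = [i for i, row in enumerate(X) if row[j] >= cut]
        if not left or not right:
            continue
        score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
        if best is None or score < best[0]:
            best = (score, j, cut, left, right)
    if best is None:
        return {"value": fmean(y)}
    _, j, cut, left, right = best
    return {
        "feature": j,
        "cut": cut,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           n_features, min_split, rng),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            n_features, min_split, rng),
    }


def predict_tree(node, row):
    """Follow the splits down to a leaf and return its stored mean."""
    while "value" not in node:
        node = node["left"] if row[node["feature"]] < node["cut"] else node["right"]
    return node["value"]


def fit_forest(X, y, n_trees=50, n_features=1, min_split=2, seed=0):
    """Build n_trees randomized trees on the same training data."""
    rng = random.Random(seed)
    return [build_tree(X, y, n_features, min_split, rng) for _ in range(n_trees)]


def predict_forest(forest, row):
    """Average the predictions of the individual trees."""
    return fmean([predict_tree(tree, row) for tree in forest])
```

The final prediction is the mean over the trees, which matches the description of obtaining the result from the individual tree predictions. Since the trees are independent of each other, they could also be built in parallel.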

In summary, compared with existing time series prediction methods, the method has the following advantages:

1) On a public tuberculosis data set, the model is more accurate than the models described above, both in handling very long sequences and in parallelism.

2) To address the insufficient connection between earlier and later data in a sequence, time series feature interaction is used, which yields time series features that benefit prediction. Noise is removed with the EEMD decomposition method, the data are then enhanced by the TimeGAN module and passed to the prediction stage, and prediction uses the regression mode of the extremely randomized trees (ET), with a marked effect.

Next, a time series prediction device for lung nodule onset according to an embodiment of the present invention is described with reference to the accompanying drawings.

FIG. 4 is a schematic structural diagram of the time series prediction device for the time series prediction method for lung nodule onset according to an embodiment of the present invention.

The time series prediction device of the time series prediction method for lung nodule onset comprises:

a timing feature interaction module 100;

a time series data discriminator module 200;

a data noise reduction module 300;

a data enhancement module 400;

a prediction module 500;

the time series feature interaction module 100 strengthens the connection between earlier and later time series features;

the time series data discriminator module 200 determines whether the current time series feature is present;

the data noise reduction module 300 removes noisy data from the time series data;

the data enhancement module 400 increases the number of samples available to the model and enriches the data set;

the prediction module 500 performs regression prediction by passing the processed data into the extremely randomized trees.
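The chaining of modules 100 to 500 can be pictured as a simple sequential pipeline. The sketch below assumes each module is a function that takes a data set and returns a processed one; `run_device` and the module callables passed to it are hypothetical placeholders, not the device's actual interfaces.

```python
def run_device(dataset, modules):
    """Pass the data set through (name, function) modules in the given order."""
    trace = []
    for name, module in modules:
        dataset = module(dataset)      # output of one module feeds the next
        trace.append(name)             # record the order in which modules ran
    return dataset, trace
```

Keeping the modules as an ordered list makes the fixed order explicit: interaction, discrimination, noise reduction, enhancement, then prediction.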

Further, in one embodiment of the present invention, to remedy the extremely randomized trees model's incomplete use of time series features, the time series feature interaction module is used to add prior knowledge of the time series.

Further, in one embodiment of the present invention, a time series data discriminator is used to determine whether the current data are meaningful; if the current data are null, the value is set to NaN.

Further, in one embodiment of the present invention, the main idea of using the TimeGAN generative adversarial network for data enhancement is to learn the enhancement from the data itself. TimeGAN uses the original data as supervision through a stepwise supervised loss, which lets the model capture the conditional distributions in the time series data. It also introduces an embedding network that provides a reversible mapping between features and latent representations, which reduces the high dimensionality of the adversarial learning space. Finally, the supervised loss and the jointly trained embedding network are used to generate realistic time series data.

Further, in one embodiment of the present invention, extremely randomized trees are chosen because they are sensitive to high-dimensional features, support parallel computation, and can fully select useful features; because of random sampling, the trained model has small variance and good generalization ability, and the implementation in code is simple. Before the regression algorithm of the extremely randomized trees is executed, the data enhanced and cleaned in the preceding steps are divided into a training set and a test set and then standardized. During generation, extremely randomized trees are strongly affected by the number of random attributes P at a single tree node, the minimum number of samples N for splitting a tree node, and the model size S. Grid search is therefore used for parameter optimization, comparing the mean squared error scores obtained with different parameter values, where a larger score is considered better. The core of the implementation consists of four steps: sample selection, feature selection, decision tree construction, and extremely randomized trees prediction. The first three steps build the forest; the test samples are then input into the forest, each tree makes its prediction in turn, and the final prediction is obtained from the results of the individual trees.
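The grid search over P, N and S can be sketched generically as follows. The sketch assumes a hypothetical `fit` callable that trains a model for one parameter combination and returns a prediction function; it selects the combination with the lowest mean squared error on held-out data, which is equivalent to selecting the largest negated error score.

```python
from itertools import product


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def grid_search(param_grid, fit, X_train, y_train, X_val, y_val):
    """Try every parameter combination and return the one with the lowest MSE."""
    names = sorted(param_grid)
    best_params, best_mse = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit(X_train, y_train, **params)        # model: row -> prediction
        error = mean_squared_error(y_val, [model(row) for row in X_val])
        if error < best_mse:
            best_params, best_mse = params, error
    return best_params, best_mse
```

The search is exhaustive, so its cost grows with the product of the grid sizes; a small grid per parameter keeps it tractable.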

It should be noted that the foregoing explanation of the embodiment of the time series data prediction method for lung nodule onset also applies to the device of this embodiment and is not repeated here.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined by those skilled in the art without contradiction.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A time series data prediction method for lung nodule onset, characterized by comprising the following steps:

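2. The time series data prediction method for lung nodule onset of claim 1, wherein: the time series data set is arranged as a matrix in which the time series data of each feature forms a column vector, and the translation interaction processing of the time series data of each feature comprises the following steps:

1) Deep copy the time series data set D to obtain D_shift;

2) If the number of translations n_diff is not 0, translate each feature column vector in the time series data set D by one feature unit, and at the same time subtract 1 from n_diff;

3) Fill the empty values produced by the translation with NaN, and add the new features produced by the translation to D_shift according to their corresponding dimensions;

4) Repeat steps 2) and 3) until n_diff is 0;

5) In D_shift, perform a linear operation between all the new features produced by data translation at each moment and the features at the original moment;

6) Obtain the final data set and take it as the initial data set.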

3. The time series data prediction method for lung nodule onset of claim 2, wherein: each row of data is checked one by one; if a feature value is NaN, the data segment at that moment is marked, and finally all marked data segments are deleted.

4. The time series data prediction method for lung nodule onset of claim 1, wherein: the initial time series data set is denoised using ensemble empirical mode decomposition (EEMD), and the TimeGAN generative adversarial network is then used to perform multi-dimensional data enhancement on the time series data of each feature in the denoised initial time series data set, so as to obtain the final time series data set.

5. A time series data prediction device based on the time series data prediction method for lung nodule onset according to claim 1, characterized by comprising:

a prediction module, used to perform prediction on the final time series data set.

6. The time series data prediction device for lung nodule onset according to claim 5, wherein: the data noise reduction module denoises the initial time series data set using ensemble empirical mode decomposition (EEMD), the data enhancement module enhances the denoised initial time series data set using the TimeGAN generative adversarial network, and the prediction module performs prediction on the final time series data set using extremely randomized trees.