CN116741385B

CN116741385B - Infectious disease cross-border propagation modeling prediction method

Info

Publication number: CN116741385B
Application number: CN202310383533.1A
Authority: CN
Inventors: 田洁; 王英; 宋悦谦; 肖利力; 杨建洲; 袁芳; 张晓龙; 郭惠琳
Original assignee: China Customs Science And Technology Research Center
Current assignee: China Customs Science And Technology Research Center
Priority date: 2023-04-11
Filing date: 2023-04-11
Publication date: 2023-11-14
Anticipated expiration: 2043-04-11
Also published as: CN116741385A

Abstract

The invention discloses a modeling and prediction method for the cross-border spread of infectious diseases. The method uses a regression model and quantitative indicators to assess the importation risk of infectious diseases, providing a more accurate basis for evaluating the effectiveness of prevention and control measures at ports of entry. It addresses the current situation in which customs risk analysis and assessment of infectious diseases mostly remain at the level of description and qualitative analysis; improves the capability for scientific analysis, accurate prediction, dynamic adjustment and objective assessment; accumulates experience and lays a foundation for developing dynamic models of the cross-border transmission of other infectious diseases; and strengthens the ability of customs ports to prevent and control cross-border transmission of infectious diseases.

Description

Infectious disease cross-border propagation modeling prediction method

Technical Field

The present invention relates to the field of infectious disease prediction, and more particularly to a modeling and prediction method for the cross-border spread of infectious diseases.

Background

At present, the use of infectious disease data by customs remains at the stage of descriptive induction and summary. Prevention and control of cross-border transmission risk relies mainly on methods such as expert consultation, and more intuitive and accurate modeling methods are needed to predict and evaluate outcomes. Customs has accumulated a large amount of data on laboratory-detected positive cases of imported infectious diseases, but analysis of these data is still at the descriptive summary stage, and no accurate data model has been established.

To predict the development of infectious diseases in a timely, accurate and reliable manner, mathematical and statistical models need to be used together to model infectious diseases more precisely and to obtain research results based on such models. At present, no model is used to study the dynamics of cross-border infectious disease transmission. The importation risk of cross-border transmission is closely related to the incidence in the exporting country, the number of flights between the exporting and importing countries, the number of travelers, the aircraft type, the density of airport personnel, the environmental conditions at different ports and the preventive measures in place. It is also related to factors such as the population distribution, regional distribution, vaccination status and variant distribution of the disease in the exporting country. For some of these factors the logical relationship can be defined with a mathematical formula; others must be inferred iteratively within the model. A flexible, dynamic, real-time simulation system therefore needs to be constructed to accurately evaluate and control the importation risk of infectious diseases, and to provide theoretical support and visual presentation for customs decision-making at each stage of infectious disease prevention and control.

Disclosure of Invention

The invention aims to provide a modeling and prediction method for the cross-border transmission of infectious diseases that gives theoretical support and visual presentation for customs decision-making at each stage of infectious disease prevention and control.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A first aspect of the invention provides a modeling and prediction method for the cross-border transmission of infectious diseases, comprising the following steps:

Step 1: acquiring raw data comprising the number of infected persons in each country worldwide, the number of inbound aircraft, the origin and number of inbound travelers, the number of laboratory-detected positive cases and the number of published imported cases, and sequentially screening and preprocessing the raw data to obtain an infection curve I(t) for each country;

Step 2: from the infection curves I(t) of the countries obtained by preprocessing, selecting the curve segments that can be fitted by the SIR propagation process, to obtain SIR-fittable curves I_s(t), where s is the index of the segment into which the infection curve I(t) is divided;

Step 3: fitting the SIR propagation process to each curve I_s(t) of each country to obtain a fitted SIR curve and fitting parameters for each country;

Step 4: according to the fitted SIR curve and the fitting parameters, dividing the infection process represented by the curve I_s(t) according to its distribution to obtain divided curve segments, and aggregating over them to obtain the exporting-country infection sequence as a first feature variable and the exporting-country infection stage as a second feature variable;

Step 5: using the divided curve segments, aggregating the number of inbound travelers in the raw data to obtain the total number of inbound travelers as a third feature variable;

Step 6: using the divided curve segments, aggregating the number of imported cases in the raw data to obtain the number of overseas imported cases as a target variable;

Step 7: adding the first, second and third feature variables and the target variable to a regression model, constructing a linear regression model of cross-border infectious disease transmission, and predicting results.

Preferably, preprocessing the data comprises stitching the acquired data and generating, from the raw data, an infection curve I(t) for each country over the epidemic period, that is, a daily time series I(t) of the number of currently infected persons.

Preferably, screening the data comprises:

S1: rejecting countries whose non-zero sequence length is less than or equal to 300;

S2: rejecting sequences with min(I(t)) ≥ 10000000 or max(I(t)) ≤ 1000, that is, sequences whose minimum value is too large or whose maximum value is too small;

S3: deleting the leading data that are all 0 in each country's sequence;

S4: deleting the trailing data that are all 0 in each country's sequence.

Preferably, dividing the infection process represented by the curve I_s(t) according to its distribution further comprises: starting from the peak of the number of infected persons, dividing the curve at equal distances to the left and right according to one complete infection period, to obtain the divided curve segments.

Preferably, the SIR model is constructed from parameter combinations of S (the number of susceptible persons), β (the infection rate), γ (the recovery rate) and I (the initial number of infected persons). N groups are randomly drawn from the predetermined vector of initial parameter values to serve as initial values, N being a constant. The objective function is the residual between the infected-count sequence of the fitted SIR model and the actual infected-count sequence. Each parameter combination is input as an initial value, L-BFGS-B is used as the optimization method, S, β and γ are optimized, and the optimal parameters are output.

Preferably, the method further comprises using R² to evaluate the fitting effect of the fitted SIR curve:

$$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

where $Y_i$, $\hat{Y}_i$ and $\bar{Y}$ denote each true value, each predicted value and the mean of the sequence, respectively.

Preferably, the linear regression model of cross-border infectious disease transmission is constructed as

$$y = \alpha + \lambda x_{inject} + \omega x_{input} + \delta x_{stage} + \epsilon$$

where y is the number of imported positive cases over a period of time; $x_{inject}$, $x_{input}$ and $x_{stage}$ are, respectively, the number of infections in the exporting country over the same period, the number of travelers arriving from the exporting country, and the infection stage of the disease in the exporting country; ε is a random error term; α is a constant; and λ, ω and δ are the coefficients of $x_{inject}$, $x_{input}$ and $x_{stage}$, respectively.

Preferably, during result prediction, SIR process curves that cannot currently be described accurately are recorded as they arise in application; newly appearing, mispredicted curves are added to the configuration table in a timely manner; and SIR curves that should be predictable but are predicted inaccurately are re-fitted or flagged for subsequent analysis and calculation.

A second aspect of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method provided in the first aspect of the invention when executing the program.

A third aspect of the invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method provided by the first aspect of the invention.

The model disclosed by the invention can flexibly and dynamically study, in real time, the influence on infectious disease importation risk of the incidence, variant prevalence and vaccination status in the exporting country, the frequency of flights to and from the importing country, the aircraft type, the density of airport personnel, the environmental conditions at different ports, and the prevention and control measures;

it can identify patterns in importation risk even when factors such as the population distribution and regional distribution of the disease in the exporting country are unknown;

and it provides a theoretical basis and data support for risk assessment, importation early warning and decision-making concerning infectious diseases at national border ports, enables comprehensive evaluation of prevention and control measures, and presents the results in intuitive graphical form.

The beneficial effects of the invention are as follows:

The invention addresses the current situation in which customs risk analysis and assessment of infectious diseases mostly remain at the level of description and qualitative analysis. It improves the capability for scientific analysis, accurate prediction, dynamic adjustment and objective assessment, accumulates experience and lays a foundation for developing dynamic models of the cross-border transmission of other infectious diseases, and strengthens the ability of customs ports to prevent and control cross-border transmission of infectious diseases.

Drawings

The following describes the embodiments of the present invention in further detail with reference to the drawings.

FIG. 1 shows an overall flowchart for modeling infectious disease cross-border propagation.

Fig. 2 shows a schematic diagram of peak intervals.

Fig. 3 shows a SIR model schematic.

Fig. 4 shows a schematic diagram of SIR model fitting process.

Fig. 5 shows a schematic of the fitting result.

Fig. 6 shows a density plot of the SIR model goodness of fit.

Fig. 7 shows a schematic diagram of a prediction process.

Detailed Description

In order to more clearly illustrate the present invention, the present invention will be further described with reference to the preferred embodiments and the accompanying fig. 1 to 7. Like parts in the drawings are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and that this invention is not limited to the details given herein.

To describe the importation risk more directly, the overall scenario is as follows:

Prediction target: the number of overseas imported cases within a given time period T;

Predictor variables: the explanatory variable injectFeatureList for the infection situation of a country within time period T;

the explanatory variable stageList for the infection stage of a country within time period T;

the explanatory variable totalFeatureList for the number of travelers arriving from a country within time period T.

Prediction model: linear regression model.

The specific predictor variables are constructed as follows:

Step 4: according to the fitted SIR curve and the fitting parameters, dividing the infection process represented by the curve I_s(t) according to its distribution to obtain divided curve segments, and aggregating over them to obtain the exporting-country infection sequence (injectFeatureList) as a first feature variable and the exporting-country infection stage (stageList) as a second feature variable;

Step 5: using the divided curve segments, aggregating the number of inbound travelers in the raw data to obtain the total number of inbound travelers from overseas (totalFeatureList) as a third feature variable;

Step 6: using the divided curve segments, aggregating the number of imported cases in the raw data to obtain the number of overseas imported cases (caseNum) as the target variable.
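As a minimal sketch of steps 4 to 6, assuming each divided curve segment is given as a half-open index range over aligned daily series and that the infection stage has already been encoded as a number, the aggregation could look as follows in Python. The function name `summarize_segment`, the dictionary keys and the numeric stage encoding are illustrative assumptions, not part of the original disclosure.

```python
from typing import Dict, Sequence, Tuple


def summarize_segment(
    infected: Sequence[float],
    inbound: Sequence[float],
    cases: Sequence[float],
    segment: Tuple[int, int],
    stage: int,
) -> Dict[str, float]:
    """Aggregate the daily series over one divided curve segment.

    `segment` is a half-open index range (start, end) into the daily series.
    `stage` is a hypothetical numeric code for the infection stage.
    """
    start, end = segment
    return {
        # First feature variable: summed infections in the exporting country.
        "injectFeature": sum(infected[start:end]),
        # Second feature variable: infection stage of the segment.
        "stage": stage,
        # Third feature variable: summed inbound travelers.
        "totalFeature": sum(inbound[start:end]),
        # Target variable: summed imported cases.
        "caseNum": sum(cases[start:end]),
    }


row = summarize_segment(
    infected=[10, 20, 30, 40],
    inbound=[1, 2, 3, 4],
    cases=[0, 1, 0, 2],
    segment=(1, 3),
    stage=2,
)
print(row)
```

Each call yields one row of the regression dataset; repeating it over all segments of all countries would give the full sample table.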

1.1 Data preprocessing and stitching

The raw data provide the sequences of currently infected persons for the three years 2020, 2021 and 2022 separately, and from these a daily time series I(t) of currently infected persons over the epidemic period must be generated for each country. First, the 2020 and 2021 data are combined by an inner join whose primary key is the country (together with its corresponding attribute fields), giving a two-year wide table of historical infection counts for each country. The numerical matrix of this table is denoted H, with entries h_ij, where the row index i represents time, the column index j represents country, and h_ij is the historical number of infected persons. Using the same notation, the historical infected-count sequence of a single country, the column H_·j, is written H(t).
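The stitching step can be sketched as follows, under the assumption that each yearly table is a mapping from country to its daily series. The function name `stitch_years` and the toy data are hypothetical; the inner join is modeled by keeping only countries present in every yearly table.

```python
from typing import Dict, List


def stitch_years(yearly: List[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Inner-join yearly tables on country and concatenate the daily series."""
    if not yearly:
        return {}
    # Inner join: keep only countries that appear in every yearly table.
    common = set(yearly[0])
    for table in yearly[1:]:
        common &= set(table)
    # Concatenate each country's series in chronological order to form H(t).
    return {
        country: [value for table in yearly for value in table[country]]
        for country in sorted(common)
    }


y2020 = {"A": [0, 1, 2], "B": [5, 6, 7]}
y2021 = {"A": [3, 4], "C": [9, 9]}
print(stitch_years([y2020, y2021]))  # {'A': [0, 1, 2, 3, 4]}
```

Countries missing from any year are dropped, which mirrors the behavior of an inner join keyed on country.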

1.2 Data screening and missing value handling

Data screening and missing value handling comprise the following steps:

Step 1: rejecting countries with length({I(t) ≠ 0}) ≤ 300, that is, countries whose non-zero sequence has at most 300 entries, because these sequences are too short for effective analysis;

Step 2: rejecting sequences with min(I(t)) ≥ 10000000 or max(I(t)) ≤ 1000, that is, sequences whose minimum is too large or whose maximum is too small. This leaves the country sequences, and the corresponding sequences of newly added infections, that are suitable for and worth further analysis;

Step 3: deleting the leading data that are all 0 in each country's sequence;

Step 4: deleting the trailing data that are all 0 in each country's sequence.

The first step removes data whose non-zero sequence is too short, because such sequences cannot be fitted by the SIR model. The second step removes sequences whose minimum is too large or whose maximum is too small, because such data are difficult to fit and account for only a very small proportion of the data, so they contribute little to subsequent analysis. The third and fourth steps remove leading and trailing zeros: data equal to 0 cannot be fitted by the SIR model, and zeros at the head and tail of a sequence can be deleted directly without affecting subsequent analysis.

Through the steps, an analyzable sequence of the number of infected persons is obtained.
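The four screening steps can be sketched as a single filter, assuming each country's series is a plain list of daily counts. The thresholds come from the text; the function name `screen_series` and the reading of step 2 as an "or" condition are assumptions of this sketch.

```python
from typing import List, Optional

MIN_NONZERO_LEN = 300        # Step 1: non-zero length must exceed this
MIN_TOO_LARGE = 10_000_000   # Step 2: reject if min(I(t)) >= this
MAX_TOO_SMALL = 1_000        # Step 2: reject if max(I(t)) <= this


def screen_series(series: List[int]) -> Optional[List[int]]:
    """Return the trimmed series, or None if the country is rejected."""
    # Step 1: reject sequences with too few non-zero observations.
    nonzero_count = sum(1 for value in series if value != 0)
    if nonzero_count <= MIN_NONZERO_LEN:
        return None
    # Step 2: reject sequences whose minimum is too large or maximum too small.
    if min(series) >= MIN_TOO_LARGE or max(series) <= MAX_TOO_SMALL:
        return None
    # Step 3: drop leading zeros.
    start = 0
    while start < len(series) and series[start] == 0:
        start += 1
    # Step 4: drop trailing zeros.
    end = len(series)
    while end > start and series[end - 1] == 0:
        end -= 1
    return series[start:end]


kept = screen_series([0, 0] + [2000] * 301 + [0])
print(len(kept))  # 301
```

Zeros in the interior of a sequence are left in place here, since the text only calls for removing zeros at the head and tail.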

Epidemic processes are mostly complex. Several infection processes may overlap within the same period, or errors may occur during data collection; in the data these appear as abnormal fluctuations in the number of infected persons and similar complications (see the interval t₂ to t₃ in Fig. 2). By screening the existing infection sequences of each country, the portions of each curve that can be well fitted by the SIR propagation process are finally selected.

As shown in Fig. 2, (t₁, t₂) and (t₃, t₄) are the curve segments selected for fitting the SIR propagation process.

2.1 SIR model overview

The SIR transmission model describes how the uninfected, infected and recovered populations change over time after an infectious disease emerges. Specifically:

The total population (N) involved in transmission is divided into three parts: S (susceptible), I (infected) and R (recovered). During transmission, an infected person (I) comes into contact with a susceptible person (S), who becomes infected with a certain probability. After a period of time the infected person recovers, leaves the infected population and becomes a recovered person (R). For various reasons, a recovered person does not become infected again within a short period. Let t denote time, β the infection rate and γ the recovery rate; the model can then be expressed by the following differential equations:
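A standard form of the SIR system consistent with this description, given here as an assumption because the original formula is not available in this text, uses mass-action incidence:

$$\frac{dS}{dt} = -\beta S I, \qquad \frac{dI}{dt} = \beta S I - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$

Summing the three equations gives $\frac{d}{dt}(S + I + R) = 0$, so the total population $N = S + I + R$ is conserved. A common alternative writes the incidence term as $\beta S I / N$; the two forms differ only by a rescaling of β.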

By optimizing model parameters such as I₀, S, β and γ, the transmission of the epidemic as a whole over a period of time can be determined.
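To illustrate how one parameter set determines a whole curve, the sketch below integrates an SIR system with forward-Euler steps and returns the daily infected count. It assumes the mass-action form dS/dt = -βSI, dI/dt = βSI - γI; the function name `simulate_sir`, the step size and the example parameter values are illustrative assumptions rather than values from the original.

```python
from typing import List


def simulate_sir(
    s0: float,
    i0: float,
    beta: float,
    gamma: float,
    days: int,
    steps_per_day: int = 10,
) -> List[float]:
    """Return the infected count I(t), sampled once per day."""
    s, i = float(s0), float(i0)
    dt = 1.0 / steps_per_day
    curve = [i]
    for _day in range(days - 1):
        for _step in range(steps_per_day):
            # New infections in this sub-step, capped so S never goes negative.
            new_infections = min(beta * s * i * dt, s)
            recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - recoveries
        curve.append(i)
    return curve


# Hypothetical parameters: 10,000 susceptible, 10 initially infected.
curve = simulate_sir(s0=10_000, i0=10, beta=3e-5, gamma=0.1, days=120)
peak_day = curve.index(max(curve))
print(peak_day, round(max(curve)))
```

Forward Euler is chosen only for brevity; a higher-order integrator would be a natural replacement where accuracy matters.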

2.2 Role of the SIR epidemic model

In fact, the SIR model described above cannot fully describe the course of an infectious disease. The main reason is that it models the transmission process too coarsely: only two parameters, γ and β, describe recovery and infection, so the factors involved are represented rather simply. In practice, the governments of most countries and regions take measures to limit the spread of an infectious disease once it appears. The effect of such interventions is difficult to represent in the existing model, and the corresponding deviations are unavoidable.

From the above analysis, although the SIR model does not accurately describe the infection process under real conditions, the collected observational data show that the basic SIR model can effectively describe the infection process over a certain period. In this scheme, therefore, the SIR model is used to identify infection processes effectively and to remove noise points, ultimately producing features that are directly relevant for predicting imported cases.

2.3 Model fitting and parameter solving

In this project, constructing the SIR model requires four parameters: S (the number of susceptible persons), β (the infection rate), γ (the recovery rate) and I (the initial number of infected persons). To search the parameter space efficiently and improve the SIR fit, different parameter combinations are used as initial inputs and the search is run over all of them for the best fit. N groups are randomly drawn from the predetermined vector of initial parameter values to serve as initial values, and the following optimization and fitting procedure is executed.

Parameter optimization uses the optim function in the R language. The objective function is the sum of the residuals between the infected-count sequence of the fitted SIR model and the actual infected-count sequence (calculated with the dist function in R). Each parameter combination is input as an initial value, L-BFGS-B is used as the optimization method, S, β and γ are optimized, and the optimal parameters are output. For each data segment there are N optimizations from different initial values, and the set of optimized parameters with the smallest residual is output. The optimization can be described by the following formula:

$$\min_{S, \beta, \gamma} \left\| I_{SIR}(S, \beta, \gamma, I) - I_{real} \right\|_2$$

where $I_{SIR}(S, \beta, \gamma, I)$ is the infection sequence of the SIR model determined by the parameters and $I_{real}$ is the true infection sequence; the objective function is the residual between the two, that is, the L₂ norm of their difference. With N = 25, the result with the smallest residual among the 25 parameter groups is taken as the final output parameters.

The optim function in R provides several parameter optimization methods; L-BFGS-B is the one selected here.
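The multi-start fitting procedure can be sketched in Python with `scipy.optimize.minimize` and `method="L-BFGS-B"`, standing in for R's optim. This is a sketch under stated assumptions, not the original implementation: the mass-action SIR form, the Euler integrator, the parameter bounds, the scaling constants `S_SCALE` and `BETA_SCALE`, and all function names are hypothetical, and the initial infected count is taken from the first observation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical scaling so that all optimized parameters are of order one.
S_SCALE, BETA_SCALE = 1e4, 1e-4
# Hypothetical bounds on (S / S_SCALE, beta / BETA_SCALE, gamma).
BOUNDS = [(0.1, 10.0), (0.01, 2.0), (0.01, 1.0)]


def sir_infected(s0, i0, beta, gamma, days, substeps=2):
    """Daily infected counts from a forward-Euler SIR integration."""
    s, i = float(s0), float(i0)
    dt = 1.0 / substeps
    out = np.empty(days)
    out[0] = i
    for day in range(1, days):
        for _ in range(substeps):
            new_inf = min(beta * s * i * dt, s)  # cap keeps S non-negative
            s -= new_inf
            i += new_inf - gamma * i * dt
        out[day] = i
    return out


def residual(theta, observed):
    """L2 norm of the difference between fitted and observed sequences."""
    s_scaled, b_scaled, gamma = theta
    fitted = sir_infected(s_scaled * S_SCALE, observed[0],
                          b_scaled * BETA_SCALE, gamma, len(observed))
    return float(np.linalg.norm(fitted - observed))


def fit_sir(observed, n_starts=25, seed=0):
    """Run L-BFGS-B from n_starts random initial values; keep the best."""
    observed = np.asarray(observed, dtype=float)
    rng = np.random.default_rng(seed)
    lows = np.array([lo for lo, _ in BOUNDS])
    highs = np.array([hi for _, hi in BOUNDS])
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lows, highs)  # one random initial parameter group
        res = minimize(residual, x0, args=(observed,),
                       method="L-BFGS-B", bounds=BOUNDS)
        if best is None or res.fun < best.fun:
            best = res
    s_scaled, b_scaled, gamma = best.x
    return {"S": s_scaled * S_SCALE, "beta": b_scaled * BETA_SCALE,
            "gamma": float(gamma), "residual": float(best.fun)}
```

A call such as `fit_sir(observed_segment, n_starts=25)` returns the parameter set with the smallest residual across the random starts. Working in scaled parameters is a design choice of this sketch: it keeps the finite-difference gradients used by L-BFGS-B well conditioned when S and β differ by many orders of magnitude.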

When the fit is poor, the numerical precision of the calculation needs to be examined and the parameters rescaled by constant factors. In this fitting, the parameter b (β) needs to be multiplied by a constant factor of 10000, while the other parameters need less scaling. The optimal S, β, γ and I and the coordinates of the peak are obtained for each interval. An example is shown in Fig. 5, where the solid line is the fitted curve and the points are the truncated original interval; the output parameters of the fitted curve are shown, for example, in Table 1.

Table 1 parameter output table

As shown in Table 1, the data finally output after fitting contain, in addition to the original interval data (country, left and right interval bounds, peak value), the infection parameters of the SIR model: bVecOpt (infection rate β), gVeclopt (recovery rate γ), initial opt (initial number of infected persons I) and sVecOpt (number of susceptible persons S).

2.4 Evaluation of the SIR model fit

The optimal parameters of the SIR model are obtained through the parameter solving process above. The fit of each curve is evaluated using R², defined as:

$$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

where $Y_i$, $\hat{Y}_i$ and $\bar{Y}$ denote each true value, each predicted value and the mean of the sequence, respectively. The closer R² is to 1, the better the fit of the curve; a curve is considered well fitted when R² > 0.4. In the SIR model evaluation, R² is calculated for each fitted infection segment and a density plot is drawn. As shown in Fig. 6, most curves have R² above 0.75 and only a very small fraction have R² < 0.4. Overall, the mean of all R² values is about 0.86, showing that the SIR model describes the changes in infection status in different countries and regions well.
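The goodness-of-fit check can be illustrated with a short function that computes R² as one minus the ratio of the residual sum of squares to the total sum of squares, and applies the 0.4 threshold from the text. The function names are hypothetical.

```python
from typing import Sequence

GOOD_FIT_THRESHOLD = 0.4  # threshold stated in the text


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination between true and fitted sequences."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def is_good_fit(y_true: Sequence[float], y_pred: Sequence[float]) -> bool:
    """A curve counts as well fitted when R² exceeds the threshold."""
    return r_squared(y_true, y_pred) > GOOD_FIT_THRESHOLD


print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0
```

Predicting the mean everywhere gives R² = 0, and a fit worse than the mean gives a negative value, so the 0.4 threshold excludes both cases.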

3.1 Introduction to the sample dataset

After preliminary data processing and SIR fitting, the results are aggregated into the dataset that is finally input to the regression model. The variable descriptions and sample records of the dataset are shown, for example, in Tables 2 and 3.

Table 2 Description of the regression analysis sample data

Table 3 Example regression analysis samples

After the SIR model output, the original data sequence is truncated and processed. In addition to the SIR model parameters, the dataset contains the infection stage stageList, the peak feature peaksListVec of the corresponding period, the number of infections injectFeatureList over the same period (calculated as the sum of infected persons over the period), the total number of inbound travelers totalFeatureList, and the total number of imported cases caseFeatureList.

3.2 Model training procedure

Training uses N-fold cross-validation to ensure that the model is stable and reasonable on the sample set. Specifically, the original dataset is randomly split into a training set and a test set at a ratio of 7:3, linear regression models are fitted and tested, and the most suitable model result is selected.
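The data splitting described above can be sketched with index-based helpers. Both function names and the seeding scheme are assumptions of this sketch; the text does not specify how the folds are formed, so simple shuffled, interleaved folds are used here.

```python
import random
from typing import List, Tuple


def train_test_split_7_3(n: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices 0..n-1 into 70% training and 30% test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = (7 * n) // 10
    return indices[:cut], indices[cut:]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index pairs covering all samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Interleaved slices give folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        pairs.append((training, folds[i]))
    return pairs


train, test = train_test_split_7_3(10, seed=1)
print(len(train), len(test))  # 7 3
```

Fixing the seed makes the split reproducible, which helps when comparing candidate models on the same partition.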

3.3 Analysis of the fitted model

In this project, the model structure obtained by training is:

$$y = \alpha + \lambda x_{inject} + \omega x_{input} + \delta x_{stage} + \epsilon$$

where y is the number of imported positive cases over a period of time; $x_{inject}$, $x_{input}$ and $x_{stage}$ are, respectively, the number of infections in the exporting country over the same period, the number of travelers arriving from the exporting country, and the infection stage of the disease in the exporting country; ε is a random error term; α is a constant; and λ, ω and δ are the coefficients of $x_{inject}$, $x_{input}$ and $x_{stage}$, respectively.
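A least-squares fit of this model structure can be sketched with NumPy. The use of `numpy.linalg.lstsq`, the function names and the synthetic data are assumptions for illustration; the original does not state which solver is used.

```python
import numpy as np


def fit_linear(x_inject, x_input, x_stage, y):
    """Estimate (alpha, lambda, omega, delta) by ordinary least squares."""
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for alpha, then the three features.
    design = np.column_stack([np.ones(len(y)), x_inject, x_input, x_stage])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, x_inject, x_input, x_stage):
    """Evaluate y = alpha + lambda*x_inject + omega*x_input + delta*x_stage."""
    alpha, lam, omega, delta = coef
    return (alpha + lam * np.asarray(x_inject, dtype=float)
            + omega * np.asarray(x_input, dtype=float)
            + delta * np.asarray(x_stage, dtype=float))


# Synthetic, noise-free data generated from hypothetical coefficients.
rng = np.random.default_rng(0)
x_inject = rng.uniform(0, 1e4, 30)
x_input = rng.uniform(0, 500, 30)
x_stage = rng.integers(1, 4, 30).astype(float)
y = 2.0 + 0.001 * x_inject + 0.05 * x_input + 1.5 * x_stage
print(np.round(fit_linear(x_inject, x_input, x_stage, y), 4))
```

On noise-free data the fit recovers the generating coefficients; with real data the residual ε absorbs whatever the three features cannot explain.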

4.1 Prediction process

The prediction process is shown schematically in Fig. 7. Owing to the nature of the SIR model itself (the overall process can be determined from part of the parameters), two prediction processes are given here from the standpoint of practical application:

Prediction process 1: construct variables analogous to those of the linear model and predict directly with the linear model, as follows:

Step 1: the prediction formula is determined as

$$\mathrm{case}_t = \mathrm{pre}(\mathrm{Inject}_t, \mathrm{input}_t, \mathrm{stage}_t)$$

Step 2: the following variables are constructed:

Inject_t: the aggregated number of infected persons over the same time period (or the same type of time period), taken from the portion well fitted by SIR;

input_t: the aggregated number of inbound travelers over the same time period;

stage_t: the predetermined infection stage (which may be determined from the originally provided table or selected at random). If the infection waveform is incomplete, the SIR model can be fitted first and the regression model then evaluated for each of the possible stages (this part is treated as an input; if enough observed data are available it can be defined and explained). In short, for the purpose of predicting importation risk, the prediction can be run several times to obtain upper and lower limits;

Step 3: the linear model is used to predict the number of imported cases case_t, that is, the importation risk.
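Prediction process 1 with an uncertain infection stage can be sketched by evaluating the linear model once per candidate stage and reporting the smallest and largest predictions as bounds. The coefficient values, function names and numeric stage codes below are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def predict_cases(coef: Sequence[float], inject: float, inbound: float, stage: float) -> float:
    """case_t = alpha + lambda*Inject_t + omega*input_t + delta*stage_t."""
    alpha, lam, omega, delta = coef
    return alpha + lam * inject + omega * inbound + delta * stage


def prediction_bounds(
    coef: Sequence[float],
    inject: float,
    inbound: float,
    candidate_stages: Iterable[float],
) -> Tuple[float, float]:
    """Lower and upper predicted case counts over the possible stages."""
    predictions = [predict_cases(coef, inject, inbound, s) for s in candidate_stages]
    return min(predictions), max(predictions)


# Hypothetical coefficients (alpha, lambda, omega, delta) and inputs.
coef = (1.0, 0.5, 2.0, 10.0)
print(prediction_bounds(coef, inject=100, inbound=10, candidate_stages=[1, 2, 3]))
# (81.0, 101.0)
```

When the stage is known, passing a single candidate collapses the bounds to one point prediction.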

Prediction process 2: predict in combination with the feature table (the table of fitted SIR initial values), as follows:

Step 1: the input is determined as an initial infection fragment object_[t,T], where [t, T] is the observation interval;

Step 2: simulation is performed using the relevant epidemic parameters in the feature table (Table 2) to obtain curves I(t);

Step 3: from each simulated curve, the portion I(t, T) most similar to the input fragment is extracted, its distance from the input object_[t,T] is calculated, and the curve with the smallest distance is output;

Step 4: the resulting epidemic curve is used to aggregate the features, prediction process 1 is carried out, and the predicted value is output.
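Steps 1 to 3 of prediction process 2 can be sketched as a nearest-curve search: each parameter row of the feature table is simulated, the observed fragment is slid along the simulated curve, and the row with the smallest Euclidean distance is kept. The mass-action SIR form, the Euclidean distance, the row layout of the feature table and all names are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def simulate_sir(s0, i0, beta, gamma, days, steps_per_day=10) -> List[float]:
    """Daily infected counts from a forward-Euler SIR integration."""
    s, i = float(s0), float(i0)
    dt = 1.0 / steps_per_day
    curve = [i]
    for _day in range(days - 1):
        for _step in range(steps_per_day):
            new_infections = min(beta * s * i * dt, s)
            s -= new_infections
            i += new_infections - gamma * i * dt
        curve.append(i)
    return curve


def best_window(curve: Sequence[float], fragment: Sequence[float]) -> Tuple[float, int]:
    """Smallest L2 distance between the fragment and any window of the curve."""
    m = len(fragment)
    best_distance, best_start = math.inf, 0
    for start in range(len(curve) - m + 1):
        d = math.sqrt(sum((curve[start + k] - fragment[k]) ** 2 for k in range(m)))
        if d < best_distance:
            best_distance, best_start = d, start
    return best_distance, best_start


def match_fragment(fragment: Sequence[float], feature_table: List[Dict[str, float]],
                   days: int = 200) -> Dict[str, object]:
    """Return the feature-table row whose simulated curve is closest."""
    best = None
    for row in feature_table:
        curve = simulate_sir(row["S"], row["I0"], row["beta"], row["gamma"], days)
        distance, start = best_window(curve, fragment)
        if best is None or distance < best["distance"]:
            best = {"row": row, "distance": distance, "start": start, "curve": curve}
    return best
```

The matched curve can then be aggregated into features and passed to prediction process 1, as in step 4. The `start` offset indicates where the observed fragment sits within the simulated epidemic, which could help in assigning an infection stage.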

4.2 Scope of prediction monitoring

In the prediction process, besides the items that are routinely monitored, SIR process curves that the current feature table (Table 2) cannot describe accurately are also recorded as they arise in application. Newly appearing, mispredicted curves are added to the configuration table in a timely manner, and SIR curves that should be predictable but are predicted inaccurately are re-fitted or flagged for subsequent analysis and calculation.

Applications of the prediction results include:

(1) Judging, from the predicted number of imported positive cases, the probability of importing infectious cases given the current epidemic situation in the exporting country and the traffic to the importing country.

(2) Assessing, grading and dynamically adjusting infectious disease prevention and control measures at ports and on conveyances according to the number of imported positive cases.

(3) Re-estimating the importation risk by adjusting the model parameters when incidence abroad changes, the virus mutates or vaccination is introduced.

(4) Estimating the probability of infecting other passengers during travel by jointly considering factors related to the density of the conveyance, such as the aircraft load factor and seat spacing.

The present embodiment also provides a nonvolatile computer storage medium, which may be the nonvolatile computer storage medium included in the apparatus described in the above embodiment, or may be a nonvolatile computer storage medium that exists alone and is not incorporated in a terminal.

In the description of the present invention, it should be noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

It should be understood that the foregoing examples of the present invention are provided merely for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention, and that various other changes and modifications may be made therein by one skilled in the art without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A method for modeling and predicting cross-border spread of infectious diseases, comprising:

Step 4: according to the fitted SIR curve and the fitting parameters, dividing the infection process represented by the curve I_s(t) according to its distribution to obtain divided curve segments, and aggregating over them to obtain the exporting-country infection sequence as a first feature variable and the exporting-country infection stage as a second feature variable;

Step 7: adding the first feature variable, the second feature variable, the third feature variable and the target variable to a regression model, constructing a linear regression model of cross-border infectious disease transmission, and predicting results;

the linear regression model of cross-border infectious disease transmission being constructed as

$$y = \alpha + \lambda x_{inject} + \omega x_{input} + \delta x_{stage} + \epsilon$$

where y is the number of imported positive cases over a period of time; $x_{inject}$, $x_{input}$ and $x_{stage}$ are, respectively, the number of infections in the exporting country over the same period, the number of travelers arriving from the exporting country, and the infection stage of the disease in the exporting country; ε is a random error term; α is a constant; and λ, ω and δ are the coefficients of $x_{inject}$, $x_{input}$ and $x_{stage}$, respectively.

2. The method according to claim 1, wherein the preprocessing of the data comprises stitching the acquired data to generate an infection curve I (t), i.e. a time series I (t) of the number of existing infections per day, for each country during the infection of the infectious disease on the basis of the raw data.

3. The method of claim 1, wherein the screening the data comprises,

s3: deleting all header data of 0 in each country sequence;

s4: and deleting tail data with all 0 in each country sequence.

4. The method according to claim 1, wherein dividing the infection process represented by the curve I_s(t) according to its distribution further comprises: starting from the peak of the number of infected persons, dividing the curve at equal distances to the left and right according to one complete infection period, to obtain the divided curve segments.

5. The method of claim 1, wherein the SIR model is constructed from parameter combinations of S (the number of susceptible persons), β (the infection rate), γ (the recovery rate) and I (the initial number of infected persons); N groups are randomly drawn from the predetermined vector of initial parameter values to serve as initial values, N being a constant; the objective function is the residual between the infected-count sequence of the fitted SIR model and the actual infected-count sequence; and each parameter combination is input as an initial value, L-BFGS-B is used as the optimization method, S, β and γ are optimized, and the optimal parameters are output.

6. The method of claim 5, further comprising using R² to evaluate the fitting effect of the fitted SIR curve:

$$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

where $Y_i$, $\hat{Y}_i$ and $\bar{Y}$ denote each true value, each predicted value and the mean of the sequence, respectively.

7. The method of claim 1, wherein, during result prediction, SIR process curves that cannot currently be described accurately are recorded as they arise in application; newly appearing, mispredicted curves are added to the configuration table in a timely manner; and SIR curves that should be predictable but are predicted inaccurately are re-fitted or flagged for subsequent analysis and calculation.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-7 when the program is executed by the processor.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.