CN116543911A

CN116543911A - Disease risk prediction model training method and device

Info

Publication number: CN116543911A
Application number: CN202310390036.4A
Authority: CN
Inventors: 杨远富; 张璐; 李泽铭; 杨欧洲
Original assignee: Shenzhen Arts Changhua Intelligent Technology Co ltd
Current assignee: Shenzhen Arts Changhua Intelligent Technology Co ltd
Priority date: 2023-04-04
Filing date: 2023-04-04
Publication date: 2023-08-04

Abstract

The embodiments of the application disclose a disease risk prediction model training method and device. The method comprises: acquiring first statistics, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease times of the plurality of users, age characteristics of the plurality of users and disease identifiers of the plurality of users, wherein N is an integer greater than 0; and training a disease risk score model to be trained according to the first statistics, the N second statistics, the genotype data, the historical disease codes, the disease times, the age characteristics and the disease identifiers of the plurality of users to obtain a disease risk score prediction model, wherein the disease risk score prediction model is used for predicting disease risk scores of users. In the embodiments of the application, multi-source data are integrated and complex heterogeneous data are modeled with a machine learning model, so that the prediction results are more effective, accurate and reliable.

Description

Disease risk prediction model training method and device

Technical Field

The invention relates to the field of intelligent medical treatment, in particular to a disease risk prediction model training method and device.

Background

Risk prediction for complex diseases is of great importance to clinical diagnosis. As an auxiliary diagnostic tool, it helps primary health care doctors identify high-risk patients more accurately, provide them with targeted diet and lifestyle interventions and drug treatment in time, and avoid over-treating low-risk people and wasting medical resources. Risk prediction for complex diseases mainly takes two forms. One predicts the disease from a number of specific, validated risk factor variables; for example, the features commonly used to predict cardiovascular disease are age, sex, total cholesterol, low-density lipoprotein, high-density lipoprotein, systolic blood pressure, diastolic blood pressure, smoking and diabetes. The other uses genomic variation to predict an individual's risk of developing a disease. However, studies have shown that most complex diseases are not affected by a single factor alone, and that different diseases often interact, i.e. a patient's disease history also has a certain effect on the occurrence of future diseases. Predicting complex diseases from only a single type of information is therefore limited and insufficiently accurate.

Disclosure of Invention

The embodiment of the application provides a disease risk prediction model training method and device, which enable the prediction result of a disease risk score prediction model to have higher effectiveness, higher accuracy and higher reliability by integrating multi-source data and modeling complex heterogeneous data by using a machine learning model.

In a first aspect, an embodiment of the present application provides a disease risk prediction model training method, including:

acquiring first statistics, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease time of the plurality of users, age characteristics of the plurality of users and disease marks of the plurality of users, wherein the first statistics comprise a plurality of groups of incidence relation statistics of single nucleotide polymorphisms and life habits, each of the N second statistics comprises a plurality of groups of incidence relation statistics of single nucleotide polymorphisms and specified diseases, the disease marks are used for marking whether the users are ill or not, and N is an integer larger than 0;

determining lifestyle habits with causal relationships to the specified disease according to the first statistics and the N second statistics;

determining a polygenic risk score according to the N second statistics and genotype data of the plurality of users;

determining historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users;

training a disease risk score model to be trained according to living habits with causal relation with specified diseases, polygenic risk scores, historical disease information of a plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users to obtain a disease risk score prediction model, wherein the disease risk score prediction model is used for predicting the disease risk scores of the users.

By integrating and modeling through combining life habits with causal relation to specified diseases, polygenic risk scores, historical disease information of a plurality of users, age characteristics of a plurality of users and disease marks of a plurality of users, the prediction result of the disease risk score prediction model is more effective, accurate and reliable, and diseases possibly occurring in the future of healthy people can be accurately predicted.

In one possible design, the living habits having a causal relation with the specified disease, the polygenic risk scores, the historical disease information of the plurality of users and the age characteristics of the plurality of users are taken as independent variables of the disease risk score model to be trained, and the disease identifications of the plurality of users are taken as dependent variables of the disease risk score model to be trained; based on the independent variables and the dependent variables, the disease risk score model to be trained is trained to obtain the disease risk score prediction model. Data fitting is performed on the independent variables and the dependent variables to obtain the disease risk score prediction model, and combining multi-source data improves the expressive power of the data, so that the prediction result of the disease risk score prediction model is more effective, accurate and reliable.
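As a non-limiting illustration, the fitting of the independent variables to the dependent variable can be sketched in Python as follows. The sketch assumes a logistic regression trained by batch gradient descent, which the application does not specify; the function names train_risk_model and predict_risk, the feature values and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_risk_model(X, y, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient descent.

    X: list of feature rows (independent variables).
    y: list of 0/1 disease identifiers (dependent variable).
    The model family is an assumption made for illustration only.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
            err = p - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk(w, b, row):
    # Disease risk score in [0, 1] for one user.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Hypothetical, already-scaled features per user:
# [lifestyle habit, polygenic risk score, historical disease feature, age]
X = [
    [1.0, 0.9, 0.8, 0.7],
    [1.0, 0.7, 0.6, 0.9],
    [0.0, 0.2, 0.1, 0.3],
    [0.0, 0.1, 0.3, 0.2],
]
y = [1, 1, 0, 0]  # disease identifiers: 1 = ill, 0 = not ill

w, b = train_risk_model(X, y)
high = predict_risk(w, b, [1.0, 0.8, 0.7, 0.8])
low = predict_risk(w, b, [0.0, 0.1, 0.2, 0.2])
```

In this toy setting, a user whose features resemble those of the ill users receives a higher score than a user whose features resemble those of the users who are not ill.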

In another possible design, N data pairs are obtained according to the first statistics and the N second statistics; the N data pairs are input into a Mendelian randomization model to obtain N Mendelian randomization results; and living habits with causal relation with the specified diseases are determined according to the N Mendelian randomization results. The Mendelian randomization model is adopted to screen life habit characteristics with causal relation with the specified diseases, so that the contribution of life habits to disease prediction can be effectively calculated, and a foundation is laid for reducing disease risks by improving the life habits in a targeted manner.
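As a non-limiting illustration, one way of obtaining a Mendelian randomization result for each data pair can be sketched in Python as follows. The application does not name a specific estimator; the sketch assumes the fixed-effect inverse-variance weighted (IVW) estimator, and the function name ivw_estimate and all numeric values are hypothetical.

```python
import math


def ivw_estimate(snps):
    """Fixed-effect inverse-variance weighted (IVW) estimate for one data pair.

    snps: list of (beta_exposure, beta_outcome, se_outcome), one tuple per SNP,
    where the exposure is the lifestyle habit and the outcome is the disease.
    Returns the causal effect estimate, its standard error and a two-sided
    P value from the standard normal distribution.
    """
    num = sum(bx * by / se ** 2 for bx, by, se in snps)
    den = sum(bx ** 2 / se ** 2 for bx, by, se in snps)
    beta = num / den
    se = math.sqrt(1.0 / den)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value


# Hypothetical data pairs (N = 2): in the first, the SNP effects on the
# disease are proportional to their effects on the lifestyle habit; in the
# second, they are not related.
data_pairs = [
    [(0.10, 0.050, 0.010), (0.20, 0.100, 0.020), (0.15, 0.075, 0.015)],
    [(0.10, 0.001, 0.010), (0.20, -0.002, 0.020), (0.15, 0.000, 0.015)],
]

# One Mendelian randomization result per data pair.
results = [ivw_estimate(pair) for pair in data_pairs]
```

Each entry of results corresponds to one of the N data pairs and can then be examined to decide whether the lifestyle habit is treated as causally related to the specified disease.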

In another possible design, N mendelian randomization results are analyzed to obtain an analysis result; determining whether living habits in the first statistics have causal relation with the specified diseases according to the N Mendelian randomization results and the analysis results; if so, the lifestyle habit in the first statistics is determined as a lifestyle habit having a causal relationship with the specified disease. According to the N Mendelian randomization results and the analysis results, the method is used for screening life habit features with causal relation with the specified diseases, and laying a foundation for reducing disease risks by improving the life habits in a targeted manner.

In another possible design, the first statistics and the N second statistics are matched to obtain N original data pairs, where each of the N original data pairs includes a plurality of sets of association statistics of single nucleotide polymorphisms with lifestyle habits and a plurality of sets of association statistics of single nucleotide polymorphisms with specified diseases, and each set of association statistics of single nucleotide polymorphisms with lifestyle habits includes a minor allele frequency and a significance P value; and each of the N original data pairs is screened according to the minor allele frequency and the significance P value to obtain the N data pairs. The single nucleotide polymorphisms in the first statistics are screened based on the minor allele frequencies and the significance P values, so that it can subsequently be determined whether the lifestyle habits in the first statistics have causal relationships with the specified disease.

In another possible design, for each of the N original data pairs, it is determined whether the significance P value of each set of association statistics of single nucleotide polymorphisms with lifestyle habits is less than a first threshold, and whether the minor allele frequency of each set of association statistics of single nucleotide polymorphisms with lifestyle habits is less than a second threshold; when, in an original data pair, the significance P value of the association statistics of a first single nucleotide polymorphism with lifestyle habits is less than the first threshold and the minor allele frequency of those association statistics is less than the second threshold, the association statistics of the first single nucleotide polymorphism with lifestyle habits are selected from that original data pair; and the N data pairs are determined according to the association statistics of the first single nucleotide polymorphism with lifestyle habits in each original data pair. The single nucleotide polymorphisms in the first statistics are screened in this way, so that it can subsequently be determined whether the lifestyle habits in the first statistics have causal relationships with the specified disease.
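As a non-limiting illustration, the threshold screening described above can be sketched in Python as follows. The comparison directions follow the wording of this design; the function name screen_snps, the record layout, the SNP identifiers and the threshold values are hypothetical, since the application does not fix them here.

```python
def screen_snps(records, p_threshold, maf_threshold):
    """Select association statistics whose significance P value is below the
    first threshold and whose minor allele frequency is below the second
    threshold, as described in the text."""
    return [
        r for r in records
        if r["p_value"] < p_threshold and r["maf"] < maf_threshold
    ]


# Hypothetical SNP-lifestyle association statistics from one original data pair.
records = [
    {"snp_id": "snp_a", "maf": 0.20, "p_value": 1e-9},
    {"snp_id": "snp_b", "maf": 0.45, "p_value": 1e-9},  # fails the MAF check
    {"snp_id": "snp_c", "maf": 0.20, "p_value": 1e-3},  # fails the P value check
]

# Placeholder thresholds chosen only for the example.
selected = screen_snps(records, p_threshold=5e-8, maf_threshold=0.30)
selected_ids = [r["snp_id"] for r in selected]
```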

In another possible design, a first database and a second database are obtained, the first database comprises a plurality of independent single nucleotide polymorphisms, the second database comprises a plurality of groups of association statistics of single nucleotide polymorphisms and other habits, the association statistics of each group of single nucleotide polymorphisms and other habits comprise significance P values, and the significance P values of the association statistics of each group of single nucleotide polymorphisms and other habits are all smaller than a third threshold; matching the incidence relation statistics of the first single nucleotide polymorphism in each original data pair and the life habit with the first database and the second database to determine N data pairs, wherein the single nucleotide polymorphism in each original data pair in the N data pairs is independent of each other, and the single nucleotide polymorphism in the N data pairs does not contain the single nucleotide polymorphism in the second database. Screening of single nucleotide polymorphisms in the first population is performed to subsequently determine whether lifestyle habits in the first population have causal relationships with the specified disease.
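As a non-limiting illustration, the matching against the first database and the second database can be sketched in Python as follows. The sketch assumes that both databases can be reduced to sets of SNP identifiers; the function name match_databases and all identifiers are hypothetical.

```python
def match_databases(selected, independent_snps, other_habit_snps):
    """Keep only SNPs that appear in the first database (mutually independent
    SNPs) and do not appear in the second database (SNPs significantly
    associated with other habits)."""
    return [
        r for r in selected
        if r["snp_id"] in independent_snps
        and r["snp_id"] not in other_habit_snps
    ]


# Hypothetical screened SNP-lifestyle association statistics.
selected = [
    {"snp_id": "snp_a", "p_value": 1e-9},
    {"snp_id": "snp_b", "p_value": 2e-9},
    {"snp_id": "snp_c", "p_value": 3e-9},
]
first_database = {"snp_a", "snp_c"}   # independent SNPs (assumed as a set)
second_database = {"snp_c"}           # SNPs associated with other habits

instruments = match_databases(selected, first_database, second_database)
instrument_ids = [r["snp_id"] for r in instruments]
```

Here snp_b is dropped because it is not in the set of independent SNPs, and snp_c is dropped because it is associated with another habit.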

In another possible design, a third database is obtained, the third database comprising single nucleotide polymorphisms in the international human genome haplotype map plan; randomly selecting a second statistic from N second statistic; matching the selected second statistic with a third database to obtain a third statistic, wherein the third statistic comprises at least one group of incidence relation statistic of single nucleotide polymorphism and life habit, and the single nucleotide polymorphism in the third statistic is the single nucleotide polymorphism in the international human genome haplotype map plan; and inputting the third statistic and genotype data of the plurality of users into the first model to obtain the polygenic risk score. According to the incidence relation statistics of single nucleotide polymorphism and specified diseases and genotype data of a plurality of users, the polygenic risk score is obtained, and the contribution of the polygenic risk score to disease prediction can be effectively calculated.
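As a non-limiting illustration, the computation of a polygenic risk score can be sketched in Python as follows. The application does not detail the first model; the sketch assumes the common weighted sum of effect-allele counts, and the function names, SNP identifiers, effect values and genotypes are hypothetical.

```python
def restrict_to_reference(effect_sizes, reference_snps):
    """Match a second statistic with the third database: keep only SNPs that
    are present in the reference SNP set."""
    return {snp: beta for snp, beta in effect_sizes.items() if snp in reference_snps}


def polygenic_risk_score(effect_sizes, dosages):
    """Weighted sum of effect-allele counts (0, 1 or 2) over the retained SNPs.
    SNPs missing from a user's genotype data contribute 0."""
    return sum(beta * dosages.get(snp, 0) for snp, beta in effect_sizes.items())


# Hypothetical SNP-disease effect values taken from one second statistic.
second_statistic = {"snp_a": 0.20, "snp_b": -0.10, "snp_c": 0.05, "snp_d": 0.30}
third_database = {"snp_a", "snp_b", "snp_c"}  # assumed reference SNP set

third_statistic = restrict_to_reference(second_statistic, third_database)

# Hypothetical genotype data: effect-allele counts per user.
genotypes = {
    "user_1": {"snp_a": 2, "snp_b": 1, "snp_c": 0, "snp_d": 2},
    "user_2": {"snp_a": 0, "snp_b": 2, "snp_c": 1, "snp_d": 0},
}
scores = {
    user: polygenic_risk_score(third_statistic, dosages)
    for user, dosages in genotypes.items()
}
```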

In another possible design, the disease time of the plurality of users is processed to obtain ordered disease time; training the model to be trained based on the ordered disease time and historical disease codes of a plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time; and inputting the ordered illness time and the historical disease codes of the plurality of users into a second model to obtain the historical disease information of the plurality of users. According to the disease time of a plurality of users and the historical disease codes of the plurality of users, historical disease information is obtained, and the contribution of the historical disease information to disease prediction can be effectively calculated.
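As a non-limiting illustration, the data processing that produces ordered disease times can be sketched in Python as follows. The sketch covers only the ordering step, not the second model; it assumes that each record is an (ICD-10 code, diagnosis date) pair and that time is expressed in years since the user's first recorded disease. The function name order_disease_times and the example records are hypothetical.

```python
from datetime import date


def order_disease_times(events):
    """Sort one user's non-empty list of (disease_code, diagnosis_date) records
    by date and express each time in years since the first record."""
    ordered = sorted(events, key=lambda event: event[1])
    first_date = ordered[0][1]
    return [(code, (when - first_date).days / 365.25) for code, when in ordered]


# Hypothetical disease history of one user, initially unordered.
user_events = [
    ("E11", date(2015, 6, 1)),
    ("I10", date(2010, 1, 1)),
    ("I25", date(2020, 1, 1)),
]

ordered_events = order_disease_times(user_events)
codes = [code for code, _ in ordered_events]
times = [t for _, t in ordered_events]
```

The resulting sequence of codes and non-decreasing times is the kind of input that a model predicting future events and their occurrence times would consume.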

In a second aspect, an embodiment of the present application provides a disease risk prediction model training device, including:

an acquisition module, configured to acquire first statistics, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease time of the plurality of users, age characteristics of the plurality of users and disease marks of the plurality of users, wherein the first statistics comprise a plurality of groups of association statistics of single nucleotide polymorphisms and life habits, each of the N second statistics comprises a plurality of groups of association statistics of single nucleotide polymorphisms and specified diseases, the disease marks are used for marking whether the users are ill or not, and N is an integer greater than 0;

a processing module, configured to determine living habits with causal relation with the specified diseases according to the first statistics and the N second statistics;

the processing module is also used for determining a polygene risk score according to the N second statistics and genotype data of a plurality of users;

the processing module is also used for determining the historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users;

the processing module is further used for training the disease risk score model to be trained according to living habits with causal relation with the appointed diseases, the polygene risk score, historical disease information of a plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users to obtain a disease risk score prediction model, and the disease risk score prediction model is used for predicting the disease risk score of the users.

In one possible design, the processing module is further configured to use lifestyle habits, polygenic risk scores, historical disease information of a plurality of users, and age characteristics of the plurality of users, which have causal relationships with the specified disease, as independent variables of the disease risk score model to be trained, and use disease identifiers of the plurality of users as dependent variables of the disease risk score model to be trained; based on independent variables and dependent variables, training the disease risk score model to be trained to obtain a disease risk score prediction model.

In another possible design, the processing module is further configured to obtain N data pairs according to the first statistics and the N second statistics; inputting N data pairs into a Mendelian randomization model to obtain N Mendelian randomization results; and determining living habits with causal relation with the specified diseases according to the N Mendelian randomization results.

In another possible design, the processing module is further configured to analyze N mendelian randomization results to obtain an analysis result; determining whether living habits in the first statistics have causal relation with the specified diseases according to the N Mendelian randomization results and the analysis results; if so, the lifestyle habit in the first statistics is determined as a lifestyle habit having a causal relationship with the specified disease.

In another possible design, the processing module is further configured to match the first statistics with N second statistics to obtain N original data pairs, where each of the N original data pairs includes a plurality of sets of statistics of associations between single nucleotide polymorphisms and lifestyles and a plurality of sets of statistics of associations between single nucleotide polymorphisms and specified diseases, and each set of statistics of associations between single nucleotide polymorphisms and lifestyles includes a minor allele frequency and a significance P value; and screening each of the N original data pairs according to the minor allele frequency and the significance P value to obtain N data pairs.

In another possible design, the processing module is further configured to determine whether a significance P value of the correlation statistic of each set of single nucleotide polymorphisms of each of the N raw data pairs with lifestyle is less than a first threshold and whether a minor allele frequency of the correlation statistic of each set of single nucleotide polymorphisms of each of the N raw data pairs with lifestyle is less than a second threshold; when the significance P value of the incidence relation statistic of the first single nucleotide polymorphism and the living habit of each of the N original data pairs is smaller than a first threshold value, and the minor allele frequency of the incidence relation statistic of the first single nucleotide polymorphism and the living habit of each of the N original data pairs is smaller than a second threshold value, the incidence relation statistic of the first single nucleotide polymorphism and the living habit of each of the N original data pairs is selected; and determining N data pairs according to the association relation statistics of the first single nucleotide polymorphism and the lifestyle habit in each original data pair.

In another possible design, the obtaining module is further configured to obtain a first database and a second database, where the first database includes a plurality of independent single nucleotide polymorphisms, the second database includes a plurality of sets of association statistics of single nucleotide polymorphisms with other habits, the association statistics of each set of single nucleotide polymorphisms with other habits includes a significance P value, and the significance P value of the association statistics of each set of single nucleotide polymorphisms with other habits is less than a third threshold; matching the incidence relation statistics of the first single nucleotide polymorphism in each original data pair and the life habit with the first database and the second database to determine N data pairs, wherein the single nucleotide polymorphism in each original data pair in the N data pairs is independent of each other, and the single nucleotide polymorphism in the N data pairs does not contain the single nucleotide polymorphism in the second database.

In another possible design, the obtaining module is further configured to obtain a third database comprising single nucleotide polymorphisms in the international human genome haplotype map plan.

In another possible design, the processing module is further configured to randomly select a second statistic from the N second statistics; matching the selected second statistic with a third database to obtain a third statistic, wherein the third statistic comprises at least one group of incidence relation statistic of single nucleotide polymorphism and life habit, and the single nucleotide polymorphism in the third statistic is the single nucleotide polymorphism in the international human genome haplotype map plan; and inputting the third statistic and genotype data of the plurality of users into the first model to obtain the polygenic risk score.

In another possible design, the processing module is further configured to perform data processing on the disease time of the plurality of users to obtain ordered disease time; training the model to be trained based on the ordered disease time and historical disease codes of a plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time; and inputting the ordered illness time and the historical disease codes of the plurality of users into a second model to obtain the historical disease information of the plurality of users.

In a third aspect, an embodiment of the present application provides a disease risk prediction model training system, where the disease risk prediction model training system includes a lifestyle characteristic screening module, a polygenic risk score calculation module, a historical disease information learning module, and a disease risk score learning module. The lifestyle characteristic screening module is used for determining lifestyle with causal relation with the appointed diseases according to the first statistics and the N second statistics; the polygene risk score calculation module is used for determining polygene risk scores according to the N second statistics and genotype data of a plurality of users; the historical disease information learning module is used for determining the historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users; the disease risk score learning module is used for training the disease risk score model to be trained according to living habits with causal relation with the appointed diseases, the polygene risk score, historical disease information of a plurality of users, age characteristics of the plurality of users and disease marks of the plurality of users, and obtaining a disease risk score prediction model. The complex heterogeneous data is modeled by integrating the multi-source data and utilizing the machine learning model, so that the prediction result of the disease risk score prediction model is more effective, accurate and reliable.

In a fourth aspect, embodiments of the present application provide a disease risk prediction model training system, including a processor, a memory, and a communication bus, where the memory is configured to store computer-executable instructions; the processor is configured to execute computer-executable instructions stored in the memory to cause the disease risk prediction model training system to perform the method of any of the first aspects; the communication bus is used for realizing connection communication between the processor and the memory.

In a fifth aspect, embodiments of the present application provide a disease risk prediction model training system, which may perform the method of the first aspect. The function of the disease risk prediction model training system can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The system may be software and/or hardware.

In a sixth aspect, embodiments of the present application provide a computer readable storage medium for storing a computer program which, when executed, causes the method according to any one of the first aspects to be implemented.

In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed, causes the method according to any one of the first aspects to be carried out.

Drawings

In order to more clearly describe the technical solutions in the embodiments or the background of the present application, the following description will describe the drawings that are required to be used in the embodiments or the background of the present application.

FIG. 1 is a schematic structural diagram of a disease risk prediction model training system according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a disease risk prediction model training method according to an embodiment of the present application;

FIG. 3 is a causal inference flow chart of lifestyle and specified diseases provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an attention-based encoder Hawkes (Transformer-Hawkes) model according to an embodiment of the present application;

FIG. 5 is a Manhattan plot of association statistics between single nucleotide polymorphisms and coronary heart disease in a European population, provided in an embodiment of the present application;

FIG. 6 is a Manhattan plot of association statistics between single nucleotide polymorphisms and type II diabetes in a European population, provided in an embodiment of the present application;

FIG. 7 is an AUROC evaluation graph of the model's prediction of coronary heart disease provided in an embodiment of the present application;

FIG. 8 is an AUROC evaluation graph of the model's prediction of type II diabetes provided in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a disease risk prediction model training device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

Some of the terms referred to in this application are described below to facilitate understanding by those skilled in the art.

1. Single nucleotide polymorphism (single nucleotide polymorphism, SNP) refers mainly to a sequence polymorphism of deoxyribonucleic acid (deoxyribonucleic acid, DNA) caused by variation of a single nucleotide at the genomic level.

2. The international human genome haplotype map plan (the International HapMap Project) is a collaborative project with participants from multiple countries. It aims to develop a haplotype map of the human genome that describes common patterns of human genetic variation, and to explore, for different ethnic groups, genes related to human health, disease, and individual differences in response to drugs and environmental factors.

3. The International Classification of Diseases (ICD) is a system that classifies diseases according to rules based on certain characteristics of the diseases and represents them with codes. The 10th revision of the International Statistical Classification of Diseases and Related Health Problems retains the abbreviation ICD and is referred to as ICD-10. ICD-10 introduced letters into the codes for the first time, changing the original purely numeric codes into alphanumeric codes.

4. The area under the receiver operating characteristic curve (AUROC) is a value between 0 and 1 that reflects the performance of a classifier through the size of the area between the receiver operating characteristic curve and the coordinate axis; the closer the AUROC value is to 1, the better the classifier separates positive and negative samples.
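As a non-limiting illustration of term 4, the AUROC can be computed in Python as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. This pairwise formulation is equivalent to the area under the curve; the function name auroc and the example labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative sample (ties count 0.5).
    Requires at least one positive and one negative sample."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical disease identifiers and predicted disease risk scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
value = auroc(labels, scores)  # 3 of the 4 positive-negative pairs are ordered correctly
```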

Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.

As shown in fig. 1, fig. 1 is a schematic structural diagram of a disease risk prediction model training system provided in an embodiment of the present application, where the disease risk prediction model training system includes a life habit feature screening module 101, a polygenic risk score calculating module 102, a historical disease information learning module 103, and a disease risk score learning module 104. Wherein the detailed description of each module is as follows.

A lifestyle characteristic screening module 101, configured to obtain a first statistics and N second statistics, where the first statistics includes a plurality of sets of association statistics of single nucleotide polymorphisms and lifestyles, each of the N second statistics includes a plurality of sets of association statistics of single nucleotide polymorphisms and specified diseases, and N is an integer greater than 0; obtaining N data pairs according to the first statistics and N second statistics; inputting N data pairs into a Mendelian randomization model to obtain N Mendelian randomization results; analyzing the N Mendelian randomization results to obtain analysis results; determining whether living habits in the first statistics have causal relation with the specified diseases according to the N Mendelian randomization results and the analysis results; if so, the lifestyle habit in the first statistics is determined as a lifestyle habit having a causal relationship with the specified disease.

A polygenic risk score calculation module 102, configured to obtain genotype data of a plurality of users and a third database, wherein the third database comprises single nucleotide polymorphisms in the international human genome haplotype map plan; randomly select a second statistic from the N second statistics; match the selected second statistic with the third database to obtain a third statistic, wherein the third statistic comprises at least one group of association statistics of single nucleotide polymorphisms and life habits, and the single nucleotide polymorphisms in the third statistic are single nucleotide polymorphisms in the international human genome haplotype map plan; and input the third statistic and the genotype data of the plurality of users into the first model to obtain the polygenic risk score.

The historical disease information learning module 103 is configured to obtain disease times of a plurality of users and historical disease codes of the plurality of users, and perform data processing on the disease times of the plurality of users to obtain ordered disease times; training the model to be trained based on the ordered disease time and historical disease codes of a plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time; and inputting the ordered illness time and the historical disease codes of the plurality of users into a second model to obtain the historical disease information of the plurality of users.

The disease risk score learning module 104 is configured to obtain the life habits having a causal relation with the specified diseases, the polygenic risk scores, the historical disease information of a plurality of users, the age characteristics of the plurality of users and the disease identifiers of the plurality of users, where the disease identifiers are used to mark whether the users are ill; take the life habits having a causal relation with the specified diseases, the polygenic risk scores, the historical disease information of the plurality of users and the age characteristics of the plurality of users as independent variables of a disease risk score model to be trained, and take the disease identifiers of the plurality of users as dependent variables of the disease risk score model to be trained; and train the disease risk score model to be trained based on the independent variables and the dependent variables to obtain a disease risk score prediction model.

It should be noted that, the disease risk prediction model training system is a non-end-to-end model training system, and takes the output result of the life habit feature screening module 101, the output result of the polygenic risk score calculating module 102 and the output result of the historical disease information learning module 103 as the input of the disease risk score learning module 104, and improves the expression capability of the data by combining various types of data, so that the result of the disease risk score prediction model is more effective, accurate and reliable. In addition, the disease risk prediction model training system may be a system that interacts with the user, and this system may be a software system, a hardware system, or a system that combines software and hardware, which is not specifically limited in this application. It should be further noted that fig. 1 is only an exemplary structural diagram showing a disease risk prediction model training system, and in practical application, the disease risk prediction model training system of fig. 1 may be transformed correspondingly according to specific situations.

As shown in fig. 2, fig. 2 is a flow chart of a disease risk prediction model training method according to an embodiment of the present application, where the method includes, but is not limited to, the following steps:

Step S201: obtaining a first statistic, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease times of the plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users.

The first statistics comprise a plurality of groups of association statistics of single nucleotide polymorphisms and lifestyle habits, each of the N second statistics comprises a plurality of groups of association statistics of single nucleotide polymorphisms and a specified disease, the disease identification is used for marking whether a user is ill, and N is an integer greater than 0; each group of association statistics of single nucleotide polymorphisms and lifestyle habits includes the single nucleotide polymorphism number, effect allele, non-effect allele, effect value, standard deviation, minor allele frequency and significance P value; each group of association statistics of single nucleotide polymorphisms and the specified disease includes the single nucleotide polymorphism number, effect allele, non-effect allele, effect value, standard deviation, minor allele frequency and significance P value.

In one implementation, the first statistics, the N second statistics, the genotype data of the plurality of users, the historical disease codes of the plurality of users, the disease times of the plurality of users, the age characteristics of the plurality of users and the disease identifications of the plurality of users are received as uploaded by a user. Specifically, the disease risk prediction model training system provides a data upload page that contains a data upload control, and by clicking this control the user can upload the prepared first statistics, N second statistics, genotype data, historical disease codes, disease times, age characteristics and disease identifications of the plurality of users to the disease risk prediction model training system.

In another implementation, the first statistics, the N second statistics, the genotype data of the plurality of users, the historical disease codes of the plurality of users, the disease times of the plurality of users, the age characteristics of the plurality of users and the disease identifications of the plurality of users are obtained from a database (either a local database or a database on another device). Specifically, the database is searched for sample data that meet the requirements, and the above items are obtained from the matching sample data.

In an embodiment of the present application, the historical disease codes of the plurality of users include the ICD-10 codes (first three characters) of the historical diseases.

Step S202: determining lifestyle habits with causal relationships to the specified disease according to the first statistics and the N second statistics; determining a polygenic risk score according to the N second statistics and genotype data of the plurality of users; and determining the historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users.

As shown in fig. 3, fig. 3 is a flow chart of causal inference between lifestyle habits and a specified disease according to the present invention. Specifically, the first statistics and the N second statistics are matched to obtain N original data pairs; each of the N original data pairs is screened to obtain N data pairs; the N data pairs are input into a Mendelian randomization model for causal relationship analysis to obtain N Mendelian randomization results; meta-analysis is performed on the N Mendelian randomization results to obtain an analysis result; and whether the lifestyle habit in the first statistics has a causal relationship with the specified disease is determined according to the N Mendelian randomization results and the analysis result. If so, the lifestyle habit in the first statistics is determined to be a lifestyle habit having a causal relationship with the specified disease.

Wherein each of the N raw data pairs comprises a plurality of sets of statistics of associations of single nucleotide polymorphisms with lifestyles and a plurality of sets of statistics of associations of single nucleotide polymorphisms with specified diseases, each set of statistics of associations of single nucleotide polymorphisms with lifestyles comprising minor allele frequencies and significance P values.

In the embodiment of the application, for each of the N original data pairs, it is determined whether the significance P value of each group of association statistics of single nucleotide polymorphisms and lifestyle habits is smaller than a first threshold, and whether the minor allele frequency of that group is smaller than a second threshold; when both the significance P value and the minor allele frequency of the association statistics of a first single nucleotide polymorphism and the lifestyle habit in an original data pair are smaller than the respective thresholds, that group of association statistics is selected; a first database and a second database are then acquired, and the selected association statistics in each original data pair are matched against the first database and the second database to determine the N data pairs, wherein the single nucleotide polymorphisms in each of the N data pairs are independent of one another, and the N data pairs do not contain any single nucleotide polymorphism in the second database.

Wherein the first database comprises a plurality of single nucleotide polymorphisms that are independent of each other; the second database comprises a plurality of groups of association statistics of single nucleotide polymorphisms and other lifestyle habits, each group of which comprises a significance P value that is smaller than a third threshold.

The first threshold, the second threshold and the third threshold are all empirical parameters. The first threshold is used to determine whether each single nucleotide polymorphism in each of the N original data pairs is strongly associated with the lifestyle habit in the first statistics; the third threshold is used to determine whether each single nucleotide polymorphism in the second database is strongly associated with other lifestyle habits.

Optionally, the second threshold in the embodiments of the present application is equal to 0.01.

It should be noted that the association statistics of the first single nucleotide polymorphism and the lifestyle habit in the embodiment of the present application may denote any group of association statistics of single nucleotide polymorphisms and lifestyle habits in each of the N original data pairs, in no particular order. In addition, only the association statistics of single nucleotide polymorphisms and lifestyle habits in each of the N original data pairs are screened; the association statistics of single nucleotide polymorphisms and the specified disease in each of the N original data pairs are kept unchanged.

Through the data screening, each of the N data pairs simultaneously satisfies the following three conditions: (1) Each single nucleotide polymorphism has a strong correlation relationship with the lifestyle habit in the first statistics only; (2) all single nucleotide polymorphisms are independent from each other; (3) The frequency of minor alleles corresponding to each single nucleotide polymorphism is less than a second threshold.
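The three screening conditions can be sketched as a simple filter. This is a minimal illustration rather than the implementation of the application: the record type SnpAssoc, the function name screen_snps and the default first threshold of 5e-8 are hypothetical, the first database is modelled as a set of mutually independent SNP numbers, and the second database as a set of SNP numbers associated with other lifestyle habits. The second threshold of 0.01 and the direction of the frequency comparison follow the text.

```python
from dataclasses import dataclass


@dataclass
class SnpAssoc:
    """One group of SNP-lifestyle association statistics (fields reduced for brevity)."""
    rsid: str       # single nucleotide polymorphism number
    p_value: float  # significance P value
    maf: float      # minor allele frequency


def screen_snps(assocs, independent_snps, other_habit_snps,
                first_threshold=5e-8, second_threshold=0.01):
    """Keep the association statistics that satisfy all screening conditions.

    first_threshold is a hypothetical default; the text only calls it empirical.
    """
    kept = []
    for assoc in assocs:
        if assoc.p_value >= first_threshold:   # not strongly associated
            continue
        if assoc.maf >= second_threshold:      # frequency condition as stated in the text
            continue
        if assoc.rsid not in independent_snps:  # not in the first database
            continue
        if assoc.rsid in other_habit_snps:      # associated with another habit
            continue
        kept.append(assoc)
    return kept


candidates = [
    SnpAssoc("rs1", 1e-9, 0.005),
    SnpAssoc("rs2", 1e-3, 0.005),  # fails the P value condition
    SnpAssoc("rs3", 1e-9, 0.200),  # fails the frequency condition
    SnpAssoc("rs4", 1e-9, 0.005),  # present in the second database
    SnpAssoc("rs5", 1e-9, 0.005),  # absent from the first database
]
kept = screen_snps(candidates, {"rs1", "rs2", "rs3", "rs4"}, {"rs4"})
```

Only the association statistics of the lifestyle side are filtered here, in line with the note that the disease-side statistics are left unchanged.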

Optionally, the first statistics may be first screened to obtain a screened first statistics, and then the screened first statistics and the N second statistics are matched to obtain N data pairs.

In one embodiment, the lifestyle habit in the first statistics is determined to have a causal relationship with the specified disease, and is determined as a lifestyle habit having a causal relationship with the specified disease, if the N Mendelian randomization results and the analysis result simultaneously satisfy the following five conditions: (1) the significance P value of one of the N Mendelian randomization results is less than 0.05; (2) the fixed-effect 95% confidence interval of the analysis result does not contain 0; (3) the fixed-effect significance P value of the analysis result is less than 0.05; (4) the I-squared (I²) statistic of the analysis result is less than 0.05; (5) the equivalence P value of the analysis result is greater than 0.05.
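The five conditions amount to a conjunction of simple comparisons, sketched below. The function and parameter names are hypothetical, the inputs are assumed to have been computed beforehand by the Mendelian randomization and meta-analysis steps, and condition (1) is read here as "at least one of the N results", which is an interpretation of the text.

```python
def has_causal_relationship(mr_p_values, fixed_effect_ci, fixed_effect_p,
                            i_squared, equivalence_p, alpha=0.05):
    """Return True when all five conditions listed in the text hold.

    mr_p_values      -- significance P values of the N Mendelian randomization results
    fixed_effect_ci  -- (lower, upper) bounds of the fixed-effect 95% confidence interval
    fixed_effect_p   -- fixed-effect significance P value of the meta-analysis
    i_squared        -- I-squared statistic of the meta-analysis
    equivalence_p    -- equivalence P value of the meta-analysis
    """
    lower, upper = fixed_effect_ci
    conditions = [
        any(p < alpha for p in mr_p_values),  # (1) read as "at least one" result
        not (lower <= 0.0 <= upper),          # (2) interval excludes 0
        fixed_effect_p < alpha,               # (3)
        i_squared < alpha,                    # (4) threshold as given in the text
        equivalence_p > alpha,                # (5)
    ]
    return all(conditions)


supported = has_causal_relationship([0.20, 0.01], (0.10, 0.40), 0.001, 0.0, 0.60)
rejected = has_causal_relationship([0.20, 0.01], (-0.10, 0.40), 0.001, 0.0, 0.60)
```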

In the embodiment of the application, a third database is obtained, one second statistic is randomly selected from the N second statistics, the selected second statistic is matched with the third database, and the single nucleotide polymorphisms of the selected second statistic that are present in the third database are retained to obtain the third statistic; the third statistic and the genotype data of the plurality of users are input into the first model, a genetic risk score corresponding to each single nucleotide polymorphism in the third statistic is determined according to the third statistic and the genotype data of the plurality of users, and the polygenic risk score is determined according to the genetic risk scores corresponding to the single nucleotide polymorphisms in the third statistic.

Wherein the third database comprises single nucleotide polymorphisms in the International HapMap Project, the third statistic comprises at least one group of association statistics of single nucleotide polymorphisms and lifestyle habits, and the single nucleotide polymorphisms in the third statistic are all single nucleotide polymorphisms in the International HapMap Project; the first model is used to calculate the polygenic risk score, for example, the first model may be an LDpred2 model.
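How per-polymorphism scores combine into a polygenic risk score can be illustrated with the basic additive form, in which each score is the effect value multiplied by the user's effect-allele count. This sketch is not LDpred2, which additionally re-estimates the effect values; the function name and the dictionary-based inputs are hypothetical.

```python
def polygenic_risk_score(effects, dosages):
    """Additive polygenic risk score for one user.

    effects -- dict mapping SNP number to effect value (taken as given here)
    dosages -- dict mapping SNP number to the user's effect-allele count (0, 1 or 2)
    Returns (total score, per-SNP scores). SNPs missing from the genotype are skipped.
    """
    per_snp = {
        rsid: beta * dosages[rsid]
        for rsid, beta in effects.items()
        if rsid in dosages
    }
    return sum(per_snp.values()), per_snp


total, per_snp = polygenic_risk_score(
    {"rs1": 0.5, "rs2": -0.25},
    {"rs1": 2, "rs2": 1, "rs9": 1},  # rs9 has no effect value and is ignored
)
```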

In the embodiment of the application, data processing is carried out on the illness time of a plurality of users to obtain ordered illness time; training the model to be trained based on the ordered disease time and historical disease codes of a plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time; and inputting the ordered illness time and the historical disease codes of the plurality of users into a second model to obtain the historical disease information of the plurality of users.
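The data processing of disease times can be sketched as follows, assuming the rule detailed in the next paragraph: the first disease time becomes 0 and every later one becomes the number of days since the first. The function names, the (date, code) record layout and the choice to number codes from 1 are assumptions made for illustration.

```python
from datetime import date


def order_disease_times(records):
    """Convert one user's (diagnosis date, ICD-10 code) records to ordered day offsets.

    The earliest diagnosis gets time 0; later ones get the number of days
    elapsed since the earliest diagnosis. Output is sorted in ascending order.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda record: record[0])
    first_date = ordered[0][0]
    return [((day - first_date).days, code) for day, code in ordered]


def build_code_numbers(codes):
    """Assign a number to each distinct ICD-10 code, truncated to three characters.

    Numbering from 1 is an assumption (0 is left free, e.g. for padding).
    """
    numbers = {}
    for code in codes:
        numbers.setdefault(code[:3], len(numbers) + 1)
    return numbers


history = [(date(2020, 3, 1), "I21"), (date(2020, 1, 1), "E11"), (date(2020, 1, 31), "I10")]
ordered_history = order_disease_times(history)
code_numbers = build_code_numbers(code for _, code in ordered_history)
```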

Specifically, the disease times of each of the plurality of users are processed separately: the first disease time of each user is set to 0, each subsequent disease time is set to the number of days between the subsequent disease date and the first disease date, and the disease times of each user are then sorted in ascending order to obtain the ordered disease times of the plurality of users; the ICD-10 codes of the historical diseases of the plurality of users are converted into the numbers corresponding to the ICD-10 codes; a Transformer-Hawkes model to be trained is trained by gradient descent, minimizing a loss function, based on the ordered disease times of the plurality of users and the ICD-10 codes of their historical diseases; finally, the ordered disease times and the ICD-10 codes of the historical diseases of the plurality of users are input into the trained Transformer-Hawkes model to obtain the historical disease information. Wherein the loss function satisfies:

wherein λ represents the conditional intensity function, H_j represents the j-th event of the event sequence, t_j represents the time corresponding to the j-th event, and L represents the length of the event sequence.
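For orientation, the negative log-likelihood commonly minimized for a temporal point process can be written with the symbols defined above. This is the general textbook form and only an assumption about the intended loss; the symbol H_t, denoting the history of events before time t, is introduced here and does not appear in the text.

```latex
\mathcal{L}
  = -\sum_{j=1}^{L} \log \lambda\left(t_j \mid H_j\right)
    + \int_{t_1}^{t_L} \lambda\left(t \mid H_t\right)\,\mathrm{d}t
```

The first term rewards high intensity at the times when events actually occurred, and the integral term penalizes high intensity over the intervals in which no event occurred.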

As shown in fig. 4, fig. 4 is a structural diagram of an attention-based Transformer-Hawkes model according to an embodiment of the present application. The Transformer-Hawkes model mainly comprises the following three structures: (1) coding layer: a novel position code containing time information is used to help the self-attention layer learn the relationships between events occurring at different times, and the self-attention layer uses a masked multi-head attention model, which better matches the reality that an event occurring at a given point in time is affected only by previous events and not by events occurring later; (2) event prediction layer: used for predicting the probability that an event will occur at a point in time; (3) time prediction layer: used for predicting the time of occurrence of the next event.
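Two ideas in the coding layer can be sketched in a few lines: a position code computed from the event time rather than from the event index, and an attention mask that hides later events. The sinusoidal form below is an assumption, since the text does not give the formula of the position code, and both function names are hypothetical.

```python
import math


def temporal_encoding(t, dim):
    """Sinusoidal code of an event time t (assumed form, not taken from the text).

    Even positions use sine and odd positions use cosine, with wavelengths
    growing geometrically across the dim positions.
    """
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def causal_mask(length):
    """mask[i][j] is True when event i may attend to event j, i.e. j is not later than i."""
    return [[j <= i for j in range(length)] for i in range(length)]


code_at_zero = temporal_encoding(0.0, 4)
mask = causal_mask(3)
```

Because the code depends on the time value, two events separated by 30 days receive different codes than two events separated by 300 days, even when they are adjacent in the sequence.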

The historical disease information is the output of the coding layer of the Transformer-Hawkes model.

Step S203: training a disease risk score model to be trained according to living habits with causal relation with the specified diseases, the polygenic risk scores, historical disease information of a plurality of users, age characteristics of the plurality of users and disease marks of the plurality of users to obtain a disease risk score prediction model.

Specifically, living habits, polygenic risk scores, historical disease information of a plurality of users and age characteristics of the plurality of users which have causal relation with the specified diseases are taken as independent variables of a disease risk score model to be trained, and disease marks of the plurality of users are taken as dependent variables of the disease risk score model to be trained; based on independent variables and dependent variables, training the disease risk score model to be trained to obtain a disease risk score prediction model.

In the embodiment of the application, the lifestyle habits having a causal relationship with the specified disease, the polygenic risk scores, the historical disease information of the plurality of users and the age characteristics of the plurality of users are combined to obtain the independent variables of the disease risk score model to be trained, whether each of the plurality of users is ill is used as the dependent variable of the disease risk score model to be trained, a stepwise logistic regression model is used to fit the data, and the output of the model is the risk prediction score of the disease.
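The fitting step can be illustrated with a plain logistic regression trained by batch gradient descent. This sketch deliberately omits the stepwise variable selection named in the text, and the function names, learning rate, epoch count and toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def risk_score(weights, bias, x):
    """Predicted probability of disease for one feature vector."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy data: the first feature separates ill (1) from not ill (0) users.
X = [[-2.0, 0.1], [-1.0, -0.3], [1.0, 0.2], [2.0, -0.1]]
y = [0, 0, 1, 1]
weights, bias = fit_logistic(X, y)
```

The returned probability plays the role of the risk prediction score; a stepwise procedure would additionally add or remove features according to a selection criterion.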

Data combination here means feature concatenation of matrices. For example, matrix A contains the features of 5 users with 10 features per user, so the dimension of matrix A is 5×10; matrix B contains the features of the same 5 users with 8 features per user, so the dimension of matrix B is 5×8; combining the data of matrix A and matrix B yields matrix C with dimension 5×18, which contains the features of the 5 users with 18 features per user.
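The 5×10 and 5×8 example can be reproduced directly. The sketch below uses nested lists and a hypothetical helper name; rows are assumed to refer to the same users in the same order in both matrices.

```python
def concat_features(left, right):
    """Concatenate two feature matrices row by row (one row per user)."""
    if len(left) != len(right):
        raise ValueError("both matrices must describe the same number of users")
    return [row_left + row_right for row_left, row_right in zip(left, right)]


matrix_a = [[0.0] * 10 for _ in range(5)]  # 5 users, 10 features each
matrix_b = [[1.0] * 8 for _ in range(5)]   # 5 users, 8 features each
matrix_c = concat_features(matrix_a, matrix_b)
```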

In one embodiment, the disease risk score model to be trained is trained on data of European white participants in UK Biobank to obtain a disease risk score prediction model for coronary heart disease and a disease risk score prediction model for type II diabetes. The specific process comprises the following steps: acquiring the first statistics, genotype data, disease times, historical disease codes, age characteristics and disease identifications of European white participants in UK Biobank, together with 4 second statistics from Europeans outside UK Biobank, wherein the first statistics comprise association statistics of single nucleotide polymorphisms and lifestyle habits of European white participants in UK Biobank, and the 4 second statistics comprise 2 sets of association statistics of single nucleotide polymorphisms and coronary heart disease and 2 sets of association statistics of single nucleotide polymorphisms and type II diabetes, each from Europeans outside UK Biobank; determining the lifestyle features having a causal relationship with coronary heart disease and the lifestyle features having a causal relationship with type II diabetes according to the first statistics and the 4 second statistics; determining the patients with the specified diseases according to the genotype data, disease times and historical disease codes, and dividing them into a training set population and a verification set population; determining a polygenic risk score for coronary heart disease and a polygenic risk score for type II diabetes based on the genotype data; training a Transformer-Hawkes model to be trained on the historical disease diagnosis data of follow-up participants in UK Biobank, and inputting the same historical disease diagnosis data into the trained Transformer-Hawkes model to obtain historical disease information; finally, for the training set population, fitting a stepwise logistic regression model to the lifestyle features having a causal relationship with coronary heart disease, the polygenic risk score for coronary heart disease and the historical disease information to obtain the disease risk score prediction model for coronary heart disease, and fitting a stepwise logistic regression model to the lifestyle features having a causal relationship with type II diabetes, the polygenic risk score for type II diabetes and the historical disease information to obtain the disease risk score prediction model for type II diabetes; and evaluating the prediction results of the two models on the feature data of the verification set population, comparing different combinations of input data types to obtain the AUROC scores for predicting coronary heart disease and the AUROC scores for predicting type II diabetes.

Wherein the first statistics comprise data on 268 different lifestyle habits; the historical disease diagnosis data of follow-up participants in UK Biobank comprise the disease times and historical disease codes of each follow-up participant; and the follow-up participants in UK Biobank comprise both patients with the disease and participants without the disease.

As shown in fig. 5, fig. 5 is a Manhattan plot of the association statistics between single nucleotide polymorphisms and coronary heart disease in Europeans, provided in the embodiment of the present application; as shown in fig. 6, fig. 6 is a Manhattan plot of the association statistics between single nucleotide polymorphisms and type II diabetes in Europeans, provided in the embodiment of the present application. In fig. 5 and fig. 6, each dot represents a single nucleotide polymorphism, the horizontal axis represents the chromosome on which the single nucleotide polymorphism is located, including chromosomes 1 to 20, and the vertical axis represents the negative logarithm of the significance P value of each single nucleotide polymorphism; because the negative logarithm is taken, the smaller the P value, the higher the dot.

In this example, the data of European white participants in UK Biobank were preprocessed, and the data of lifestyle habits causally related to coronary heart disease and to type II diabetes were extracted; the example involved 442591 European white participants, none of whom are related to one another. The specific process comprises the following steps: first, filtering out uninformative lifestyle habits recorded by UK Biobank, such as whether alcohol was drunk yesterday and whether cheese was eaten yesterday; second, filtering out lifestyle habits with more than 44259 missing values and processing the remaining missing values, filling with the median if the feature is numerical and with the most frequent value if it is categorical; third, applying one-hot encoding to unordered categorical data with more than two categories, finally retaining 133 lifestyle features; finally, matching the lifestyle habits having a causal relationship with coronary heart disease against the 133 lifestyle features to obtain 65 lifestyle features causally related to coronary heart disease, such as drinking frequency, body fat percentage, cereal intake frequency, current smoking frequency and sleep duration, and matching the lifestyle habits having a causal relationship with type II diabetes against the 133 lifestyle features to obtain 45 lifestyle features causally related to type II diabetes, such as alcohol intake compared with ten years ago, dried fruit intake frequency, current smoking status and insomnia.
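The missing-value handling and category encoding in this preprocessing can be sketched with the standard library. Missing entries are represented as None, encoding unordered categories as indicator columns is the one-hot scheme, and the helper names are hypothetical; the missing-value limit of 44259 is the one given in the text.

```python
from collections import Counter
from statistics import median


def too_many_missing(values, max_missing=44259):
    """True when a lifestyle feature has more missing values than allowed."""
    return sum(v is None for v in values) > max_missing


def fill_missing(values, kind):
    """Fill None entries with the median (numeric) or the most frequent value (categorical)."""
    present = [v for v in values if v is not None]
    if kind == "numeric":
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode an unordered categorical feature as one indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


sleep_hours = fill_missing([1.0, None, 3.0, 10.0], "numeric")
smoking = fill_missing(["never", None, "current", "never"], "categorical")
encoded, categories = one_hot(["x", "y", "z", "x"])
```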

In this example, the ICD-10 codes of the specified disease are matched against the historical disease diagnosis data of follow-up participants in UK Biobank; then, the code of the specified disease in the UK Biobank self-reported non-cancer illness data is matched against the self-reported non-cancer illness data in UK Biobank; patients without genotype data and historical diagnosis information are filtered out, leaving 9727 European white patients with coronary heart disease and 2643 European white patients with type II diabetes; taking the time of enrollment in the UK Biobank study as the demarcation point, patients with self-reported disease and patients whose diagnosis time is before the demarcation point are assigned to the training set population, and patients whose diagnosis time is after the demarcation point are assigned to the verification set population; finally, the same number of non-diseased European white participants as diseased participants are randomly selected and added to the training set and the verification set.
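The assignment rule and the random selection of non-diseased participants can be sketched as below. The function names and the fixed random seed are hypothetical, and placing a diagnosis made exactly on the demarcation date in the verification set is an assumption, since the text does not cover that case.

```python
import random
from datetime import date


def assign_split(diagnosis_date, enrollment_date, self_reported):
    """Assign one patient to the training or verification population.

    Self-reported patients and patients diagnosed before enrollment go to
    training; all others go to verification (including a diagnosis on the
    enrollment date itself, which is an assumption).
    """
    if self_reported or diagnosis_date < enrollment_date:
        return "train"
    return "validation"


def sample_controls(non_diseased_ids, n_cases, seed=0):
    """Randomly pick as many non-diseased participants as there are patients."""
    return random.Random(seed).sample(list(non_diseased_ids), n_cases)


enrolled = date(2008, 1, 1)
early = assign_split(date(2005, 6, 1), enrolled, self_reported=False)
late = assign_split(date(2010, 6, 1), enrolled, self_reported=False)
reported = assign_split(None, enrolled, self_reported=True)
controls = sample_controls(range(100), n_cases=10)
```

Splitting by time rather than at random means the verification population contains only diagnoses made after enrollment, so the evaluation resembles prospective prediction.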

Wherein the ICD-10 codes of coronary heart disease are I21, I22, I23, I241 and I252; the ICD-10 code of type II diabetes is E11; the code of coronary heart disease in the UK Biobank self-reported non-cancer illness data is 1075; and the code of type II diabetes in the UK Biobank self-reported non-cancer illness data is 1223.

In this embodiment, the historical disease diagnosis data of follow-up participants in UK Biobank are processed: if a participant has been diagnosed with the specified disease, the historical disease diagnosis information preceding the specified disease is extracted; if not, no additional processing is required; the extracted diagnosis data are then input into the Transformer-Hawkes model, and the coding-layer output at the last time node of the diagnosis data is taken as the historical disease information of the user.

In this embodiment, the polygenic risk scores and the lifestyle features of continuous data type are standardized; the age features and lifestyle features are then combined to obtain a first matrix; principal component analysis is used to reduce the dimension of the first matrix to obtain a second matrix with 32 dimensions; the second matrix is combined with the polygenic risk scores to obtain a third matrix; and the third matrix is combined with the historical disease information to obtain a fourth matrix, which is the final training data.
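The standardization step for a continuous feature can be sketched as a z-score transform. The text does not say which standardization is used, so the z-score with the population standard deviation is an assumption, as is the function name; the principal component analysis step is not shown and would normally rely on a numerical library.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score one continuous feature: subtract the mean, divide by the standard deviation.

    A constant column is mapped to zeros to avoid division by zero.
    """
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumed choice)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


scaled = standardize([1.0, 2.0, 3.0])
constant = standardize([4.0, 4.0, 4.0])
```

Standardizing before principal component analysis keeps features measured on large numeric scales from dominating the leading components.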

As shown in fig. 7, fig. 7 is an AUROC evaluation chart for model prediction of coronary heart disease according to an embodiment of the present application, wherein the AUROC evaluation chart for model prediction of coronary heart disease includes a polygenic risk score prediction curve 701, a combined polygenic risk score and lifestyle characteristic prediction curve 702, a combined polygenic risk score and historical disease information prediction curve 703, and a combined polygenic risk score, lifestyle characteristic and historical disease information prediction curve 704. Specifically, the polygenic risk score predicts the AUROC score of coronary heart disease to be equal to 0.61, the polygenic risk score and the lifestyle characteristic are combined to predict the AUROC score of coronary heart disease to be equal to 0.69, the polygenic risk score and the historical disease information are combined to predict the AUROC score of coronary heart disease to be equal to 0.77, and the polygenic risk score, the lifestyle characteristic and the historical disease information are combined to predict the AUROC score of coronary heart disease to be equal to 0.81.

As shown in fig. 8, fig. 8 is an AUROC evaluation chart for model prediction of type II diabetes provided in the embodiment of the present application, wherein the chart includes a polygenic risk score prediction curve 801, a combined polygenic risk score and lifestyle feature prediction curve 802, a combined polygenic risk score and historical disease information prediction curve 803, and a combined polygenic risk score, lifestyle feature and historical disease information prediction curve 804. Specifically, the AUROC score for predicting type II diabetes is 0.71 using the polygenic risk score alone, 0.84 using the polygenic risk score combined with lifestyle features, 0.71 using the polygenic risk score combined with historical disease information, and 0.86 using the polygenic risk score combined with lifestyle features and historical disease information. It can be seen that the disease risk prediction model combining the polygenic risk score, lifestyle features and historical disease information has the best predictive ability, reaching AUROC scores of 0.81 for coronary heart disease and 0.86 for type II diabetes.
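The AUROC scores reported above can be interpreted as the probability that a randomly chosen diseased participant receives a higher risk score than a randomly chosen non-diseased participant. The sketch below computes the score from that pairwise definition; the function name and the toy data are hypothetical, and the quadratic loop is meant for illustration rather than for populations of this size.

```python
def auroc(labels, scores):
    """Area under the ROC curve from pairwise comparisons; ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (diseased, non-diseased) pairs are ranked correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.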

By adopting the embodiment of the application, through integrating and modeling by combining life habits with causal relation with the appointed diseases, polygenic risk scores, historical disease information of a plurality of users, age characteristics of a plurality of users and disease marks of a plurality of users, the prediction result of the disease risk score prediction model is more effective, accurate and reliable, the contribution of each part of information to disease prediction can be effectively calculated, and a foundation is laid for improving life habits and other factors in a targeted manner and reducing disease risks.

As shown in fig. 9, fig. 9 is a schematic structural diagram of a disease risk prediction model training device provided in an embodiment of the present application, where the disease risk prediction model training device includes an acquisition module 901 and a processing module 902. The modules are described in detail as follows.

Acquisition module 901: configured to obtain a first statistic, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease times of the plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users.

The first statistics comprise a plurality of groups of association relation statistics of single nucleotide polymorphisms and life habits, each second statistic in the N second statistics comprises a plurality of groups of association relation statistics of single nucleotide polymorphisms and specified diseases, the disease mark is used for marking whether a user is ill or not, and N is an integer larger than 0.

Optionally, the acquiring module 901 is further configured to acquire the first database, the second database, and the third database.

Wherein the first database comprises a plurality of single nucleotide polymorphisms that are independent of each other; the second database comprises a plurality of groups of association statistics of single nucleotide polymorphisms and other lifestyle habits, each group of which comprises a significance P value that is smaller than a third threshold; the third database comprises single nucleotide polymorphisms in the International HapMap Project.

Processing module 902: configured to determine lifestyle habits having a causal relationship with the specified disease according to the first statistics and the N second statistics.

Processing module 902: further configured to determine a polygenic risk score according to the N second statistics and the genotype data of the plurality of users.

Processing module 902: further configured to determine the historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease times of the plurality of users.

Processing module 902: further configured to train the disease risk score model to be trained according to the lifestyle habits having a causal relationship with the specified disease, the polygenic risk score, the historical disease information of the plurality of users, the age characteristics of the plurality of users and the disease identifications of the plurality of users to obtain the disease risk score prediction model, wherein the disease risk score prediction model is used for predicting the disease risk score of a user.

Optionally, the processing module 902 is further configured to use the lifestyle habits causally related to the specified disease, the polygenic risk score, the historical disease information of the plurality of users, and the age characteristics of the plurality of users as independent variables of the disease risk score model to be trained, and to use the disease identifications of the plurality of users as the dependent variable of the disease risk score model to be trained; and to train the disease risk score model to be trained based on the independent variables and the dependent variable to obtain the disease risk score prediction model.
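As an illustration of the independent/dependent variable arrangement described above, the following is a minimal sketch only. It assumes a logistic regression fitted by batch gradient descent; the embodiment does not state here which model family is used. The function names `train_risk_model` and `predict_risk`, the feature order, and the toy values are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_risk_model(features, labels, lr=0.1, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent.

    features: rows of independent variables, here assumed to be
              [lifestyle habit, polygenic risk score, historical disease info, age]
    labels:   0/1 disease identifications (the dependent variable)
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(features, labels):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            err = p - y  # gradient of the log-loss with respect to the logit
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_risk(weights, bias, row):
    # Predicted probability, used here as a stand-in for the disease risk score.
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

# Toy, already-scaled data (assumed values for illustration only).
X = [[1, 0.9, 1, 0.70], [1, 0.8, 1, 0.60], [1, 0.7, 0, 0.65],
     [0, 0.1, 0, 0.30], [0, 0.2, 0, 0.40], [0, 0.3, 0, 0.35]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_risk_model(X, y)
```

In practice the features would need scaling and the model would be evaluated on held-out users; both are omitted from this sketch.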

Optionally, the processing module 902 is further configured to obtain N data pairs according to the first statistics and the N second statistics; inputting N data pairs into a Mendelian randomization model to obtain N Mendelian randomization results; and determining living habits with causal relation with the specified diseases according to the N Mendelian randomization results.

Optionally, the processing module 902 is further configured to analyze N mendelian randomization results to obtain an analysis result; determining whether living habits in the first statistics have causal relation with the specified diseases according to the N Mendelian randomization results and the analysis results; if so, the lifestyle habit in the first statistics is determined as a lifestyle habit having a causal relationship with the specified disease.
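The Mendelian randomization step and the subsequent decision can be sketched as follows. This is a hedged example that assumes an inverse-variance weighted (IVW) estimator and a simple "all N results significant with the same sign" rule; the Mendelian randomization model and the analysis actually used are not specified in this passage. The names `ivw_estimate`, `is_causal` and the significance level `alpha` are hypothetical.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted causal estimate for one data pair.

    beta_exposure: SNP effects on the lifestyle habit
    beta_outcome:  SNP effects on the specified disease
    se_outcome:    standard errors of the SNP-disease effects
    """
    weights = [1.0 / (se ** 2) for se in se_outcome]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    beta = num / den
    se = math.sqrt(1.0 / den)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, p_value

def is_causal(results, alpha=0.05):
    """Assumed decision rule: every result is significant and all share one sign."""
    signs = {beta > 0 for beta, _, _ in results}
    return len(signs) == 1 and all(p < alpha for _, _, p in results)

# One toy data pair in which every SNP implies a causal ratio of 0.5.
result = ivw_estimate([0.10, 0.20, 0.15], [0.05, 0.10, 0.075], [0.01, 0.01, 0.01])
causal = is_causal([result])
```

A fuller analysis would normally add heterogeneity and pleiotropy checks before declaring a causal relation.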

Optionally, the processing module 902 is further configured to match the first statistics with N second statistics to obtain N original data pairs, where each of the N original data pairs includes a plurality of sets of association statistics of single nucleotide polymorphisms with lifestyles and a plurality of sets of association statistics of single nucleotide polymorphisms with specified diseases, and each set of association statistics of single nucleotide polymorphisms with lifestyles includes a minor allele frequency and a significance P value; and screening each of the N original data pairs according to the minor allele frequency and the significance P value to obtain N data pairs.
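A minimal sketch of the matching step, assuming that "matching" means joining the first statistics with each second statistic on a shared SNP identifier. The dictionary layout, the field names and the function `match_statistics` are hypothetical and only illustrate how N original data pairs could be formed.

```python
def match_statistics(first_stats, second_stats_list):
    """Build one original data pair per second statistic.

    first_stats:       {snp_id: {"beta": ..., "p": ..., "maf": ...}} for SNP-lifestyle associations
    second_stats_list: list of N dicts {snp_id: {"beta": ..., "se": ...}} for SNP-disease associations
    Returns a list of N dicts keeping only SNPs present on both sides.
    """
    pairs = []
    for second in second_stats_list:
        shared = first_stats.keys() & second.keys()
        pairs.append({
            snp: {"lifestyle": first_stats[snp], "disease": second[snp]}
            for snp in sorted(shared)
        })
    return pairs

# Toy inputs with N = 2 second statistics (assumed values).
first = {
    "rs1": {"beta": 0.10, "p": 1e-9, "maf": 0.20},
    "rs2": {"beta": 0.05, "p": 1e-3, "maf": 0.30},
}
seconds = [
    {"rs1": {"beta": 0.04, "se": 0.01}, "rs3": {"beta": 0.02, "se": 0.01}},
    {"rs1": {"beta": 0.03, "se": 0.01}, "rs2": {"beta": 0.01, "se": 0.02}},
]
original_pairs = match_statistics(first, seconds)
```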

Optionally, the processing module 902 is further configured to determine, for each of the N original data pairs, whether the significance P value of each group of association statistics of single nucleotide polymorphisms with lifestyle habits is less than a first threshold, and whether the minor allele frequency of that group is less than a second threshold; when, for a first single nucleotide polymorphism in an original data pair, the significance P value of its association statistics with lifestyle habits is less than the first threshold and the minor allele frequency is less than the second threshold, to select the association statistics of the first single nucleotide polymorphism with lifestyle habits in that original data pair; and to determine the N data pairs according to the association statistics of the first single nucleotide polymorphism with lifestyle habits selected in each original data pair.
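The threshold screening can be sketched as a simple filter. The comparison directions below follow the text above (both the significance P value and the minor allele frequency must be less than their thresholds); the record layout, the function name `screen_pair` and the threshold values in the usage lines are assumptions for illustration.

```python
def screen_pair(lifestyle_stats, p_threshold, maf_threshold):
    """Keep SNP-lifestyle records whose P value and minor allele frequency
    are both below the given thresholds, as stated in the text above.

    lifestyle_stats: {snp_id: {"p": float, "maf": float, ...}}
    """
    return {
        snp: rec
        for snp, rec in lifestyle_stats.items()
        if rec["p"] < p_threshold and rec["maf"] < maf_threshold
    }

# Toy records and placeholder thresholds (assumed values).
stats = {
    "rs1": {"p": 1e-9, "maf": 0.20},
    "rs2": {"p": 1e-3, "maf": 0.10},
    "rs3": {"p": 1e-10, "maf": 0.45},
}
selected = screen_pair(stats, p_threshold=5e-8, maf_threshold=0.30)
```

Applying `screen_pair` to each of the N original data pairs yields the candidate records from which the N data pairs are then determined.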

Optionally, the processing module 902 is further configured to match the association statistics of the first single nucleotide polymorphism with lifestyle habits in each original data pair against the first database and the second database to determine the N data pairs, wherein the single nucleotide polymorphisms in each of the N data pairs are independent of one another, and the single nucleotide polymorphisms in the N data pairs do not include any single nucleotide polymorphism in the second database.
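One possible reading of the database matching is a pair of set operations: keep only SNPs listed in the first database (mutually independent SNPs) and drop any SNP listed in the second database (SNPs significantly associated with other lifestyle habits). The sketch below assumes that reading; the function name `apply_databases` and the identifiers are hypothetical.

```python
def apply_databases(selected_snps, independent_snps, other_habit_snps):
    """Filter screened SNPs against the two reference databases.

    selected_snps:    SNP ids that passed the threshold screening
    independent_snps: ids in the first database (mutually independent SNPs)
    other_habit_snps: ids in the second database (associated with other lifestyle habits)
    """
    independent = set(independent_snps)
    excluded = set(other_habit_snps)
    # Preserve the input order while applying both conditions.
    return [s for s in selected_snps if s in independent and s not in excluded]

instruments = apply_databases(
    ["rs1", "rs2", "rs3", "rs4"],
    independent_snps={"rs1", "rs2", "rs4"},
    other_habit_snps={"rs2"},
)
```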

Optionally, the processing module 902 is further configured to randomly select a second statistic from the N second statistics; to match the selected second statistic with the third database to obtain a third statistic, wherein the third statistic comprises at least one group of association statistics of single nucleotide polymorphisms with lifestyle habits, and the single nucleotide polymorphisms in the third statistic are single nucleotide polymorphisms of the International HapMap Project; and to input the third statistic and the genotype data of the plurality of users into the first model to obtain the polygenic risk score.
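The polygenic risk score step can be illustrated with the classic weighted-sum form, score = sum over SNPs of effect size times allele dosage. This is an assumption: the "first model" is not defined in this passage and may be more elaborate. The function name `polygenic_risk_scores`, the `seed` parameter and the data layout are hypothetical.

```python
import random

def polygenic_risk_scores(second_stats_list, hapmap_snps, genotypes, seed=None):
    """Weighted-sum polygenic risk score per user.

    second_stats_list: list of N dicts {snp_id: effect size}
    hapmap_snps:       SNP ids in the third database
    genotypes:         {user_id: {snp_id: allele dosage 0, 1 or 2}}
    """
    rng = random.Random(seed)
    chosen = rng.choice(second_stats_list)  # random selection of one second statistic
    # "Third statistic": the chosen statistic restricted to SNPs in the third database.
    third = {snp: beta for snp, beta in chosen.items() if snp in hapmap_snps}
    return {
        user: sum(beta * dosages.get(snp, 0) for snp, beta in third.items())
        for user, dosages in genotypes.items()
    }

scores = polygenic_risk_scores(
    [{"rs1": 0.2, "rs2": -0.1, "rs3": 0.5}],
    hapmap_snps={"rs1", "rs2"},
    genotypes={"u1": {"rs1": 2, "rs2": 1, "rs3": 2}, "u2": {"rs1": 0, "rs2": 0}},
)
```

Missing genotypes are treated as a dosage of 0 in this sketch, which is a simplification.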

Optionally, the processing module 902 is further configured to perform data processing on the disease time of the plurality of users to obtain ordered disease time; training the model to be trained based on the ordered disease time and historical disease codes of a plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time; and inputting the ordered illness time and the historical disease codes of the plurality of users into a second model to obtain the historical disease information of the plurality of users.
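The data processing that produces ordered disease times can be sketched as a per-user sort of (time, code) events, with the gaps between consecutive events computed as an input a sequence model could use. The second model itself is not sketched here. The function name `order_disease_history` and the example codes are hypothetical.

```python
from datetime import date

def order_disease_history(codes, times):
    """Sort one user's disease events chronologically.

    codes: historical disease codes, one per event
    times: datetime.date of each event, aligned with codes
    Returns (ordered codes, ordered times, day gaps between consecutive events).
    """
    events = sorted(zip(times, codes))
    ordered_times = [t for t, _ in events]
    ordered_codes = [c for _, c in events]
    gaps = [(b - a).days for a, b in zip(ordered_times, ordered_times[1:])]
    return ordered_codes, ordered_times, gaps

# Illustrative codes and dates only.
ordered_codes, ordered_times, gaps = order_disease_history(
    ["I10", "E11", "I25"],
    [date(2020, 5, 1), date(2018, 3, 1), date(2021, 5, 1)],
)
```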

It should be noted that the implementation of the acquiring module 901 and the processing module 902 may also correspond to the description of the embodiment shown in fig. 1, and may perform the methods and functions performed by the lifestyle characteristic screening module 101, the polygenic risk score calculating module 102, the historical disease information learning module 103 and the disease risk score learning module 104 in the foregoing embodiment.

The foregoing details the disease risk prediction model training system provided in the present application, and how to implement the disease risk prediction model training process using the system, and the deployment manner of the disease risk prediction model training system is described below with reference to fig. 10.

Fig. 10 is a schematic structural diagram of a server provided in an embodiment of the present application. As shown in fig. 10, the server includes a processor 1001, a memory 1002, and a transceiver 1003. The processor 1001, the memory 1002 and the transceiver 1003 can communicate with each other through a communication bus 1004 and transfer instructions and/or data signals; the memory 1002 is used for storing a computer program, and the processor 1001 is used for calling and running the computer program from the memory 1002 to control the transceiver 1003 to send and receive signals.

The processor 1001 may correspond to the processing module 902 in fig. 9, and the processor 1001 and the memory 1002 may be combined into one processing device, and the processor 1001 is configured to execute the program code stored in the memory 1002 to implement the functions. In particular implementations, the memory 1002 may also be integrated within the processor 1001 or separate from the processor 1001.

The transceiver 1003 may also be referred to as a transceiver unit or a transceiver module. The transceiver 1003 may include a receiver (or receiving circuitry) and a transmitter (or transmitting circuitry), wherein the receiver is for receiving signals and the transmitter is for transmitting signals.

It should be appreciated that the server shown in fig. 10 is capable of implementing the various processes involved in the disease risk prediction model training method in the method embodiment shown in fig. 2. The operations and/or functions of the respective modules in the server are respectively for implementing the corresponding flows in the above-mentioned method embodiments. Reference is specifically made to the description of the above method embodiments, and detailed descriptions are omitted here as appropriate to avoid redundancy.

The processor 1001 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary modules described in connection with the present disclosure. The processor 1001 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, or a combination of a digital signal processor and a microprocessor.

The communication bus 1004 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or only one type of bus. The communication bus 1004 is used to enable connection and communication between these components.

The memory 1002 in the embodiments of the present application may include random access memory, such as nonvolatile random access memory (NVRAM), phase change RAM (PRAM), or magnetoresistive RAM (MRAM), and may further include other nonvolatile memory, such as at least one magnetic disk storage device, an electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid state disk (SSD). The memory 1002 may also optionally be at least one storage device located remotely from the processor 1001. Optionally, the memory 1002 may also store a set of computer program code or configuration information. Optionally, the processor 1001 may also execute a program stored in the memory 1002.

The transceiver 1003 is used for communication of instructions or data with other components. The processor may cooperate with the memory and the transceiver to perform any of the methods and functions of the disease risk prediction model training system of the embodiments of the application described above.

According to the method provided by the embodiment of the application, the application further provides a computer program product, which comprises: a computer program which, when run on a computer, causes the computer to perform the method of any of the embodiments shown in fig. 1 or fig. 2.

According to the method provided in the embodiments of the present application, there is further provided a computer readable medium storing a computer program, which when run on a computer causes the computer to perform the method of any one of the embodiments shown in fig. 1 or fig. 2.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.

The term "plurality" as used in the embodiments herein refers to two or more.

Descriptions such as "first" and "second" in the embodiments of the present application are used only to illustrate and distinguish the objects described. They imply no order, do not specify the number of objects described, and should not be construed as limiting the embodiments of the present application.

The above embodiments further describe in detail the purposes, technical solutions and advantageous effects of the present application. Any modification, equivalent replacement, improvement, etc. made within the principles of the present application should be included in the protection scope of the present application.

Claims

1. A disease risk prediction model training method, comprising:

acquiring first statistics, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease time of the plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users, wherein the first statistics comprise a plurality of groups of association statistics of single nucleotide polymorphisms with lifestyle habits, each of the N second statistics comprises a plurality of groups of association statistics of single nucleotide polymorphisms with a specified disease, the disease identifications are used for marking whether the plurality of users are ill, and N is an integer greater than 0;

determining lifestyle habits causally related to the specified disease according to the first statistics and the N second statistics;

determining a polygenic risk score according to the N second statistics and the genotype data of the plurality of users;

determining historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users;

training a disease risk score model to be trained according to the lifestyle habits causally related to the specified disease, the polygenic risk score, the historical disease information of the plurality of users, the age characteristics of the plurality of users and the disease identifications of the plurality of users to obtain a disease risk score prediction model, wherein the disease risk score prediction model is used for predicting disease risk scores of the users.

2. The method of claim 1, wherein the training the disease risk score model to be trained based on the lifestyle habits causally related to the specified disease, the polygenic risk score, historical disease information for the plurality of users, age characteristics for the plurality of users, and disease identification for the plurality of users, to obtain a disease risk score prediction model comprises:

taking the lifestyle habits causally related to the specified disease, the polygenic risk score, the historical disease information of the plurality of users and the age characteristics of the plurality of users as independent variables of the disease risk score model to be trained, and taking the disease identifications of the plurality of users as the dependent variable of the disease risk score model to be trained;

and training the disease risk score model to be trained based on the independent variable and the dependent variable to obtain the disease risk score prediction model.

3. The method of claim 1, wherein said determining lifestyle habits causally related to the specified disease based on the first statistics and the N second statistics comprises:

obtaining N data pairs according to the first statistics and the N second statistics;

inputting the N data pairs into a Mendelian randomization model to obtain N Mendelian randomization results;

and determining the lifestyle habits causally related to the specified disease according to the N Mendelian randomization results.

4. The method of claim 3, wherein said determining the lifestyle habits causally related to the specified disease according to the N Mendelian randomization results comprises:

analyzing the N Mendelian randomization results to obtain analysis results;

determining whether the lifestyle habits in the first statistics have a causal relationship with the specified disease according to the N Mendelian randomization results and the analysis results;

if so, determining the lifestyle habits in the first statistics to be the lifestyle habits causally related to the specified disease.

5. The method of claim 3, wherein said obtaining N data pairs according to the first statistics and the N second statistics comprises:

matching the first statistics with the N second statistics to obtain N original data pairs, wherein each of the N original data pairs comprises the plurality of groups of association statistics of single nucleotide polymorphisms with lifestyle habits and the plurality of groups of association statistics of single nucleotide polymorphisms with the specified disease, and each group of association statistics of single nucleotide polymorphisms with lifestyle habits comprises a minor allele frequency and a significance P value;

and screening each original data pair of the N original data pairs according to the minor allele frequency and the significance P value to obtain the N data pairs.

6. The method of claim 5, wherein said screening each of the N original data pairs according to the minor allele frequency and the significance P value comprises:

determining whether the significance P value of each group of association statistics of single nucleotide polymorphisms with lifestyle habits in each of the N original data pairs is less than a first threshold, and whether the minor allele frequency of each group of association statistics of single nucleotide polymorphisms with lifestyle habits in each of the N original data pairs is less than a second threshold;

when the significance P value of the association statistics of a first single nucleotide polymorphism with lifestyle habits in each of the N original data pairs is less than the first threshold, and the minor allele frequency of the association statistics of the first single nucleotide polymorphism with lifestyle habits in each of the N original data pairs is less than the second threshold, selecting the association statistics of the first single nucleotide polymorphism with lifestyle habits in each of the N original data pairs;

and determining the N data pairs according to the association statistics of the first single nucleotide polymorphism with lifestyle habits in each original data pair.

7. The method of claim 6, wherein said determining the N data pairs according to the association statistics of the first single nucleotide polymorphism with lifestyle habits in each original data pair comprises:

acquiring a first database and a second database, wherein the first database comprises a plurality of mutually independent single nucleotide polymorphisms, the second database comprises a plurality of groups of association statistics of single nucleotide polymorphisms with other lifestyle habits, each group of association statistics of single nucleotide polymorphisms with other lifestyle habits comprises a significance P value, and the significance P value of each group is smaller than a third threshold;

matching the association statistics of the first single nucleotide polymorphism with lifestyle habits in each original data pair against the first database and the second database, and determining the N data pairs, wherein the single nucleotide polymorphisms in each of the N data pairs are independent of one another, and the single nucleotide polymorphisms in the N data pairs do not include any single nucleotide polymorphism in the second database.

8. The method of any one of claims 1-7, wherein said determining a polygenic risk score from said N second statistics and genotype data of said plurality of users comprises:

obtaining a third database comprising the single nucleotide polymorphisms of the International HapMap Project;

randomly selecting a second statistic from the N second statistics;

matching the selected second statistic with the third database to obtain a third statistic, wherein the third statistic comprises at least one group of association statistics of single nucleotide polymorphisms with lifestyle habits, and the single nucleotide polymorphisms in the third statistic are single nucleotide polymorphisms of the International HapMap Project;

inputting the third statistic and genotype data of the plurality of users into a first model to obtain the polygenic risk score.

9. The method of any one of claims 1-7, wherein determining the historical disease information for the plurality of users based on the historical disease codes for the plurality of users and the time of illness for the plurality of users comprises:

performing data processing on the disease time of the plurality of users to obtain ordered disease time;

training a model to be trained based on the ordered disease time and the historical disease codes of the plurality of users to obtain a second model, wherein the second model is used for predicting future events and event occurrence time;

and inputting the ordered disease time and the historical disease codes of the plurality of users into the second model to obtain the historical disease information of the plurality of users.

10. A disease risk prediction model training device, comprising:

an acquiring module, configured to acquire first statistics, N second statistics, genotype data of a plurality of users, historical disease codes of the plurality of users, disease time of the plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users, wherein the first statistics comprise a plurality of groups of association statistics of single nucleotide polymorphisms with lifestyle habits, each of the N second statistics comprises a plurality of groups of association statistics of single nucleotide polymorphisms with a specified disease, the disease identifications are used for marking whether the plurality of users are ill, and N is an integer greater than 0;

a processing module, configured to determine lifestyle habits causally related to the specified disease according to the first statistics and the N second statistics;

the processing module is further used for determining a polygenic risk score according to the N second statistics and genotype data of the plurality of users;

the processing module is further used for determining historical disease information of the plurality of users according to the historical disease codes of the plurality of users and the disease time of the plurality of users;

the processing module is further configured to train a disease risk score model to be trained according to the lifestyle habit with causal relation to the specified disease, the polygenic risk score, historical disease information of the plurality of users, age characteristics of the plurality of users and disease identifications of the plurality of users, so as to obtain a disease risk score prediction model, where the disease risk score prediction model is used for predicting disease risk scores of the users.