WO2023080379A1

WO2023080379A1 - Disease onset information generating apparatus based on time-dependent correlation using polygenic risk score and method therefor

Info

Publication number: WO2023080379A1
Application number: PCT/KR2022/009116
Authority: WO
Inventors: 김호; 김정오; 김정은; 윤상혁; 이솔; 박승환; 권도형; 차지희; 김나영; 김은교; 박다현; 안지민; 송우정
Original assignee: 주식회사 바스젠바이오
Priority date: 2021-11-02
Filing date: 2022-06-27
Publication date: 2023-05-11
Also published as: KR102382707B9; KR102382707B1

Abstract

The present invention relates to a technique for generating disease onset information based on time-dependent correlation using polygenic risk scores. The magnitude of the correlation at each point in time between the independent and dependent variables to be input into a Cox model is calculated and selected as a time-series characteristic variable; at least one time-series characteristic variable is applied; Cox regression analysis is performed for each group on the checkup result data for the disease-related factors of the many persons to whom the variable is applied, to generate the risk of disease onset for each group; and the change in risk is calculated from the difference in the risk of disease onset calculated for each group to generate disease onset prediction information. The invention thereby aims to provide more accurate disease onset information that reflects the passage of time.

Description

Apparatus and method for generating disease onset information based on time-dependent association using polygenic risk scores

The present invention relates to a technology for generating disease onset information based on time-dependent association using polygenic risk scores. More specifically, it relates to an apparatus and method that: receive genomic data for a large number of people, or a plurality of prior literature, and perform multiple analyses to preprocess the genomic data; receive checkup result data, including the checkup results of a large number of people over time, or a plurality of disease-related data, and perform multiple analyses to preprocess the checkup result data; calculate polygenic risk scores for each genetic mutation by group using the preprocessed genomic data; apply the polygenic risk score for each genetic mutation as a covariate to a time-dependent association calculation model to calculate time-series characteristic variables; and apply at least one calculated time-series characteristic variable to the checkup result data for disease-related factors and perform Cox regression analysis for each group on that data to calculate the risk of disease occurrence for each group.

Biomarkers are indicators, such as genetic mutations, of changes in the body, identified using proteins, DNA, RNA (ribonucleic acid), metabolites, and the like. The importance of technology that can measure them objectively is steadily increasing.

Deriving such biomarkers is in the spotlight as an effective way to diagnose various incurable diseases such as cancer, stroke, and dementia. Because it is not easy to verify the association, conventional techniques have analyzed specific individual genome mutations through GWAS analysis and similar methods, created a personal genome map, selected genetic mutations highly correlated with a specific disease, and defined them as biomarkers.

[Republic of Korea Publication No. 10-2019-0000341 "Customized medicine analysis platform based on personal genome map and analysis method using the same"]

GWAS analysis is an exploratory method for finding associations between genetic mutations and traits (e.g., diseases). Generally, the genetic information of cases (the group with the trait of interest, for example, patients) is compared with that of controls (the group without the trait, for example, a normal group), and a genetic mutation that occurs at a higher frequency in cases is selected as a genetic mutation correlated with the trait.

Because GWAS analysis examines the degree of association at all gene loci, it can be a very useful screening method for finding primary candidate genes related to traits or diseases of interest. However, because it rests on statistical association rather than causation, and identifies candidate genes whose apparent relationship may have arisen by chance, GWAS analysis alone is clearly limited in how accurately it can identify genetic mutations associated with traits.

In addition, many efforts have been made to accurately derive the multiple factors that jointly influence disease induction, but it is very difficult to generalize an analysis pattern from a single analysis of each individual's health status data. Methods that analyze the health status data of many people through data analysis also have the problem that the estimated influence of each factor on the disease cannot be trusted, because the causal relationship between the input values and the result values is unclear.

The present invention preprocesses genomic data and checkup result data through multiple analysis methods, calculates polygenic risk scores for each genetic mutation by group using the preprocessed genomic data, and calculates time-series characteristic variables by applying the polygenic risk score for each genetic mutation as a covariate to a time-dependent correlation calculation model. Applying these variables to the checkup result data for disease-related factors allows time variability to be reflected in the analysis of those factors. Cox regression analysis is then performed for each group on the checkup result data for the disease-related factors of a large number of persons, and the risk of disease occurrence is calculated for each group.

According to an embodiment of the present invention, an apparatus for generating disease onset information based on time-dependent correlation using polygenic risk scores may include: a genomic data pre-processing unit that receives genomic data for a plurality of persons or a plurality of prior literature, performs a plurality of analyses to generate a plurality of disease-causing factor candidate lists, and classifies the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups; a checkup result data pre-processing unit that receives checkup result data, including the checkup results of a plurality of persons over time, or a plurality of disease-related data, performs a plurality of analyses to select at least one disease-related factor, and uses a group trend model to generate a plurality of groups by grouping the plurality of persons based on changes over time in the individual checkup result values included in their checkup result data for the at least one disease-related factor; a polygenic risk score calculation unit that designs a PRS model for each group for the plurality of genetic mutations included in each of the groups classified by the genomic data pre-processing unit and, using the PRS model, calculates a polygenic risk score for each genetic mutation and a polygenic risk score for each group by applying as a weight the association with the number of risk alleles of the genetic mutations in each group; a time-series characteristic parameterization unit that applies the calculated polygenic risk score for each genetic mutation as a covariate to a time-dependent correlation calculation model, inputs into that model the individual checkup result values for each disease-related factor included in the checkup result data of the persons in each group generated by the checkup result data pre-processing unit, calculates the magnitude of the correlation at each point in time between the independent and dependent variables to be input into a Cox model, and selects it as a time-series characteristic variable; a risk calculation unit that applies the at least one time-series characteristic variable to the checkup result data for the disease-related factors and performs Cox regression analysis for each group on the checkup result data of the persons to whom the variable is applied, thereby calculating the risk of disease occurrence for each group; and an onset prediction information generation unit that generates onset prediction information by calculating a change in risk using the difference between the risks of disease occurrence calculated for each group.
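As a point of reference, the per-group risk calculation and the change in risk can be written in standard Cox proportional hazards notation. This is a sketch under assumptions: the symbols $h_{0,g}$, $\beta_g$, $x_i(t)$, $R_g$, and $\Delta R_{g,g'}$ are illustrative and do not appear in the source, and the source does not state which form of the Cox model is used.

```latex
% Hazard for person i in group g, with covariates x_i(t) that include
% the selected time-series characteristic variables (assumed notation).
h_{g}(t \mid x_i) = h_{0,g}(t)\, \exp\!\bigl(\beta_g^{\top} x_i(t)\bigr)

% Hazard ratio associated with a one-unit increase in covariate k.
\mathrm{HR}_{g,k} = \exp(\beta_{g,k})

% Change in risk between two groups g and g', taken as the difference
% of the per-group risk estimates R_g and R_{g'}.
\Delta R_{g,g'} = R_{g} - R_{g'}
```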

According to one embodiment of the present invention, the genomic data pre-processing unit may further include: a disease-causing factor screening unit that receives genomic data for a plurality of persons or a plurality of prior literature and performs a plurality of analyses for selecting disease-causing factor candidates; a disease-causing factor candidate list generating unit that generates a plurality of disease-causing factor candidate lists, each including the plurality of genetic mutations selected as disease-causing factor candidates by one of the analyses; a genetic mutation group classification unit that classifies the genetic mutations included in the candidate lists into a plurality of groups according to the degree to which they overlap across the lists; and a priority level classification unit that divides the classified groups into a plurality of priority levels and generates a genetic mutation list for each level by removing duplicated genetic mutations within each priority level so that only one of each remains.
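The per-level deduplication described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `level_mutation_list`, the variant identifiers, and the group contents are all hypothetical.

```python
def level_mutation_list(groups_in_level):
    """Merge the groups of one priority level, keeping one copy of each mutation.

    dict.fromkeys preserves first-seen order while dropping repeats.
    """
    ordered = (mutation for group in groups_in_level for mutation in group)
    return list(dict.fromkeys(ordered))


# Hypothetical priority levels, each holding one or more groups of variant ids.
levels = {
    1: [["rs0001", "rs0005"]],
    2: [["rs0007", "rs0009"], ["rs0009", "rs0011"]],
}

lists_by_level = {level: level_mutation_list(groups) for level, groups in levels.items()}
print(lists_by_level)
```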

According to an embodiment of the present invention, the genomic data pre-processing unit may receive genomic data for a plurality of persons or a plurality of prior literature and perform at least one of GWAS analysis, AI analysis, and meta-analysis for the target disease.

According to an embodiment of the present invention, the disease-causing factor screening unit may further include a GWAS analysis performing unit that receives genomic data for a plurality of persons, performs genome-wide association analysis for a target disease, compares the P value calculated for each genetic mutation with a preset threshold, and selects the plurality of genetic mutations whose P values fall below the threshold as disease-causing factor candidates.
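The threshold comparison can be illustrated with a short sketch. The variant identifiers, P values, and the threshold of 5e-8 are assumptions chosen for illustration; the source only refers to a "preset threshold". The sketch also treats a P value equal to the threshold as passing, which is a choice the source does not specify.

```python
# Hypothetical per-variant GWAS results: variant id -> P value.
gwas_p_values = {
    "rs0001": 3.2e-9,
    "rs0002": 4.0e-3,
    "rs0003": 1.1e-8,
    "rs0004": 0.47,
}

# Assumed threshold; the source does not give a value.
P_THRESHOLD = 5e-8


def select_candidates(p_values, threshold):
    """Return variant ids with P value at or below the threshold, most significant first."""
    hits = [variant for variant, p in p_values.items() if p <= threshold]
    return sorted(hits, key=lambda variant: p_values[variant])


candidates = select_candidates(gwas_p_values, P_THRESHOLD)
print(candidates)
```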

According to an embodiment of the present invention, the disease-causing factor screening unit may further include an AI analysis performing unit that inputs genomic data for a plurality of persons labeled with a disease into an artificial neural network-based disease-causing factor prediction model, outputs an importance score for each genetic mutation, and selects as disease-causing factor candidates the genetic mutations whose importance score exceeds a preset score.

According to an embodiment of the present invention, the disease-causing factor screening unit may further include a meta-analysis performing unit that inputs into a meta-analysis model a plurality of prior literature published on the subject of the effect of genetic mutations on a target disease, calculates for each prior document the effect size of the genetic mutation, applies the reciprocal of the variance of each calculated effect size as a weight to that effect size to measure a target disease influence score for each genetic mutation, and selects a plurality of genetic mutations as disease-causing factor candidates based on that score.

According to an embodiment of the present invention, the GWAS analysis performing unit may determine, for the plurality of genetic mutations selected as disease-causing factor candidates, whether the positions of the genetic mutations are in linkage disequilibrium and, according to the result, generate final disease-causing factor candidates by selecting only one representative genetic mutation for each locus.
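One common way to keep a single representative per locus is greedy selection by significance, sketched below. This is an assumed technique, not one the source names: the function `select_representatives`, the pairwise r² table, and the r² cutoff of 0.5 are all hypothetical.

```python
def select_representatives(p_values, r2, r2_threshold=0.5):
    """Greedy selection: walk variants from most to least significant and
    keep a variant only if it is not in LD with one already kept."""
    kept = []
    for variant in sorted(p_values, key=p_values.get):
        in_ld = any(
            r2.get(frozenset((variant, other)), 0.0) >= r2_threshold
            for other in kept
        )
        if not in_ld:
            kept.append(variant)
    return kept


# Hypothetical candidates and pairwise linkage disequilibrium (r^2) values.
p_values = {"rs0001": 3.2e-9, "rs0003": 1.1e-8, "rs0005": 2.0e-8}
r2 = {
    frozenset(("rs0001", "rs0003")): 0.92,  # same locus, strongly linked
    frozenset(("rs0001", "rs0005")): 0.03,
    frozenset(("rs0003", "rs0005")): 0.05,
}

representatives = select_representatives(p_values, r2)
print(representatives)
```

Here `rs0003` is dropped because it is strongly linked to the more significant `rs0001`, while `rs0005` is kept as the representative of a separate locus.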

According to an embodiment of the present invention, in the AI analysis performing unit, the genomic data for a plurality of persons labeled with a disease may include genetic mutation identification codes, covariate information, and target disease information.

According to an embodiment of the present invention, in the AI analysis performing unit, the artificial neural network-based disease-causing factor prediction model may be trained to receive the genetic mutation identification codes, covariate information, and target disease information included in the genomic data for a plurality of persons and to output an importance score for each genetic mutation for the target disease.

According to an embodiment of the present invention, the AI analysis performing unit may calculate the importance score for each genetic mutation by randomly shuffling the order of each genetic mutation in turn, creating a model in which the genetic mutation under evaluation is defined as noise, and quantifying the model's dependence on that genetic mutation.
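The description above resembles permutation importance, in which one input column is shuffled so that it carries only noise and the resulting drop in model performance is taken as the model's dependence on that input. The sketch below illustrates that general technique under this assumption; the toy model, the data, and all names are hypothetical and stand in for the neural network described in the source.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(1 for row, y in zip(rows, labels) if model(row) == y) / len(rows)


def permutation_importance(model, rows, labels, column, rng, repeats=20):
    """Average drop in accuracy when one column is randomly shuffled."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats


def toy_model(row):
    """Hypothetical stand-in model: predicts disease from variant 0 only."""
    return 1 if row[0] >= 1 else 0


# Each row holds risk allele counts for two hypothetical variants.
rows = [(0, 2), (1, 0), (2, 1), (0, 0), (1, 2), (0, 1), (2, 0), (0, 2)]
labels = [toy_model(row) for row in rows]

rng = random.Random(0)
importance = {c: permutation_importance(toy_model, rows, labels, c, rng) for c in (0, 1)}
print(importance)
```

Shuffling variant 1, which the toy model ignores, leaves accuracy unchanged, so its importance is zero; shuffling variant 0 degrades accuracy and yields a positive score.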

According to an embodiment of the present invention, the meta-analysis performing unit may calculate an odds ratio and a confidence interval for each prior document and, based on them, estimate the effect size of the genetic mutation on the target disease for each prior document.

According to an embodiment of the present invention, the meta-analysis performing unit may calculate a weight for the effect size of each prior document through inverse variance estimation, apply that weight to the odds ratio calculated for each prior document, and calculate a target disease influence score by summing the weighted odds ratios.
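Inverse-variance weighting can be sketched as below. This follows the conventional fixed-effect approach, which is an assumption here: it pools on the log odds ratio scale, recovers each standard error from a 95% confidence interval, and normalizes by the sum of weights, whereas the source only describes summing weighted odds ratios. The study values and function names are hypothetical.

```python
import math


def log_or_and_se(odds_ratio, ci_lower, ci_upper, z=1.96):
    """Log odds ratio and its standard error, recovered from a 95% CI."""
    log_or = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_or, se


def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance pooling of per-study odds ratios."""
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, ci_lower, ci_upper in studies:
        log_or, se = log_or_and_se(odds_ratio, ci_lower, ci_upper)
        weight = 1.0 / se ** 2  # reciprocal of the variance
        weighted_sum += weight * log_or
        weight_total += weight
    return math.exp(weighted_sum / weight_total)


# Hypothetical (odds ratio, CI lower, CI upper) for one variant in three studies.
studies = [(1.30, 1.10, 1.54), (1.18, 0.95, 1.47), (1.45, 1.05, 2.00)]
pooled = pooled_odds_ratio(studies)
print(round(pooled, 3))
```

Studies with narrower confidence intervals receive larger weights, so the pooled estimate sits closer to the most precise study.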

According to an embodiment of the present invention, the genetic mutation group classification unit may classify the genetic mutations included in the three disease-causing factor candidate lists, generated by the GWAS analysis, the AI analysis, and the meta-analysis respectively, into nine groups according to the degree to which they overlap.

According to an embodiment of the present invention, the priority level classification unit may classify the nine groups into priority levels 1, 2, and 3, where level 1 includes one group, level 2 includes four groups, and level 3 includes four groups.

According to an embodiment of the present invention, in the polygenic risk score calculation unit, the association with the number of risk alleles of the genetic mutations in each group may be the association with the number of risk alleles of the genetic mutations in each group derived from the results of the GWAS analysis.
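A polygenic risk score is conventionally computed as a weighted sum of risk allele counts, with per-variant weights taken from GWAS effect estimates. The sketch below shows that conventional form under the assumption that it matches the PRS model referred to here; the weights, genotypes, and function names are hypothetical.

```python
def per_mutation_scores(allele_counts, weights):
    """Contribution of each variant: GWAS-derived weight times risk allele count (0, 1, or 2)."""
    return {variant: weights[variant] * allele_counts.get(variant, 0) for variant in weights}


def polygenic_risk_score(allele_counts, weights):
    """Score for a group of variants: the sum of the per-variant contributions."""
    return sum(per_mutation_scores(allele_counts, weights).values())


# Hypothetical GWAS-derived weights for the variants in one group.
weights = {"rs0001": 0.25, "rs0005": 0.5}

# Hypothetical risk allele counts for one person.
person = {"rs0001": 2, "rs0005": 1}

contributions = per_mutation_scores(person, weights)
score = polygenic_risk_score(person, weights)
print(contributions, score)
```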

According to an embodiment of the present invention, the apparatus may further include a PRS model verification unit that verifies the PRS model according to whether it is for a continuous target disease or a discrete target disease and thereby determines whether to use or redesign the PRS model.

According to an embodiment of the present invention, the checkup result data pre-processing unit may further include: a correlation analysis performing unit that receives checkup result data, including the checkup results of a plurality of persons over time, or a plurality of disease-related data, and performs a plurality of analyses for selecting disease-related factor candidates; a disease-related factor selection unit that selects at least one disease-related factor according to the degree of overlap among the plurality of disease-related factors selected as candidates by each of the analyses; a pre-processing unit that processes data according to preset pre-processing criteria for any disease-related factor that requires secondary processing, among the checkup result data of the plurality of persons for the selected at least one disease-related factor; and a data group classification unit that uses a group trend model to generate a plurality of groups by grouping the plurality of persons based on changes over time in the individual checkup result values included in their checkup result data for the at least one disease-related factor.

According to an embodiment of the present invention, the correlation analysis performing unit may receive checkup result data, including the checkup results of a plurality of persons over time, or a plurality of disease-related data, and perform at least one of disease correlation analysis, big data analysis, and meta-analysis for a target disease.

According to an embodiment of the present invention, the correlation analysis performing unit may further include a disease correlation analysis unit that performs, on checkup result data including the checkup results of a plurality of persons over time, a correlation analysis of a plurality of disease-related factors with respect to the likelihood of onset of a target disease, and selects the disease-related factors found to be highly correlated as disease-related factor candidates.

According to an embodiment of the present invention, the correlation analysis performing unit may further include a big data analysis unit that collects a plurality of data by crawling a database in which text-based disease-related data is stored and selects disease-related factor candidates by performing text mining on the collected data.

According to an embodiment of the present invention, the correlation analysis performing unit may further include a meta-analysis performing unit that inputs into a meta-analysis model a plurality of disease-related data on the subject of the effect of disease-related factors on a target disease, calculates an effect size for each disease-related factor for each of the plurality of disease-related data, and selects disease-related factor candidates according to the effect size.

According to an embodiment of the present invention, the disease-related factor selection unit may compare the plurality of disease-related factor candidates generated by performing at least one of disease correlation analysis, big data analysis, and meta-analysis, and select as disease-related factors only those included in all of the candidate sets.
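Selecting only the factors that appear in every candidate set is a set intersection, as the following sketch shows. The factor names and the function `select_common_factors` are hypothetical.

```python
def select_common_factors(candidate_lists):
    """Return the factors present in every candidate list, in sorted order."""
    if not candidate_lists:
        return []
    common = set(candidate_lists[0])
    for candidates in candidate_lists[1:]:
        common &= set(candidates)
    return sorted(common)


# Hypothetical candidates from the three analyses.
disease_correlation = ["bmi", "fasting_glucose", "systolic_bp"]
big_data = ["fasting_glucose", "smoking", "systolic_bp"]
meta_analysis = ["systolic_bp", "fasting_glucose"]

selected = select_common_factors([disease_correlation, big_data, meta_analysis])
print(selected)
```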

According to an embodiment of the present invention, the pre-processing unit may collect from the checkup result data the individual checkup result values for the selected at least one disease-related factor and arrange the collected values in time series, thereby generating time-series checkup data for each checkup target period.
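Arranging checkup values in time series can be sketched as follows. The record layout `(person_id, checkup_date, factor, value)`, the factor names, and the function `build_time_series` are assumptions for illustration; ISO-formatted date strings are used so that plain sorting gives chronological order.

```python
from collections import defaultdict

# Hypothetical checkup records: (person_id, checkup_date, factor, value).
records = [
    ("p1", "2019-03-02", "fasting_glucose", 101.0),
    ("p1", "2015-04-11", "fasting_glucose", 92.0),
    ("p2", "2017-06-20", "fasting_glucose", 88.0),
    ("p1", "2017-03-30", "fasting_glucose", 96.0),
    ("p1", "2017-03-30", "systolic_bp", 128.0),
    ("p2", "2019-07-01", "fasting_glucose", 90.0),
]


def build_time_series(records, factor):
    """Collect one factor's values per person and order them by checkup date."""
    by_person = defaultdict(list)
    for person_id, checkup_date, name, value in records:
        if name == factor:
            by_person[person_id].append((checkup_date, value))
    return {person_id: sorted(obs) for person_id, obs in by_person.items()}


series = build_time_series(records, "fasting_glucose")
print(series)
```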

According to an embodiment of the present invention, when a disease-related factor included in the selected at least one disease-related factor is classified, according to the preset pre-processing criteria, as one whose individual checkup result values cannot be used directly as a trend criterion or a judgment criterion, the pre-processing unit may calculate or reprocess the checkup result data according to the preset pre-processing criteria so that it can be used as such a criterion, thereby generating time-series checkup data for each checkup target period.

According to an embodiment of the present invention, in the pre-processing unit, the preset pre-processing criteria may include information on the types of disease-related factors for which a result value cannot be produced by inputting individual checkup result values into the group trend model without pre-processing, and information on the pre-processing method for those disease-related factors.

According to an embodiment of the present invention, the data group classification unit may estimate the trajectory shape of the individual checkup result values for each disease-related factor included in the checkup result data of the persons in each group and verify the classification suitability of the groups by comparing the differences in trajectory shape between groups.

According to an embodiment of the present invention, a method for generating disease onset information based on time-dependent association using polygenic risk scores is performed by a disease onset information generating apparatus including at least one processor, and may include: receiving genomic data for a plurality of persons or a plurality of prior literature, performing a plurality of analyses to generate a plurality of disease-causing factor candidate lists, and classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups; receiving checkup result data, including the checkup results of a plurality of persons over time, or a plurality of disease-related data, performing a plurality of analyses to select at least one disease-related factor, and using a group trend model to generate a plurality of groups by grouping the plurality of persons based on changes over time in the individual checkup result values included in their checkup result data for the at least one disease-related factor; designing a PRS model for each group for the plurality of genetic mutations included in each of the groups classified for the genomic data and, using the PRS model, calculating a polygenic risk score for each genetic mutation and a polygenic risk score for each group by applying as a weight the association with the number of risk alleles of the genetic mutations in each group; applying the calculated polygenic risk score for each genetic mutation as a covariate to a time-dependent correlation calculation model, inputting into that model the individual checkup result values for each disease-related factor included in the checkup result data of the persons in each group generated for the checkup result data, calculating the magnitude of the correlation at each point in time between the independent and dependent variables to be input into a Cox model, and selecting it as a time-series characteristic variable; applying the calculated time-series characteristic variables to the checkup result data for the disease-related factors and performing Cox regression analysis for each group on the checkup result data of the persons to whom the variables are applied, thereby calculating the risk of disease occurrence for each group; and generating onset prediction information by calculating a change in risk using the difference between the risks of disease occurrence calculated for each group.

According to an embodiment of the present invention, the step of classifying the genetic mutations into a plurality of groups may further include: receiving genomic data for a plurality of persons or a plurality of prior literature and performing a plurality of analyses for selecting disease-causing factor candidates; generating a plurality of disease-causing factor candidate lists, each including the plurality of genetic mutations selected as disease-causing factor candidates by one of the analyses; classifying the genetic mutations included in the candidate lists into a plurality of groups according to the degree to which they overlap across the lists; and dividing the classified groups into a plurality of priority levels and generating a genetic mutation list for each level by removing duplicated genetic mutations within each priority level so that only one of each remains.

According to an embodiment of the present invention, the step of performing the plurality of analyses may receive genomic data for a plurality of persons or a plurality of prior literature and perform at least one of GWAS analysis, AI analysis, and meta-analysis for a target disease.

According to an embodiment of the present invention, the step of performing the plurality of analyses may further include receiving genomic data for a plurality of persons, performing genome-wide association analysis for a target disease, comparing the P value calculated for each genetic mutation with a preset threshold, and selecting the plurality of genetic mutations whose P values fall below the threshold as disease-causing factor candidates.

According to an embodiment of the present invention, the step of performing the plurality of analyses may further include inputting genomic data for a plurality of persons labeled with a disease into an artificial neural network-based disease-causing factor prediction model, outputting an importance score for each genetic mutation, and selecting as disease-causing factor candidates the genetic mutations whose importance score exceeds a preset score.

According to an embodiment of the present invention, the step of performing the plurality of analyses may further include inputting into a meta-analysis model a plurality of prior literature published on the subject of the effect of genetic mutations on a target disease, calculating for each prior document the effect size of the genetic mutation, applying the reciprocal of the variance of each calculated effect size as a weight to that effect size to measure a target disease influence score for each genetic mutation, and selecting a plurality of genetic mutations as disease-causing factor candidates based on that score.

According to an embodiment of the present invention, the step of selecting the plurality of genetic mutations below the threshold as disease-causing factor candidates may determine, for the selected genetic mutations, whether the positions of the genetic mutations are in linkage disequilibrium and, according to the result, generate final disease-causing factor candidates by selecting only one representative genetic mutation for each locus.

According to an embodiment of the present invention, in the step of selecting the genetic mutations whose importance score exceeds the preset score as disease-causing factor candidates, the genomic data for a plurality of persons labeled with a disease may include genetic mutation identification codes, covariate information, and target disease information.

According to an embodiment of the present invention, in the step of selecting the genetic mutations whose importance score exceeds the preset score as disease-causing factor candidates, the artificial neural network-based disease-causing factor prediction model may be trained to receive the genetic mutation identification codes, covariate information, and target disease information included in the genomic data for a plurality of persons and to output an importance score for each genetic mutation for the target disease.

According to an embodiment of the present invention, in the step of selecting the genetic mutations whose importance score exceeds the preset score as disease-causing factor candidates, the importance score for each genetic mutation may be calculated by randomly shuffling the order of each genetic mutation in turn, creating a model in which the genetic mutation under evaluation is defined as noise, and quantifying the model's dependence on that genetic mutation.

According to an embodiment of the present invention, the step of selecting a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic mutation may calculate an odds ratio and a confidence interval for each prior document and, based on them, estimate the effect size of the genetic mutation on the target disease for each prior document.

According to an embodiment of the present invention, the step of selecting a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic mutation may calculate a weight for the effect size of each prior document through inverse variance estimation, apply that weight to the odds ratio calculated for each prior document, and calculate a target disease influence score by summing the weighted odds ratios.

According to an embodiment of the present invention, the step of classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups may classify the genetic mutations included in the three disease-causing factor candidate lists, generated by the GWAS analysis, the AI analysis, and the meta-analysis respectively, into nine groups according to the degree to which they overlap.

According to an embodiment of the present invention, in the step of classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups, the nine groups may be classified into priority levels 1, 2, and 3, where level 1 includes one group, level 2 includes four groups, and level 3 includes four groups.

According to an embodiment of the present invention, in the step of calculating the polygenic risk score, the association with the number of risk alleles of the genetic mutations in each group may be the association with the number of risk alleles of the genetic mutations in each group derived from the results of the GWAS analysis.

According to an embodiment of the present invention, the method may further include determining whether to use or redesign the PRS model by verifying the PRS model according to whether it is for a continuous target disease or a discrete target disease.

According to an embodiment of the present invention, the step of generating the plurality of groups by grouping the plurality of persons may further include: receiving checkup result data, including the checkup results of a plurality of persons over time, or a plurality of disease-related data, and performing a plurality of analyses for selecting disease-related factor candidates; selecting at least one disease-related factor according to the degree of overlap among the plurality of disease-related factors selected as candidates by each of the analyses; processing data according to preset pre-processing criteria for any disease-related factor that requires secondary processing, among the checkup result data of the plurality of persons for the selected at least one disease-related factor; and using a group trend model to generate a plurality of groups by grouping the plurality of persons based on changes over time in the individual checkup result values included in their checkup result data for the at least one disease-related factor.

According to an embodiment of the present invention, the step of performing a plurality of analyzes for selecting disease-related factor candidates may include receiving examination result data including examination results of a plurality of people over time or a plurality of disease-related data. For the target disease, at least one of disease association analysis, big data analysis, and meta-analysis may be performed.

According to an embodiment of the present invention, the step of performing a plurality of analyzes for selecting disease-related factor candidates may include determining the possibility of onset of a target disease targeting examination result data including examination results of a plurality of persons over time. The method may further include performing an association analysis of a plurality of disease-related factors for the disease, and selecting disease-related factors derived to be highly correlated as disease-related factor candidates.

According to an embodiment of the present invention, the step of performing a plurality of analyzes for selecting disease-related factor candidates includes collecting a plurality of data by using crawling from a database in which text-based disease-related data are stored, and collecting a plurality of collected data. The method may further include selecting disease-related factor candidates by performing text mining on the data of .

According to an embodiment of the present invention, the step of performing a plurality of analyzes for selecting disease-related factor candidates includes inputting a plurality of disease-related data on the subject of a target disease and its effect on disease-related factors into a meta-analysis model , Calculating an effect size for each disease-related factor for each of the plurality of disease-related data, and selecting a disease-related factor candidate according to the effect size.

According to an embodiment of the present invention, the step of performing a plurality of analyzes for selecting disease-related factor candidates includes a plurality of disease-related factors generated by performing at least one or more of disease correlation analysis, big data analysis, and meta-analysis. Only disease-related factors included in all of the disease-related factor candidates generated by comparing the candidates may be selected as disease-related factors.
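The selection rule described above (keep only the factors that every analysis proposed) amounts to a set intersection. The following is a minimal sketch under that reading; the function name, the analysis labels, and the factor names are hypothetical and chosen only for illustration.

```python
def select_common_factors(candidates_by_analysis):
    """Return the factors that appear in every analysis's candidate set."""
    candidate_sets = [set(c) for c in candidates_by_analysis.values()]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


# Hypothetical candidate sets produced by three analyses.
candidates = {
    "association_analysis": ["fasting_glucose", "bmi", "systolic_bp"],
    "big_data_analysis": ["fasting_glucose", "bmi", "hdl"],
    "meta_analysis": ["bmi", "fasting_glucose", "ldl"],
}
selected = select_common_factors(candidates)
print(sorted(selected))
```

If only one or two of the analyses are performed, the same function applies unchanged, since the intersection is taken over whatever sets are supplied.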

According to an embodiment of the present invention, in the step of processing the data according to the preset pre-processing criteria, individual checkup result values for the selected at least one disease-related factor may be collected from the checkup result data, and pre-processing that arranges the collected individual checkup result values in time series may be performed to generate time-series checkup data for each period for all checkup subjects.
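As a rough illustration of arranging checkup values in time series, the sketch below groups one factor's values by person and orders them by checkup date. The record layout (person ID, ISO-formatted date, factor name, value) and all names are assumptions made for this example; the source does not specify a data format.

```python
from collections import defaultdict


def build_time_series(records, factor):
    """Group one factor's values by person and order them by checkup date."""
    series = defaultdict(list)
    for person_id, checkup_date, name, value in records:
        if name == factor:
            series[person_id].append((checkup_date, value))
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return {pid: [v for _, v in sorted(points)] for pid, points in series.items()}


# Hypothetical checkup records, deliberately out of order.
records = [
    ("p1", "2020-03-01", "fasting_glucose", 101.0),
    ("p1", "2018-03-01", "fasting_glucose", 95.0),
    ("p1", "2019-03-01", "bmi", 24.1),
    ("p2", "2019-05-10", "fasting_glucose", 110.0),
]
series = build_time_series(records, "fasting_glucose")
print(series)
```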

According to an embodiment of the present invention, in the step of processing the data according to the pre-processing criteria, if a disease-related factor included in the selected at least one disease-related factor is classified as one whose individual checkup result values cannot be used as they are as a trend criterion or reference, pre-processing may be performed to calculate or reprocess the checkup result data according to the preset pre-processing criteria so that it can be used as a trend criterion or reference, thereby generating time-series checkup data for each period for all checkup subjects.

According to an embodiment of the present invention, in the step of processing the data according to the preset pre-processing criteria, the preset pre-processing criteria may include information on the types of disease-related factors for which result values cannot be generated by inputting individual checkup result values into the group trend model without pre-processing, and information on the pre-processing method for each such disease-related factor.

According to an embodiment of the present invention, in the step of processing the data according to the preset pre-processing criteria, the shape of the trajectory may be estimated for the individual checkup result values of each disease-related factor included in the checkup result data of the persons in each group, and the suitability of the classification of the groups may be verified by comparing the differences in trajectory shape between groups.

According to the device for generating disease onset information based on time-dependent correlation using multi-gene risk scores implemented according to an embodiment of the present invention, disease onset information including onset prediction information is generated as follows. Using the pre-processed genomic data and examination result data, a PRS model is designed for each group for the plurality of genetic mutations included in each of the classified groups, and the multi-gene risk scores calculated using the PRS model are applied as covariates to a time-dependent correlation calculation model. The individual checkup result values for each disease-related factor included in the examination result data are input to this model, and the magnitude of the correlation at each time point between the independent variable and the dependent variable to be input to the COX model is calculated and selected as a time-series characteristic variable. The at least one time-series characteristic variable calculated in this way is applied to the examination result data for the disease-related factors of the plurality of persons, COX regression analysis is performed on the applied data for each group, and the change in risk is calculated from the difference in the risk of disease occurrence between groups to generate onset prediction information. As a result, relatively more accurate disease onset information that reflects the passage of time can be provided.

FIG. 1 is a block diagram of an apparatus for generating disease onset information based on time-dependent association using polygenic risk scores implemented according to a first embodiment of the present invention.

FIG. 2 is a detailed configuration diagram of the genome data pre-processing unit shown in FIG. 1.

FIG. 3 is a detailed configuration diagram of the disease-causing factor screening unit shown in FIG. 2.

FIG. 4 is a detailed configuration diagram of the examination result data pre-processing unit shown in FIG. 1.

FIG. 5 is a detailed configuration diagram of the disease association analysis unit shown in FIG. 4.

FIG. 6 is a block diagram of an apparatus for generating disease onset information based on time-dependent association using polygenic risk scores implemented according to a second embodiment of the present invention.

FIG. 7 is a diagram illustrating the selection of disease-inducing factor candidates using a Manhattan plot generated as a result of GWAS analysis according to an embodiment of the present invention.

FIG. 8 is a diagram showing the data table format of the result data generated by performing GWAS analysis according to an embodiment of the present invention.

FIG. 9 is a diagram showing the data format of disease-labeled genomic data for a plurality of persons that is input to the artificial neural network-based disease-causing factor prediction model to perform AI analysis according to an embodiment of the present invention.

FIG. 10 is a diagram showing the odds ratio (OR) calculated through meta-analysis for each prior document describing the association between a specific genetic variant and a disease, and the target disease influence score of the specific genetic variant, according to an embodiment of the present invention.

FIG. 11 is a diagram showing the criteria for classifying genetic mutations into a plurality of groups according to the degree of overlap among the genetic mutations included in the three disease-inducing factor candidate lists generated by performing GWAS analysis, AI analysis, and meta-analysis, respectively, according to an embodiment of the present invention.

FIG. 12 is a diagram showing the genetic mutations included in the three disease-inducing factor candidate lists classified into nine groups and divided into three priority levels according to an embodiment of the present invention.

FIG. 13 is a diagram showing a graph of life expectancy calculated according to a classified risk level according to an embodiment of the present invention.

FIG. 14 is a flowchart of a method for generating disease onset information based on time-dependent association using polygenic risk scores according to an embodiment of the present invention.

Hereinafter, with reference to the accompanying drawings, embodiments of the present invention will be described in detail so that those skilled in the art can easily carry out the present invention. However, the present invention may be implemented in many different forms and is not limited to the embodiments described herein.

Terms used in the present invention are only used to describe specific embodiments, and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly dictates otherwise.

In the present invention, terms such as "comprise" or "having" are intended to designate the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and it should be understood that they do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in the present invention, should not be interpreted in an idealized or excessively formal sense.

It will also be understood that combinations of each block of the drawings and flowchart drawings can be performed by computer program instructions, and these computer program instructions can be loaded into a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment. Thus, those instructions executed by a processor of a computer or other programmable data processing equipment create means for performing the functions described in the flowchart block(s).

These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement functionality in a particular way, such that the instructions stored in the computer-usable or computer-readable memory can also produce an article of manufacture containing instruction means that perform the functions described in the flowchart block(s).

The computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that run on the computer or other programmable data processing equipment can thereby provide steps for performing the functions described in the flowchart block(s).

Additionally, each block may represent a module, segment, or portion of code that includes one or more executable instructions for executing specified logical function(s).

And it should be noted that in some alternative embodiments it is also possible for the functions mentioned in the blocks to occur out of order. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in reverse order depending on their function.

Here, the term '~unit' used in this embodiment means software or a hardware component such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), and a '~unit' performs certain roles.

However, a '~unit' is not limited to software or hardware. A '~unit' may be configured to reside in an addressable storage medium and may be configured to operate one or more processors.

Therefore, as an example, a '~unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Functions provided within components and '~units' may be combined into a smaller number of components and '~units' or further separated into additional components and '~units'. In addition, components and '~units' may be implemented to operate one or more CPUs in a device or a secure multimedia card.

In describing the embodiments of the present invention in detail, a specific system is used as the main example; however, the main subject matter claimed in this specification can also be applied, with slight modifications that do not depart significantly from the scope disclosed herein, to other communication systems and services having a similar technical background, at the judgment of those skilled in the art.

Hereinafter, an apparatus and method for generating risk gene mutation information for each disease using a time-varying covariate-based PRS model according to an embodiment of the present invention will be described with reference to the drawings.

Referring to FIG. 1, an apparatus 1 for generating risk gene mutation information for each disease using a time-varying covariate-based PRS model implemented according to the first embodiment of the present invention includes a genome data pre-processing unit 10, a checkup result data pre-processing unit 20, a multi-gene risk score calculation unit 30, a time series characteristic parameterization unit 40, a risk calculation unit 50, and an onset prediction information generation unit 60.

The genomic data pre-processing unit 10 receives genomic data of a plurality of persons or a plurality of prior documents, performs a plurality of analyses to generate a plurality of disease-inducing factor candidate lists, classifies the genetic mutations included in the plurality of disease-inducing factor candidate lists into a plurality of groups, and may divide the classified groups into a plurality of priority levels.

According to an embodiment of the present invention, the genome data pre-processing unit 10 may analyze genome data of a plurality of persons or a plurality of prior documents to generate a list of disease-inducing factor candidates for each analysis.

According to an embodiment of the present invention, the genomic data pre-processing unit 10 may classify the genetic mutations included in the disease-causing factor candidate list generated for each analysis into a plurality of groups, and may divide the classified groups into priority levels each including at least one group.

The genome data pre-processing unit 10 will be described in more detail with reference to FIG. 2.

The examination result data pre-processing unit 20 receives examination result data including examination results over time of a plurality of persons, or a plurality of disease-related data, performs a plurality of analyses to select at least one disease-related factor, and may generate a plurality of groups by grouping the plurality of persons, using a group trend model, based on changes in the individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor.

According to an embodiment of the present invention, the examination result data pre-processing unit 20 may receive examination result data including examination results of a plurality of persons over time, or a plurality of disease-related data, perform a plurality of analyses, and select at least one disease-related factor according to the result of each analysis.

According to an embodiment of the present invention, the examination result data pre-processing unit 20 may generate a plurality of groups by grouping the plurality of persons, using a group trend model, based on changes in the individual examination result values included in the examination result data of the plurality of persons for at least one disease-related factor. Here, the group trend model may refer to a method that clusters patterns of change over time, estimates the trajectory shape of each group, and verifies the number of groups so that the number of groups best fits the data.

The examination result data pre-processing unit 20 will be described in more detail with reference to FIG. 4.

The multi-gene risk score calculation unit 30 designs a PRS model for each group for the plurality of genetic mutations included in each of the plurality of groups classified by the genomic data pre-processing unit, uses the PRS model to calculate, as a weight, the association with respect to the number of risk alleles of the genetic variation in each group, and may thereby calculate the polygenic risk score for each genetic variation and the polygenic risk score for each group.

According to an embodiment of the present invention, the PRS model is designed as in Equation 1 and may be designed to calculate the association as a weight and to calculate the polygenic risk score for each genetic variation and the polygenic risk score for each group.

According to an embodiment of the present invention, the multigene risk score for each group may be calculated as the weighted sum of the numbers of risk alleles of the P gene mutations (SNPs) in the group, using as weights the associations with the target disease (phenotype) derived as a result of GWAS analysis for those risk allele counts.
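The weighted sum described above can be sketched in a few lines. This is an illustration only: the variant identifiers are placeholders rather than real SNPs, the weights are made-up numbers standing in for GWAS-derived associations, and risk-allele counts are assumed to be coded 0, 1, or 2 per variant.

```python
def polygenic_risk_score(weights, allele_counts):
    """Weighted sum of risk-allele counts for one person within one group."""
    # A variant missing from the person's genotype is treated as 0 risk alleles.
    return sum(w * allele_counts.get(snp, 0) for snp, w in weights.items())


# Hypothetical per-variant weights for one group of P = 3 variants.
group_weights = {"snp_a": 0.12, "snp_b": 0.30, "snp_c": -0.05}
# Hypothetical risk-allele counts for one person.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_risk_score(group_weights, person)
print(round(score, 4))
```

Computing one score per group simply means calling the function once per group with that group's weights.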

According to one embodiment of the present invention, the weights can be calculated through regression analysis between the genetic mutations and the target disease (phenotype). According to another embodiment, however, because there is linkage disequilibrium (LD) between genetic mutations (SNPs), calculating the weights through ordinary regression analysis causes statistical problems such as an increase in the variance of the weight estimates; the weights can therefore be estimated using a regularized regression method, and an estimation model based on the Lasso or Ridge method can be used for the regularized regression analysis.
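To make the regularization idea concrete, the sketch below estimates weights with Ridge regression in its closed form, solving (XᵀX + λI)w = Xᵀy with a small Gauss-Jordan elimination so that no third-party library is needed. It is an assumption-laden illustration: the genotype matrix and phenotype values are invented, no intercept is fitted, and Lasso (which has no closed form) is not shown.

```python
def ridge_weights(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y by Gauss-Jordan elimination."""
    n_features = len(X[0])
    # Build the augmented matrix [X^T X + lam * I | X^T y].
    A = []
    for i in range(n_features):
        row = [sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
               for j in range(n_features)]
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        A.append(row)
    for col in range(n_features):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n_features), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n_features):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n_features] for i in range(n_features)]


# Hypothetical risk-allele counts for two correlated variants (rows = persons).
X = [[0, 0], [1, 1], [2, 2], [1, 0], [2, 1], [0, 1]]
# Hypothetical continuous phenotype, generated as 1 * x1 + 2 * x2.
y = [0, 3, 6, 1, 4, 2]

ols = ridge_weights(X, y, lam=0.0)    # ordinary least squares
ridge = ridge_weights(X, y, lam=1.0)  # penalized estimate, shrunk toward zero
print(ols, ridge)
```

With `lam=0.0` the function reduces to ordinary least squares; increasing `lam` trades some bias for lower variance, which is the motivation the text gives for regularization when variants are in LD.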

The time series characteristic parameterization unit 40 applies the calculated polygenic risk score for each gene mutation as a covariate to the time-dependent correlation calculation model, inputs into the time-dependent correlation calculation model the individual examination result values for each disease-related factor included in the examination result data of the persons in each group generated by the examination result data pre-processing unit, and may select a time series characteristic variable by calculating the magnitude of the correlation, at each time point, between the independent variable and the dependent variable to be input into the COX model.

According to one embodiment of the present invention, the time-dependent association calculation model is a model that uses the time-dependent Cox model, an extended Cox model, to reflect variables whose values change over time in survival analysis. With it, the magnitude of the association between the explanatory variable (X) and the response variable (Y) can be calculated, defined as a time-series characteristic variable, and applied to the examination result data for disease-related factors.

According to an embodiment of the present invention, the time-dependent correlation calculation model may be expressed as in Equation 2 below in order to use time-dependent cox.

According to an embodiment of the present invention, an association index can be defined by calculating the association of a specific variable with the response variable (disease occurrence). In Equation 2, one part of the formula is the COX regression analysis formula that relates the plurality of variables not affected by the passage of time to the response variable, and the other part is the COX regression analysis formula that relates the plurality of variables affected by the passage of time to the response variable.
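For orientation only, the general textbook form of an extended Cox model with both kinds of covariates is sketched below. The symbols are assumptions introduced for this illustration and are not necessarily those of Equation 2: h_0(t) is the baseline hazard, X_i are the p_1 time-independent variables with coefficients β_i, and X_j(t) are the p_2 time-dependent variables with coefficients δ_j.

```latex
h\bigl(t, \mathbf{X}(t)\bigr)
  = h_0(t)\,
    \exp\!\left[
      \sum_{i=1}^{p_1} \beta_i X_i
      + \sum_{j=1}^{p_2} \delta_j X_j(t)
    \right]
```

In this form the first sum corresponds to the part of the formula for variables not affected by the passage of time, and the second sum to the part for variables that are.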

According to one embodiment of the present invention, the plurality of variables not affected by the passage of time can mean digitized values of variables that do not change over time, such as gender, genotype, history of disease up to the point of analysis, and multigene risk score, and the plurality of variables affected by the passage of time can mean digitized values of the items included in the examination result data other than gender, such as fasting blood glucose level, systolic and/or diastolic blood pressure, total cholesterol level and/or high-density lipoprotein (HDL) cholesterol level, low-density lipoprotein (LDL) cholesterol level, weight, and body mass index (BMI).

According to an embodiment of the present invention, to apply the calculated polygenic risk score for each genetic mutation as a covariate to the time-dependent association calculation model of Equation 2, a method can be used in which the polygenic risk score for each genetic variant and the association index between that polygenic risk score and the response variable are entered into the corresponding terms of the formula.

The risk calculation unit 50 applies the calculated at least one time-series characteristic variable to the examination result data for disease-related factors, and may perform COX regression analysis for each group on the examination result data for the disease-related factors of the plurality of persons to which the variable has been applied, to calculate the risk of disease occurrence for each group.

According to an embodiment of the present invention, the risk calculation unit 50 may apply the calculated at least one time series characteristic variable to the examination result data for the disease-related factor, and the application may be performed by multiplying the value of the disease-inducing factor at each time point of the examination result data for the disease-related factor by the time series characteristic variable calculated for that time point.

According to an embodiment of the present invention, the risk of disease occurrence for each group may be calculated by performing COX regression analysis for each group consisting of examination result data for disease-related factors to which time-series characteristic variables are applied.

According to an embodiment of the present invention, when COX regression analysis is performed by inputting the examination result data for disease-related factors to which time series characteristic variables are applied into a COX regression analysis model for each group, survival rate data for each time point can be calculated for each group.

According to an embodiment of the present invention, the reciprocal value of the survival rate data for each time point in each group calculated by performing COX regression analysis can be used to calculate the risk of disease occurrence for each group.

The onset prediction information generating unit 60 may generate onset prediction information by calculating a change in risk using a difference between the calculated risks of disease occurrence for each group.

According to an embodiment of the present invention, the onset prediction information generating unit 60 may generate onset prediction information by calculating a change in risk using a difference between the calculated risk of disease occurrence for each group.

According to an embodiment of the present invention, the onset prediction information generation unit 60 may compare the calculated risk of disease occurrence of each group at each time point to calculate the change in risk at a specific time point for each group, and may generate onset prediction information based on this change.

According to an embodiment of the present invention, the onset prediction information generation unit 60 may take the average of the risk changes calculated for each group at a specific time point as the risk change at that time point and, based on this, calculate the expected incidence rate at later time points to generate onset prediction information.
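One possible reading of the steps above (risk as the reciprocal of the survival rate, risk change as a between-group difference, and averaging per time point) is sketched below. The survival values are invented, the group labels are hypothetical, and the choice of a single reference group for the differences is an assumption made for this example rather than something the text specifies.

```python
from statistics import mean


def risk_from_survival(survival_by_time):
    """Risk of disease occurrence as the reciprocal of the survival rate."""
    return [1.0 / s for s in survival_by_time]


def mean_risk_change(risks_by_group, reference):
    """Average, per time point, of each group's risk minus the reference group's."""
    ref = risks_by_group[reference]
    others = [r for g, r in risks_by_group.items() if g != reference]
    return [mean(r[t] - ref[t] for r in others) for t in range(len(ref))]


# Hypothetical per-group survival rates at three time points.
survival = {
    "group_1": [1.0, 0.8, 0.5],
    "group_2": [1.0, 0.5, 0.4],
    "group_3": [1.0, 0.5, 0.25],
}
risks = {g: risk_from_survival(s) for g, s in survival.items()}
change = mean_risk_change(risks, reference="group_1")
print(change)
```

A survival rate of zero would make the reciprocal undefined, so a real implementation would need to decide how to handle that case.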

According to an embodiment of the present invention, the onset prediction information may represent, in the form of a graph, the probability or risk rate over time that a person having at least one disease-inducing factor develops the disease, and each graph may be classified by risk stage, such as a risk level and an intermediate level; however, any form may be used without limitation as long as it can indicate expected onset information over time.

Referring to FIG. 2, the genome data pre-processing unit 10 may include a disease-inducing factor selection unit 110, a disease-causing factor candidate list generation unit 120, a gene mutation group classification unit 130, and a priority class classification unit 140.

The disease-inducing factor screening unit 110 may perform a plurality of analyzes to select disease-causing factor candidates by receiving genome data of a plurality of persons or a plurality of prior documents.

Here, a disease-inducing factor candidate may mean a single nucleotide polymorphism (SNP) that is expected to be related to causing a specific disease.

According to an embodiment of the present invention, cohort data may be used as genome data for a plurality of persons, but genome information on a plurality of persons may be used without limitation if the data is implemented in the form of a data set.

Here, the cohort data may refer to data in which genome and health information about a specific population suspected of having a specific disease or having a specific disease is expressed in the form of a data set.

In addition, prior literature refers to literature whose subject is the relationship between a specific disease and a specific genetic mutation, so that a disease-inducing factor candidate for a specific disease can be selected from the large number of genetic mutations included in genome data. Papers generally fall into this category, but the prior literature is not limited to them; any literature whose research topic is the relationship between a specific disease and a specific genetic mutation can be used without limitation.

According to an embodiment of the present invention, the disease-inducing factor screening unit 110 may receive genome data of a plurality of persons or a plurality of prior literature and perform at least one of GWAS analysis, AI analysis, and meta-analysis on the target disease.

According to an embodiment of the present invention, GWAS analysis and AI analysis can be performed on genomic data, and meta-analysis can be performed on a plurality of prior literature.

Here, GWAS analysis refers to an analysis tool that discovers genetic mutations related to a specific disease by targeting genomic data. According to an embodiment of the present invention, when a gene mutation capable of causing a disease is searched through GWAS analysis, a disease-inducing factor candidate can be selected.

In addition, AI analysis may mean calculating an importance score for each genetic mutation by applying an artificial neural network-based disease-inducing factor prediction model to genome data, and selecting disease-inducing factor candidates from among the genetic mutations according to the importance score of each genetic mutation.

Finally, meta-analysis may mean creating a data set from the analysis information of each prior document, based on information collected by crawling the text of the prior literature, and then, using that data set, calculating the effect size corresponding to the subject of each genetic variation, that is, the magnitude of the effect of the genetic mutation on a specific disease, and measuring a target disease influence score from the effect size in order to select disease-inducing factor candidates.
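As one common way to combine per-document effect sizes, the sketch below pools odds ratios with a fixed-effect, inverse-variance method on the log scale. This is a standard technique swapped in for illustration, not necessarily the calculation used in the invention, and the study values are invented; each study is assumed to report an odds ratio with a 95% confidence interval.

```python
import math


def pooled_odds_ratio(studies):
    """Fixed-effect, inverse-variance pooled odds ratio.

    Each study is (odds_ratio, ci_lower, ci_upper) with a 95% confidence interval.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, ci_lower, ci_upper in studies:
        log_or = math.log(odds_ratio)
        # Standard error recovered from the width of the 95% CI on the log scale.
        se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
        weight = 1.0 / se ** 2
        weighted_sum += weight * log_or
        weight_total += weight
    return math.exp(weighted_sum / weight_total)


# Hypothetical odds ratios for one variant, extracted from three prior documents.
studies = [(1.40, 1.10, 1.78), (1.25, 1.05, 1.49), (1.60, 0.95, 2.69)]
pooled = pooled_odds_ratio(studies)
print(round(pooled, 3))
```

Studies with narrower confidence intervals receive larger weights, so the pooled value sits closest to the most precise estimates.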

The disease-causing factor candidate list generation unit 120 may generate a plurality of disease-causing factor candidate lists including a plurality of gene mutations selected as disease-causing factor candidates for each of a plurality of analyses.

According to an embodiment of the present invention, the disease-inducing factor candidate list generation unit 120 may group the genetic mutations selected as disease-inducing factor candidates through at least one of GWAS analysis, AI analysis, and meta-analysis into a list for each analysis, and thereby generate a disease-inducing factor candidate list for each analysis result.

The gene mutation group classification unit 130 may classify the genetic mutations included in the plurality of disease-inducing factor candidate lists into a plurality of groups according to the degree of overlap among the gene mutations included in the generated plurality of disease-causing factor candidate lists.

According to an embodiment of the present invention, genetic mutations may be classified into a plurality of groups by determining whether they fall in an intersection of the lists according to the degree of overlap among genetic mutations, which will be described in more detail with reference to FIG. 11.

According to an embodiment of the present invention, the genetic variants included in the three disease-causing factor candidate lists generated by performing GWAS analysis, AI analysis, and meta-analysis can be classified into nine groups according to the degree of overlap with each other.

According to an embodiment of the present invention, the genetic mutation group classification unit 130 may determine whether each genetic mutation included in the three disease-causing factor candidate lists falls in an intersection between lists and, if so, in how many lists it appears, and may classify the genetic mutations into groups accordingly.

According to an embodiment of the present invention, the gene mutation group classification unit 130 may classify the nine groups into priority levels 1, 2, and 3, with one group in level 1, four groups in level 2, and four groups in level 3.

According to one embodiment of the present invention, among the nine groups, a group formed by genetic mutations included in all three disease-causing factor candidate lists may be classified as level 1, a group formed by genetic mutations included in two of the three disease-causing factor candidate lists may be classified as level 2, and a group formed by genetic mutations included in only one of the three disease-causing factor candidate lists may be classified as level 3.

Classification of the nine groups into priority levels 1, 2, and 3 will be described in more detail with reference to FIG. 12.
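The level rule stated above (all three lists, two lists, one list) can be sketched by counting how many candidate lists contain each variant. This covers only the priority-level assignment, not the nine-group partition shown in the drawings; the list contents and variant names are hypothetical.

```python
def assign_priority_levels(candidate_lists):
    """Map each variant to a level: 1 = in all lists, 2 = in all but one, and so on."""
    counts = {}
    for variants in candidate_lists.values():
        # set() guards against a variant being repeated within one list.
        for variant in set(variants):
            counts[variant] = counts.get(variant, 0) + 1
    n_lists = len(candidate_lists)
    return {variant: n_lists - count + 1 for variant, count in counts.items()}


# Hypothetical candidate lists from the three analyses.
candidate_lists = {
    "gwas": ["snp_a", "snp_b", "snp_c"],
    "ai": ["snp_a", "snp_b", "snp_d"],
    "meta": ["snp_a", "snp_e"],
}
levels = assign_priority_levels(candidate_lists)
print(levels)
```

Because each variant receives exactly one level, a variant cannot appear in more than one per-level list built from this mapping.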

The priority level classification unit 140 divides the classified groups into a plurality of priority levels, removes overlapping genetic mutations from among the plurality of genetic variations included in each priority level, leaving only one genetic variation list, and lists the genetic variation according to the plurality of levels. can create

According to an embodiment of the present invention, the plurality of classified groups are divided into a plurality of priority levels, and overlapping genetic mutations among the genetic mutations included in each priority level are removed so that each genetic mutation is listed only once, whereby a list of genetic mutations for each priority level can be created.

According to one embodiment of the present invention, because the same genetic variant may be included in more than one of the three disease-inducing factor candidate lists, the plurality of groups are divided into a plurality of priority levels, and if overlapping genetic variants exist among the genetic variants included in a priority level, they would be counted more than once when the levels are ranked. A list of genetic variants by level can therefore be generated by removing the duplicates and leaving only one entry for each genetic variant.
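The level assignment described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `grade_variants` and the SNP identifiers are hypothetical, and the sketch assigns a level purely from the number of candidate lists that contain a variant, without reproducing the nine-group breakdown of FIG. 12.

```python
from collections import Counter

def grade_variants(gwas, ai, meta):
    """Assign each variant a priority level from the number of lists containing it."""
    counts = Counter()
    for candidate_list in (gwas, ai, meta):
        # Convert to a set so a variant repeated within one list counts once.
        counts.update(set(candidate_list))
    # 3 lists -> level 1, 2 lists -> level 2, 1 list -> level 3.
    grades = {1: [], 2: [], 3: []}
    for variant, n_lists in sorted(counts.items()):
        grades[4 - n_lists].append(variant)
    return grades

# Hypothetical SNP identifiers, for illustration only.
example = grade_variants(
    gwas=["rs1", "rs2", "rs3", "rs3"],
    ai=["rs1", "rs2", "rs4"],
    meta=["rs1", "rs5"],
)
```

Because the level is derived from a per-variant count, each variant appears exactly once in the result, which corresponds to the removal of overlapping entries.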

Referring to FIG. 3, the detailed configuration of the disease-causing factor selection unit 110 disclosed in FIG. 1 is shown. The disease-causing factor selection unit 110 may include at least one of a GWAS analysis unit 111, an AI analysis unit 112, and a meta-analysis unit 113, and according to an embodiment of the present invention, it may include all of the GWAS analysis unit 111, the AI analysis unit 112, and the meta-analysis unit 113.

The GWAS analysis unit 111 receives genomic data for a large number of people, performs genome-wide association analysis on target diseases, compares the P value calculated for each genetic mutation as a result with a preset threshold, and can select a plurality of genetic mutations whose P value is at or below the threshold as disease-causing factor candidates.

According to an embodiment of the present invention, a Manhattan plot can be used to select genetic mutations as disease-causing factor candidates based on the P value calculated for each genetic mutation as a result of the genome-wide association analysis; this is described in more detail with reference to FIG. 7.

According to an embodiment of the present invention, the GWAS analysis unit 111 determines, for the plurality of genetic mutations selected as disease-causing factor candidates, whether the locations of the genetic mutations are in linkage disequilibrium, and, according to the result, can select only one representative genetic mutation for each locus to generate the final disease-inducing factor candidates.

According to an embodiment of the present invention, in order to select only one representative genetic mutation for each locus, the GWAS analysis unit 111 can perform LD clumping on the plurality of genetic mutations selected as disease-causing factor candidates. As the selection criterion, the genetic variants are ranked by the importance score calculated for each genetic variant, and the top-ranked genetic variant is selected.
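A greedy form of LD clumping can be sketched as below. The function `ld_clump`, the pairwise r² lookup, and the threshold of 0.5 are assumptions made for illustration; the source does not specify how linkage disequilibrium is measured or which cutoff is used.

```python
def ld_clump(scores, r2, r2_threshold=0.5):
    """Greedy clumping: keep the top-scoring variant of each LD cluster.

    scores: dict mapping variant ID -> importance score (higher ranks first).
    r2: dict mapping frozenset({a, b}) -> pairwise LD r^2 (missing pairs = 0).
    r2_threshold: hypothetical cutoff above which two variants share a locus.
    """
    remaining = sorted(scores, key=lambda v: scores[v], reverse=True)
    representatives = []
    while remaining:
        index_variant = remaining.pop(0)  # best-ranked variant not yet assigned
        representatives.append(index_variant)
        # Variants in LD with the index variant are represented by it.
        remaining = [
            v for v in remaining
            if r2.get(frozenset((index_variant, v)), 0.0) < r2_threshold
        ]
    return representatives
```

For example, with scores `{"rs1": 0.9, "rs2": 0.8, "rs3": 0.4}` and an r² of 0.9 between rs1 and rs2, the sketch keeps rs1 as the representative of the first locus and rs3 as the representative of the second.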

Here, the importance score for each genetic variation may mean a quantified value calculated to identify the features, that is, the genetic variations, that have the most influence on predictive power.

According to an embodiment of the present invention, the GWAS analysis unit 111 may perform genome-wide association analysis to generate result data in the form of a data table with a plurality of field values as items, including chromosome ID and SNP ID. The result data may include the P value calculated for each genetic mutation, and will be described in more detail with reference to FIG. 8.

The AI analysis unit 112 inputs genome data for a plurality of persons labeled with diseases into an artificial neural network-based disease-inducing factor prediction model, outputs an importance score for each genetic mutation, and may select, as disease-inducing factor candidates, a plurality of genetic mutations whose importance score exceeds a preset score.

According to an embodiment of the present invention, genomic data of a plurality of persons labeled with a disease, which is input to an artificial neural network-based disease-causing factor prediction model, may include a genetic mutation identification code, covariate information, and target disease information.

The format of genomic data for a plurality of persons labeled with a disease will be described in more detail with reference to FIG. 9 .

According to an embodiment of the present invention, in order to address the black box problem, in which the causal relationship between input values and output values is difficult to understand, the artificial neural network-based disease-inducing factor prediction model used to select disease-causing factor candidates from multiple genetic mutations may use a tree-based machine learning algorithm, and may obtain an importance score for each genetic mutation through an XAI (Explainable AI) technique.

According to an embodiment of the present invention, the artificial neural network-based disease-causing factor prediction model receives the genetic mutation identification codes, covariate information, and target disease information included in genome data for a plurality of individuals, and can be trained to output an importance score for each genetic mutation for the target disease.

According to an embodiment of the present invention, an importance score for each gene mutation for a target disease may be calculated through a formula such as Equation 3.

Here, the trained disease-inducing factor prediction model is $m$, the data set of genomic data for a large number of people labeled with the disease is $D$, the score of the disease-inducing factor prediction model $m$ on data set $D$ is $s$, the number of random shuffles is $K$, the data obtained by randomly shuffling genetic variation $j$ of data set $D$ in the $k$-th shuffle is $\tilde{D}_{k,j}$, and the score of the disease-inducing factor prediction model $m$ on $\tilde{D}_{k,j}$ is $s_{k,j}$. The importance score $i_j$ for genetic variation $j$ can then be calculated using Equation 3:

$$i_j = s - \frac{1}{K}\sum_{k=1}^{K} s_{k,j}$$

According to an embodiment of the present invention, the AI analysis unit 112 may randomly shuffle the order of the values of each genetic mutation, create a model in which the genetic mutation whose importance is to be determined is defined as noise, and quantify the dependence of the model on that genetic mutation.

According to an embodiment of the present invention, the permutation feature importance technique can be used to create a model in which the genetic variant whose importance is to be determined is treated as noise and to quantify the dependence of the model on that genetic variant. Permutation feature importance is an explainable AI technique suited to data sets in the form of a data table: the order of the values of the feature (genetic mutation) whose importance is to be determined is randomly shuffled so that the feature becomes noise, and the degree to which the model depends on that feature is then quantified.
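The permutation feature importance procedure can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: accuracy is used as the score, and the toy `model`, `rows`, and `labels` are hypothetical stand-ins for the trained prediction model and the labeled genome data.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Importance of column j = baseline score minus mean score after shuffling j."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        shuffled_scores = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between column j and the label
            corrupted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            shuffled_scores.append(accuracy(model, corrupted, labels))
        importances.append(baseline - sum(shuffled_scores) / n_repeats)
    return importances

# Toy data: column 0 determines the label, column 1 is ignored by the model.
rows = [[i % 2, (i // 2) % 3] for i in range(40)]
labels = [r[0] for r in rows]
model = lambda r: r[0]  # hypothetical stand-in for the trained model
importances = permutation_importance(model, rows, labels)
```

A feature the model ignores receives an importance of zero, because shuffling it leaves the score unchanged, while a feature the model relies on receives a positive importance.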

The meta-analysis unit 113 inputs a plurality of prior literature on the topic of genetic variation in the target disease into a meta-analysis model, calculates, for each prior document, the effect size corresponding to that topic, applies the reciprocal of the variance of the calculated effect size as a weight to the effect size of each prior document to measure a target disease influence score for each genetic variant, and can select a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic variant.

According to an embodiment of the present invention, the meta-analysis unit 113 calculates an odds ratio and a confidence interval for each prior document in order to determine the effect size corresponding to the topic of the genetic mutation, and, based on the odds ratio and confidence interval, can estimate the effect size of the genetic mutation on the target disease for each prior document.

According to an embodiment of the present invention, the meta-analysis unit 113 analyzes a plurality of prior literature by systematically reviewing prior literature on the same topic, that is, the effect of a specific genetic mutation on a specific disease, and by extracting and using the results (effect sizes) corresponding to the topic from the finally selected literature.

According to an embodiment of the present invention, there may be various methods for extracting the effect size, and the type of effect size to be extracted differs depending on the topic. Effect sizes based on the standardized mean difference, effect sizes based on correlation coefficients, and effect sizes based on odds ratios can be used.

According to one embodiment of the present invention, in order to calculate the effect size based on the odds ratio, the effect size can be estimated from the odds ratio (OR), which is an index of the size of the effect (size of association) of each genetic mutation on the disease, and its 95% confidence interval (CI). The odds ratios of the individual prior documents can be combined to calculate the overall effect size (overall OR).

Estimation of the effect size of a genetic mutation on a target disease for each prior document, based on an odds ratio and a confidence interval according to an embodiment of the present invention, will be described in more detail with reference to FIG. 10.

According to an embodiment of the present invention, a generic inverse variance estimation method may be used to measure the target disease influence score for each genetic mutation using the calculated effect size.

The inverse variance estimation method is a method used to give weight in meta-analysis, and the reciprocal of the variance of the estimated effect size can be used as the weight of individual prior literature.

According to an embodiment of the present invention using the inverse variance estimation method, prior literature on studies with a large sample size has a small variance, so the reciprocal of the variance is large; the method can therefore be used to give a higher weight to prior literature on studies with a large sample size.

According to the above embodiment, the natural logarithm of the odds ratio of each prior document, $\ln(OR_i)$, is calculated, the standard error $SE_i$ of $\ln(OR_i)$ is calculated, and the reciprocal of the square of the calculated standard error is used as the weight, $w_i = 1/SE_i^2$. The overall effect size ($OR_{pooled}$) can then be calculated, as shown in Equation 4, by summing the values obtained by multiplying the odds ratio by the weight calculated for each prior document.
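The inverse-variance weighting described above can be sketched as follows. The sketch assumes the standard fixed-effect form of the generic inverse-variance method, in which the weighted sum of the log odds ratios is divided by the sum of the weights and then exponentiated; whether Equation 4 applies exactly this normalization is an assumption. Recovering the standard error from the width of the 95% confidence interval (z = 1.96) is likewise an assumption, and `pooled_odds_ratio` is a hypothetical name.

```python
import math

def pooled_odds_ratio(studies, z=1.96):
    """Fixed-effect generic inverse-variance pooling of odds ratios.

    studies: iterable of (odds_ratio, ci_lower, ci_upper) with 95% CIs.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, lower, upper in studies:
        log_or = math.log(odds_ratio)
        # SE of ln(OR) recovered from the CI width on the log scale.
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        weight = 1.0 / se ** 2  # reciprocal of the variance
        weighted_sum += weight * log_or
        weight_total += weight
    return math.exp(weighted_sum / weight_total)
```

A study with a narrow confidence interval, typically one with a large sample, has a small standard error and therefore pulls the pooled estimate toward its own odds ratio.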

Referring to FIG. 4, the checkup result data preprocessing unit 20 may include a correlation analysis performing unit 210, a disease-related factor selection unit 220, a preprocessing unit 230, and a data group classification unit 240.

The correlation analysis performing unit 210 may perform a plurality of analyzes to select disease-related factor candidates by receiving examination result data including examination results of a plurality of persons over time or a plurality of disease-related data.

Here, the checkup result data may refer to data including a plurality of health checkup results for a plurality of persons by storing results of each item of a health checkup performed by a specific person at least once in the form of a data set.

According to an embodiment of the present invention, the health checkup items included in the checkup result data include fasting blood sugar level, systolic blood pressure and/or diastolic blood pressure, total cholesterol level and/or high density cholesterol level (HDL), low density cholesterol level (LDL), Weight, body mass index (BMI), and the like may be included.

Here, factors that induce the onset of a target disease are defined as disease-related factors, and the disease-related factor candidates may mean the plurality of factors selected as a candidate group from which disease-related factors can be chosen.

According to an embodiment of the present invention, the factor inducing the onset of the target disease may be a health checkup item included in the checkup result data, or may be a specific factor generated by processing a health checkup item included in the checkup result data once or through a plurality of steps.

According to an embodiment of the present invention, disease-related data may refer to text-based data including the results of a study on the relationship between a target disease and a specific factor, or the results of statistical analysis on a large number of people. In general, it may be medical papers, statistical data, and the like, but it is not limited thereto, and any text-based data on the correlation between target diseases and specific factors may be used without limitation.

According to an embodiment of the present invention, the correlation analysis performing unit 210 receives examination result data including examination results of a plurality of persons over time, or a plurality of disease-related data, and may perform at least one of disease correlation analysis, big data analysis, and meta-analysis with respect to a target disease.

According to an embodiment of the present invention, the correlation analysis performing unit 210 receives examination result data or a plurality of disease-related data and performs disease correlation analysis, big data analysis, and meta-analysis, so that three sets of disease-related factor candidates can be selected, one for each analysis result.

The disease-related factor selector 220 may select at least one disease-related factor according to an overlapping degree among a plurality of disease-related factors selected as disease-related factor candidates for each analysis.

According to an embodiment of the present invention, examination result data or a plurality of disease-related data are received and at least one of disease correlation analysis, big data analysis, and meta-analysis is performed; from the two or more sets of disease-related factor candidates generated according to each analysis result, only the factors commonly included in at least two sets may be selected as disease-related factors to generate a list of disease-related factors.

According to an embodiment of the present invention, the disease-related factor selector 220 compares the plurality of sets of disease-related factor candidates generated by performing at least one of disease correlation analysis, big data analysis, and meta-analysis, and may select as disease-related factors only those factors included in all of the sets.

According to the above embodiment, examination result data or a plurality of disease-related data are received, disease correlation analysis, big data analysis, and meta-analysis are performed, and a list of disease-related factors may be generated by selecting only the factors included in all three sets of disease-related factor candidates produced by the respective analyses.

According to the above embodiment, only factors included in all three sets of disease-related factor candidates generated by performing disease correlation analysis, big data analysis, and meta-analysis are selected as disease-related factors in order to conservatively select, from among numerous factors, those with a relatively high influence on the induction of the disease, thereby reducing the computational resources and time required for analysis and increasing accuracy.
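The two selection rules described above (factors common to at least two candidate sets, or to all of them) can be sketched with a single helper. The function name `select_disease_factors` and the factor names in the usage note are hypothetical.

```python
def select_disease_factors(candidate_lists, min_lists=None):
    """Keep factors that appear in at least `min_lists` of the candidate lists.

    With min_lists=None, a factor must appear in every list (conservative rule).
    """
    required = len(candidate_lists) if min_lists is None else min_lists
    candidate_sets = [set(c) for c in candidate_lists]
    all_factors = set().union(*candidate_sets)
    return sorted(
        f for f in all_factors
        if sum(f in s for s in candidate_sets) >= required
    )
```

For instance, given the candidate lists `["BMI", "fasting glucose", "systolic BP"]`, `["BMI", "systolic BP", "LDL"]`, and `["BMI", "LDL", "systolic BP"]`, the conservative rule keeps BMI and systolic BP, while `min_lists=2` additionally keeps LDL.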

The preprocessing unit 230 may process, according to preset preprocessing criteria, the data for those disease-related factors that require secondary processing among the examination result data of a plurality of persons for the selected at least one disease-related factor.

According to an embodiment of the present invention, the preset preprocessing criteria may include the criteria listed below, but are not limited thereto; any criterion may be used without limitation as long as it processes the result of each health checkup item included in the checkup result data into a factor that clarifies a disease-related factor.

According to an embodiment of the present invention, when a disease-related factor included in the selected at least one disease-related factor is classified, according to the preset preprocessing criteria, as one whose individual checkup result values cannot be used directly as a tendency criterion or a judgment criterion, the preprocessing unit 230 may calculate or reprocess the checkup result data according to the preset preprocessing criteria so that it can be used as a tendency criterion or a judgment criterion, and may thereby generate time-series checkup data for each checkup target period.

According to an embodiment of the present invention, the preprocessing criteria may be as follows, but are not limited thereto, and may be used without limitation as long as they are preprocessed to be used as tendency criteria or judgment criteria.

[Pre-processing standard]

(1) Preprocessing into diabetes status data using fasting blood glucose

: fasting blood glucose < 100 (normal)

100 ≤ fasting blood glucose < 126 (impaired fasting glucose)

126 ≤ fasting blood glucose (diabetes)

(2) Pre-processing of hypertension data using systolic or diastolic blood pressure

: Systolic blood pressure < 120 or diastolic blood pressure < 80 (normal)

120 ≤ systolic blood pressure < 140 or 80 ≤ diastolic blood pressure < 90 (prehypertension)

140 ≤ systolic blood pressure < 160 or 90 ≤ diastolic blood pressure < 100 (stage 1 hypertension)

160 ≤ systolic blood pressure or 100 ≤ diastolic blood pressure (stage 2 hypertension)

(3) Preprocessing with dyslipidemia data using total cholesterol or LDL

(4) Pre-processing with obesity data using BMI

: BMI < 18.5 (underweight)

18.5 ≤ BMI < 25.0 (normal)

25.0 ≤ BMI < 30.0 (overweight)

30.0 ≤ BMI (obesity), which is further subdivided as follows:

30.0 ≤ BMI < 35.0 (moderately obese)

35.0 ≤ BMI < 40.0 (severely obese)

40.0 ≤ BMI (extremely obese)
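The preprocessing standard above can be sketched as simple threshold functions. Two points are assumptions rather than part of the source: a fasting blood glucose of exactly 126 is assigned to the diabetes category, and when systolic and diastolic readings fall into different blood pressure categories, the more severe category is used. The function names are hypothetical, and criterion (3) is omitted because no thresholds are given for it.

```python
def diabetes_status(fasting_glucose):
    """Criterion (1): category from fasting blood glucose."""
    if fasting_glucose < 100:
        return "normal"
    if fasting_glucose < 126:
        return "impaired fasting glucose"
    return "diabetes"  # boundary value 126 placed here (assumption)

def hypertension_status(systolic, diastolic):
    """Criterion (2): the more severe of the two readings decides (assumption)."""
    if systolic >= 160 or diastolic >= 100:
        return "stage 2 hypertension"
    if systolic >= 140 or diastolic >= 90:
        return "stage 1 hypertension"
    if systolic >= 120 or diastolic >= 80:
        return "prehypertension"
    return "normal"

def obesity_status(bmi):
    """Criterion (4): category from BMI, using the obesity subdivisions."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    if bmi < 35.0:
        return "moderately obese"
    if bmi < 40.0:
        return "severely obese"
    return "extremely obese"
```

Applying these functions to each checkup date of a person turns raw measurements into categorical status values that can be arranged in time series.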

According to an embodiment of the present invention, the preprocessing unit 230 collects, from the checkup result data, individual checkup result values for the disease-related factors included in the selected at least one disease-related factor, and can generate time-series checkup data for each checkup target period by performing preprocessing that arranges the collected values in time series.

According to an embodiment of the present invention, in the process of generating time-series checkup data for each checkup target period by arranging the collected individual checkup result values in time series, if a missing value exists in the time-series checkup data, preprocessing may be performed to remove the missing value. According to another embodiment, preprocessing may be performed by estimating the missing value using a statistical imputation method and filling the missing item with the estimated value. According to yet another embodiment, preprocessing that compensates for missing values may be performed using an artificial neural network-based machine learning technique.
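The first two missing-value strategies can be sketched as follows. Mean imputation is used here only as one simple example of a statistical imputation method; the source does not say which method is applied, and the function names are hypothetical.

```python
def drop_missing(series):
    """Remove time points whose checkup value is missing (None)."""
    return [(t, v) for t, v in series if v is not None]

def impute_mean(series):
    """Fill missing values with the mean of the observed values."""
    observed = [v for _, v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [(t, mean if v is None else v) for t, v in series]

# Hypothetical time series of (checkup year, fasting blood glucose).
series = [(2019, 90.0), (2020, None), (2021, 110.0)]
cleaned = drop_missing(series)
imputed = impute_mean(series)
```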

According to an embodiment of the present invention, the preset preprocessing criteria of the preprocessing unit 230 may include information on the types of disease-related factors that cannot produce result values when individual checkup result values are input into the group trend model without preprocessing, and information on the preprocessing method for those disease-related factors.

Here, the group trend model may refer to a method of classifying types of behavior over time into clusters and estimating the trajectory shape of each group in order to determine the number of groups that best fits the data.

The data group classification unit 240 uses a group trend model to group the plurality of persons based on changes over time in the individual checkup result values included in their checkup result data for at least one disease-related factor, thereby creating a plurality of groups.

According to an embodiment of the present invention, the data group classification unit 240 inputs the individual checkup result values included in the checkup result data of a plurality of persons for any one disease-related factor into a group trend model, calculates the probability that each individual observation belongs to each group, and assumes and estimates a distribution that differs by time point according to the properties of the probability density function of the dependent variable. In this way, as shown in Equation 5, a plurality of groups can be created, each containing the persons who share a similar pattern of change in individual checkup result values.

According to an embodiment of the present invention, as shown in Equation 5, the probability density function of the dependent variable can be expressed as the sum of the product of the probability of belonging to a specific group and the probability density function of the dependent variable of specific group members, and the dependent variable of the specific group member Since the variables have mutual independence at each time point, the probability density function of the dependent variable can be calculated as a product of the corresponding probability density function at each time point.
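The relationship described in the preceding paragraph may be written, under assumed notation, in the following form. Here $Y_i = (y_{i1}, \dots, y_{iT})$ denotes the checkup values of person $i$ over $T$ time points, $\pi_j$ the probability of belonging to group $j$ among $J$ groups, and $p^{j}$ the probability density of a single observation for members of group $j$. These symbols are assumptions introduced for illustration and may differ from those of Equation 5.

```latex
P(Y_i) = \sum_{j=1}^{J} \pi_j \, P^{j}(Y_i),
\qquad
P^{j}(Y_i) = \prod_{t=1}^{T} p^{j}(y_{it})
```

The first expression is the sum over groups of the membership probability times the group-specific density; the second uses the independence of the dependent variable across time points within a group.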

According to an embodiment of the present invention, the data group classification unit 240 estimates, for each group, the trajectory shape of the individual examination result values for each disease-related factor included in the examination result data of the persons in that group, and can verify the classification suitability of the classified groups by contrasting the differences in trajectory shape between groups.

According to an embodiment of the present invention, the maximum likelihood estimation method can be used to jointly estimate the trajectory of each group and the proportion of cases in each group using the group trend model. The final model, that is, the model that best describes the individual event trajectories shown in the data, can be selected based on the Bayesian Information Criterion (BIC); the lower the BIC value, the better the model can be evaluated as describing the individual event trajectories shown in the data.
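BIC-based selection of the number of groups can be sketched as follows, assuming the common definition BIC = k·ln(n) − 2·ln(L), for which lower values are better. The log-likelihoods and parameter counts in the usage note are hypothetical, and the function names are illustrative.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = k * ln(n) - 2 * ln(L); lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_n_groups(fits, n_obs):
    """fits: dict mapping number of groups -> (log_likelihood, n_params)."""
    return min(fits, key=lambda g: bic(fits[g][0], fits[g][1], n_obs))
```

For example, with hypothetical fits `{2: (-1200.0, 6), 3: (-1150.0, 9), 4: (-1148.0, 12)}` on 500 persons, the three-group model has the lowest BIC: moving to four groups improves the likelihood only slightly, which does not offset the penalty for the extra parameters.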

FIG. 5 is a detailed configuration diagram of a correlation analysis performing unit shown in FIG. 4 .

Referring to FIG. 5, the correlation analysis performing unit 210 may include a disease correlation analysis unit 211, a big data analysis unit 212, and a meta-analysis unit 213.

The disease correlation analysis unit 211 analyzes the correlation of a plurality of disease-related factors with respect to the possibility of onset of a target disease targeting the examination result data including the examination results over time of a number of persons, and determines that the correlation is high. The derived disease-related factors may be selected as disease-related factor candidates.

According to an embodiment of the present invention, the disease association analysis unit 211 inputs examination result data including examination results of a plurality of persons over time into a disease correlation analysis model to determine a plurality of diseases related to the possibility of onset of a target disease. Correlation analysis of related factors can be performed.

According to an embodiment of the present invention, the disease association analysis model can be implemented as a deep learning model based on an artificial neural network and, upon receiving examination result data including examination results over time of a large number of people, can be trained to derive at least one checkup result item whose correlation with the disease is relatively high.

According to another embodiment of the present invention, the disease association analysis model may be a model that performs correlation analysis on examination result data including examination results over time of a plurality of persons received as input, and through this, correlation with disease By deriving at least one relatively high examination result item, correlation analysis of a plurality of disease-related factors may be performed.

The big data analysis unit 212 collects a plurality of data by crawling a database in which text-based disease-related data is stored, and can select disease-related factor candidates by performing text mining on the collected data.

According to an embodiment of the present invention, the big data analysis unit 212 crawls text-based data from databases such as NCBI DB, OMIM, Diseases Card, and open DBs to collect each disease name, related items, cause information, and the like, and can analyze the correlation between a target disease and a plurality of disease-related factors by selecting and deriving significant related items through text mining of the collected data.

The meta-analysis unit 213 inputs a plurality of disease-related data on the topic of the effect of disease-related factors on the target disease into a meta-analysis model, calculates the effect size of each disease-related factor for each of the plurality of disease-related data, and can select disease-related factor candidates according to the effect size.

According to one embodiment of the present invention, disease-related factors refer to factors that can affect the occurrence of a specific disease, and may include the presence or absence of other diseases and whether or not the result of a health checkup is within a predetermined range, but are not limited thereto; any factor that can affect the development of a disease can be used without limitation.

According to an embodiment of the present invention, meta-analysis generates a data set based on analysis information for each disease-related data by inputting a plurality of disease-related data into a meta-analysis model, and targeting the data set to match the subject of the disease-related factor. It may mean calculating an effect size, that is, a size that a corresponding disease-related factor affects a specific disease, and using the effect size to measure a target disease influence score, thereby selecting a disease-related factor candidate.

According to an embodiment of the present invention, in order to calculate the effect size based on the odds ratio, the effect size can be estimated from the odds ratio (OR), which is an index of the size of the effect (size of correlation) of each disease-related factor on the disease, and its 95% confidence interval (CI). The overall OR can be calculated by combining the odds ratios of the individual disease-related data.

According to an embodiment of the present invention, a generic inverse variance estimation method may be used to measure a target disease influence score for each disease-related factor using the calculated effect size.

Inverse variance estimation is a method used to give weight in meta-analysis, and the reciprocal of the variance of the estimated effect size can be used as a weight for individual disease-related data.

According to an embodiment of the present invention using the inverse variance estimation method, disease-related data for a study with a large sample has a small variance, so the reciprocal of the variance is large; the method can therefore be used to give a higher weight to disease-related data for a study with a large sample.

According to the above embodiment, the natural logarithm of the odds ratio of each disease-related data, $\ln(OR_i)$, is calculated, and the overall effect size ($OR_{pooled}$) can be calculated by summing the values obtained by multiplying the odds ratio by the weight calculated for each disease-related data, as in Equation 4.

FIG. 6 is a block diagram of an apparatus for generating risk gene mutation information for each disease using a PRS model based on time-varying covariates implemented according to a second embodiment of the present invention.

Referring to FIG. 6 , an apparatus for generating disease onset information based on time-dependent association using multi-gene risk scores implemented according to the second embodiment of the present invention includes a genome data pre-processing unit 10, a checkup result data pre-processing unit 20, It may further include a multi-gene risk score calculation unit 30, a time series characteristic parameterization unit 40, a risk calculation unit 50, an onset prediction information generation unit 60, and a PRS model verification unit 70.

The PRS model verification unit 70 may determine whether to use or redesign the time-varying PRS model by performing verification of the time-varying PRS model according to whether the PRS model is for a continuous target disease or a discrete target disease.

According to an embodiment of the present invention, the use or redesign of the PRS model may be determined by performing verification of the PRS model according to whether the PRS model is for a continuous target disease or a discrete target disease.

Evaluation of the PRS model can be largely divided into two cases: the case where the phenotype is continuous, such as height, weight, and BMI, and the case where the phenotype is discrete, such as the presence of a disease, as in the present invention.

According to an embodiment of the present invention, in order for the PRS model verification unit 70 to verify the time-varying PRS model for a discrete target disease, an ROC curve may be used, and the AUC value of the ROC curve may be calculated to verify whether the corresponding PRS model is adequate.

According to an embodiment of the present invention, in the case of a discrete type, an ROC curve can be generated using the PRS estimate and the performance of the model can be evaluated using the AUC with respect to the disease (phenotype); the higher the AUC, the better the performance of the model can be evaluated to be.
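The AUC of the ROC curve can be computed without drawing the curve, using its equivalence to the probability that a randomly chosen case has a higher score than a randomly chosen control. The sketch below takes PRS estimates as `scores` and disease status as `labels` (1 for case, 0 for control); the function name and the example values are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in controls
    )
    return wins / (len(cases) * len(controls))
```

With scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, three of the four case/control pairs are ranked correctly, giving an AUC of 0.75; a model that separates cases from controls perfectly gives 1.0.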

Referring to FIG. 7, a Manhattan plot generated as a result of GWAS analysis according to an embodiment of the present invention is shown. The Manhattan plot means a bar-shaped graph created by performing association analysis, through a linear regression model, logistic regression model, or mixed model, between a target disease as the dependent variable and a plurality of gene mutations included in genome data; the X axis can represent individual gene mutations, and the Y axis can represent the P value for each gene mutation calculated through GWAS analysis.

According to an embodiment of the present invention, the threshold may be set to 5.0×10^-8, and the genetic variants displayed on the Manhattan plot whose P value is 5.0×10^-8 or less can be selected as disease-causing factor candidates.
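The threshold-based selection can be sketched as a simple filter over per-variant P values. The function name and the SNP identifiers in the usage note are hypothetical; the threshold value is the one stated above.

```python
GENOME_WIDE_THRESHOLD = 5.0e-8  # threshold stated in the embodiment

def select_candidates(p_values, threshold=GENOME_WIDE_THRESHOLD):
    """Return the variants whose P value is at or below the threshold."""
    return sorted(snp for snp, p in p_values.items() if p <= threshold)
```

For example, `select_candidates({"rs1": 3e-9, "rs2": 5e-8, "rs3": 1e-5})` keeps rs1 and rs2 and discards rs3.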

Referring to FIG. 8, the data table format of the result data generated as a result of GWAS analysis according to an embodiment of the present invention is shown. The result data may include chromosome ID, gene mutation (SNP) ID, locus (base-pair) information, information on tested alleles, effect size calculation criteria, prior literature information, and the like.

Referring to FIG. 9, the data format of genomic data labeled with a disease according to an embodiment of the present invention is shown. The genomic data labeled with a disease may include a genetic mutation identification code (SNP rs number), covariate information (covariate), and target disease information (phenotype).

Referring to FIG. 10, the process of calculating, through a meta-analysis according to an embodiment of the present invention, the odds ratio (OR) for each prior literature describing the association between a specific genetic variant and a disease, and the target disease influence score of the specific genetic variant, is shown.

Referring to Figure 10, Abraham, R (2009), Allen, M. (Mayo Cohort) (2014), and so on represent individual prior literature. The table of Figure 6 records the odds ratio (OR) and the 95% confidence interval (95% CI) of each individual prior literature, and shows that an overall effect size (overall OR) of 1.03 was calculated by combining the odds ratios of the individual prior literature.

Referring to FIG. 11, the criteria for classifying genetic mutations into a plurality of groups according to the degree of overlap among the genetic mutations included in the three disease-inducing factor candidate lists generated by performing GWAS analysis, AI analysis, and meta-analysis, respectively, are shown. The genetic mutations included in the three generated lists are compared with one another and, according to the degree of overlap, that is, the degree of intersection, can be classified into a group included in all three disease-inducing factor candidate lists, groups included in two of the lists, and groups included in only one list.

Referring to FIG. 12, the genetic mutations included in the three disease-inducing factor candidate lists are shown classified into 9 groups and 3 priority levels. As shown in FIG. 8, they can be classified into 1 group included in all three disease-causing factor candidate lists, 4 groups included in two of the lists, and 4 groups included in one list, and each group can be generated by an intersection combination of the lists.

According to an embodiment of the present invention, genetic mutations can be classified into a high-risk group, an intermediate-risk group, and a low-risk group by sorting them based on the per-disease genetic mutation correlation score of each genetic mutation included in the plurality of graded genetic mutation lists. Using this, as shown in FIG. 13, the expected incidence rate of the disease over elapsed time for people having the gene can be plotted as a graph and provided to the user for each risk group.

According to an embodiment of the present invention, the method for generating disease outbreak information based on time-dependent correlation using polygenic risk scores may be driven by a disease outbreak information generating device including at least one processor.

A plurality of disease-inducing factor candidate lists are generated by receiving genome data on a plurality of persons or a plurality of preceding literatures, and genetic mutations are classified into a plurality of groups to prioritize them (S10).

According to an embodiment of the present invention, a plurality of disease-inducing factor candidate lists are generated by receiving genomic data or a plurality of prior literatures for a plurality of persons, performing a plurality of analyses, and genes included in the plurality of disease-causing factor candidate lists. Mutations may be classified into a plurality of groups, and the classified groups may be divided into a plurality of priority levels.

According to an embodiment of the present invention, a list of disease-inducing factor candidates may be generated for each analysis by analyzing genomic data of a plurality of persons or a plurality of prior literature.

According to an embodiment of the present invention, genetic mutations included in the disease-inducing factor candidate list generated for each analysis are classified into a plurality of groups, and the classified plurality of groups can be divided into priority levels each including at least one group.

According to an embodiment of the present invention, a plurality of analyzes may be performed to select disease-inducing factor candidates by receiving genomic data on a plurality of persons or a plurality of prior literature.

According to an embodiment of the present invention, a plurality of disease-inducing factor candidate lists including a plurality of gene mutations selected as disease-inducing factor candidates for each of a plurality of analyzes may be generated.

According to an embodiment of the present invention, the genetic mutations selected as disease-causing factor candidates through at least one of GWAS analysis, AI analysis, and meta-analysis, together with the analysis result data of the selected genetic mutations, can be arranged in a list format for each analysis result to generate a disease-inducing factor candidate list for each analysis result.

According to an embodiment of the present invention, genome data for a large number of people is input and genome-wide association analysis is performed for the target disease; the P value calculated for each genetic mutation as a result is compared with a preset threshold, and a plurality of genetic mutations whose P value is below the threshold can be selected as disease-causing factor candidates.

According to an embodiment of the present invention, for the plurality of genetic mutations selected as disease-inducing factor candidates, it is determined whether the loci of the genetic mutations are in linkage disequilibrium with one another, and, according to the determination result, final disease-inducing factor candidates can be generated by selecting only one representative genetic mutation for each locus.

According to an embodiment of the present invention, LD clumping can be performed on the plurality of gene mutations selected as disease-causing factor candidates in order to select only one representative genetic mutation for each locus. As the selection criterion, the genetic mutations can be ranked by the importance score calculated for each genetic mutation, and the highest-ranked genetic mutation can be selected.
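A minimal sketch of such a selection is given below, assuming a greedy procedure in which variants are visited in descending order of importance score and a variant is kept only if it is not in strong LD with a variant already kept. The pairwise r² table, the 0.5 cut-off, and the function name `ld_clump` are assumptions made for illustration; practical clumping tools typically also restrict comparisons to a physical distance window, which is omitted here.

```python
def ld_clump(scores, r2, r2_threshold=0.5):
    """Greedy clumping: keep the top-scoring variant of each LD cluster.

    scores: SNP ID -> importance score (higher is more important).
    r2: frozenset({snp_a, snp_b}) -> pairwise LD r^2; missing pairs count as 0.
    """
    kept = []
    for snp in sorted(scores, key=scores.get, reverse=True):
        in_ld = any(r2.get(frozenset((snp, k)), 0.0) >= r2_threshold for k in kept)
        if not in_ld:
            kept.append(snp)
    return kept


# Hypothetical scores and LD values: rsA and rsB tag the same locus.
scores = {"rsA": 0.9, "rsB": 0.8, "rsC": 0.4}
r2 = {frozenset(("rsA", "rsB")): 0.85, frozenset(("rsB", "rsC")): 0.1}
representatives = ld_clump(scores, r2)
print(representatives)
```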

According to an embodiment of the present invention, genome-wide association analysis can be performed to generate result data in the form of a data table having a plurality of field values as items, which may include the chromosome ID, the SNP ID, and the P value calculated for each genetic mutation.

According to an embodiment of the present invention, genomic data for a plurality of persons labeled with a disease is input to an artificial neural network-based disease-causing factor prediction model to output an importance score for each genetic mutation, and, among the output importance scores, a plurality of genetic mutations having an importance score exceeding a preset score may be selected as disease-inducing factor candidates.

According to an embodiment of the present invention, the importance score for each genetic variant can be calculated by randomly shuffling the order of each genetic variant, creating a model in which the genetic variant whose importance is to be determined is defined as noise, and quantifying the model's dependence on that genetic variant.
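The description above resembles what is commonly called permutation importance. The following sketch is written under that assumption: one variant's column is shuffled so that it carries only noise, and the resulting drop in prediction accuracy is taken as the model's dependence on that variant. The toy predictor, the genotype data, and all names are hypothetical and do not represent the artificial neural network of the embodiment.

```python
import random


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one genotype column."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in X]
        rng.shuffle(shuffled)  # the column now behaves as noise
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(permuted))
    return sum(drops) / n_repeats


def toy_predict(row):
    """Stand-in model: predicts disease when variant 0 has a risk allele."""
    return 1 if row[0] >= 1 else 0


# Hypothetical genotypes (risk allele counts for two variants) and labels.
X = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [2, 0]]
y = [0, 1, 1, 0, 1, 1]
signal_importance = permutation_importance(toy_predict, X, y, column=0)
noise_importance = permutation_importance(toy_predict, X, y, column=1)
print(signal_importance, noise_importance)
```

A variant the model ignores yields an importance of zero, whereas a variant the model relies on yields a positive value.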

According to one embodiment of the present invention, a plurality of prior literature published on the subject of a genetic mutation for a target disease is input into a meta-analysis model; the effect size corresponding to the subject of the genetic mutation is calculated for each of the plurality of prior literature; the reciprocal of the variance of the calculated effect size is applied as a weight to the effect size of each prior literature to measure the target disease influence score for each genetic variant; and a plurality of genetic mutations can be selected as disease-causing factor candidates based on the target disease influence score for each genetic variant.

According to an embodiment of the present invention, the effect size corresponding to the subject of the genetic mutation for each of the plurality of prior literature can be obtained by calculating the odds ratio and confidence interval for each prior literature and estimating, based on the odds ratio and confidence interval, the effect size of the genetic mutation on the target disease.

According to one embodiment of the present invention, a plurality of prior literature can be analyzed by systematically reviewing prior literature describing the same topic, that is, the effect of a specific genetic mutation on a specific disease, and the result value (effect size) matching the topic can be extracted from the finally selected literature and used.

The weight of each prior literature can be calculated as in Equation 4, and the overall effect size (OR_pooled) can be calculated by summing all the values obtained by multiplying the odds ratio of each prior literature by its weight calculated as in Equation 4.
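A minimal sketch of inverse-variance pooling consistent with this description is shown below. It assumes that the variance of each study is derived from its 95% confidence interval on the log odds ratio scale and that the weights are normalised to sum to 1; both are assumptions, since Equation 4 itself is not reproduced here. Following the description, the weights are applied to the odds ratios directly, although pooling is conventionally carried out on the log odds ratio scale. The study values are hypothetical.

```python
import math


def pooled_odds_ratio(studies):
    """Inverse-variance weighted sum of odds ratios.

    studies: list of (odds_ratio, ci_lower, ci_upper) with 95% CIs.
    """
    weights = []
    for _, lower, upper in studies:
        # Standard error of log(OR) recovered from the 95% CI width (assumption).
        se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
        weights.append(1.0 / se ** 2)
    total = sum(weights)
    return sum(w / total * odds for w, (odds, _, _) in zip(weights, studies))


# Hypothetical studies of one genetic variant: (OR, 95% CI lower, 95% CI upper).
studies = [(1.20, 0.95, 1.52), (0.98, 0.90, 1.07), (1.05, 0.80, 1.38)]
or_pooled = pooled_odds_ratio(studies)
print(round(or_pooled, 3))
```

Because the second study has the narrowest confidence interval, it receives the largest weight and pulls the pooled value towards its odds ratio.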

According to an embodiment of the present invention, genetic mutations included in the plurality of disease-inducing factor candidate lists may be classified into a plurality of groups according to the degree of overlap among the gene mutations included in the plurality of disease-causing factor candidate lists.

According to an embodiment of the present invention, gene mutations may be classified into a plurality of groups by determining whether the lists intersect one another according to the degree of overlap among genetic mutations.

According to one embodiment of the present invention, for the genetic variants included in the three disease-inducing factor candidate lists, it is determined whether each variant is included in the intersection of the lists, and the genetic variants can be classified into 9 groups according to how many lists they intersect with.

According to an embodiment of the present invention, the 9 groups can be classified into priority levels 1, 2, and 3, with 1 group in level 1, 4 groups in level 2, and 4 groups in level 3.
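The assignment of priority levels by degree of overlap can be sketched as follows, assuming that a variant's level is determined solely by the number of candidate lists that contain it (three lists: level 1, two lists: level 2, one list: level 3). The sketch does not reproduce the nine-group partition of the embodiment, and the SNP identifiers and the function name `priority_levels` are hypothetical.

```python
def priority_levels(gwas, ai, meta):
    """Map priority level (1, 2, 3) to the variants found in 3, 2, or 1 lists."""
    lists = [set(gwas), set(ai), set(meta)]
    levels = {1: set(), 2: set(), 3: set()}
    for variant in set().union(*lists):
        hits = sum(variant in candidates for candidates in lists)
        levels[4 - hits].add(variant)  # 3 hits -> level 1, 1 hit -> level 3
    return levels


# Hypothetical candidate lists from the three analyses.
levels = priority_levels(
    gwas=["rs1", "rs2", "rs3"],
    ai=["rs1", "rs2"],
    meta=["rs1", "rs4"],
)
print(levels)
```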

At least one disease-related factor is selected by receiving examination result data or multiple disease-related data, and a plurality of groups is generated, using a group trend model, based on changes in individual examination result values included in the examination result data of a large number of people (S20).

According to an embodiment of the present invention, examination result data including examination results over time of a plurality of persons, or a plurality of disease-related data, is received and a plurality of analyses are performed to select at least one disease-related factor, and, using a group trend model, the plurality of persons may be grouped based on changes in individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor to generate a plurality of groups.

According to an embodiment of the present invention, a plurality of analyzes are performed by receiving examination result data or a plurality of disease-related data including examination results over time of a plurality of persons, and at least one disease-related factor according to each analysis result can be selected.

According to an embodiment of the present invention, a plurality of persons are grouped, using a group trend model, based on changes in individual checkup result values included in the checkup result data of the plurality of persons for at least one disease-related factor, so that a plurality of groups can be created. Here, the group trend model may refer to a method of classifying behavioral types over time into clusters, estimating the trajectory shape of each group, and verifying the number of groups that best fits the data.

According to an embodiment of the present invention, a plurality of analyzes may be performed to select disease-related factor candidates by receiving examination result data including examination results of a plurality of persons over time or a plurality of disease-related data.

According to an embodiment of the present invention, at least one of disease association analysis, big data analysis, and meta-analysis for a target disease can be performed by receiving examination result data including examination results over time of a plurality of persons or a plurality of disease-related data.

According to an embodiment of the present invention, three disease-related factor candidates can be selected according to each analysis result by receiving examination result data or a plurality of disease-related data and performing disease correlation analysis, big data analysis, and meta-analysis.

According to an embodiment of the present invention, correlation analysis of a plurality of disease-related factors with respect to the possibility of onset of a target disease is performed on examination result data including examination results over time of a plurality of persons, and the disease-related factors found to be highly correlated may be selected as disease-related factor candidates.

According to an embodiment of the present invention, a correlation analysis of a plurality of disease-related factors with respect to the possibility of developing a target disease can be performed by inputting examination result data including examination results of a plurality of persons over time into a disease correlation analysis model.

According to an embodiment of the present invention, a plurality of data can be collected by crawling a database in which text-based disease-related data is stored, and text mining can be performed on the collected data to select disease-related factor candidates.

According to an embodiment of the present invention, text-based data from databases such as NCBI DB, OMIM, Diseases Card, and open DBs can be crawled to collect disease names, related items, cause information, and the like, and the association between a target disease and a plurality of disease-related factors can be analyzed by selecting and deriving significant related items through text mining of the collected data.

According to an embodiment of the present invention, a plurality of disease-related data on the subject of a target disease and its relationship to disease-related factors are input into a meta-analysis model, the effect size for each disease-related factor is calculated for each of the plurality of disease-related data, and disease-related factor candidates can be selected according to the effect size.

According to an embodiment of the present invention, at least one disease-related factor may be selected according to an overlapping degree among a plurality of disease-related factors selected as disease-related factor candidates for a plurality of analyses.

According to an embodiment of the present invention, a plurality of disease-related factor candidates generated by performing at least one or more of disease-related analysis, big data analysis, and meta-analysis are respectively compared, and diseases included in all of the disease-related factor candidates generated Only relevant factors can be selected as disease-related factors.

According to an embodiment of the present invention, among the examination result data of a plurality of persons for the selected at least one disease-related factor, the data for any disease-related factor that requires secondary processing can be processed according to a preset pre-processing criterion.

According to an embodiment of the present invention, when a disease-related factor included in the selected at least one disease-related factor is classified, according to a preset pre-processing criterion, as one whose individual examination result value cannot be used directly as a tendency criterion or a judgment criterion, pre-processing may be performed to calculate or reprocess the examination result data according to the preset pre-processing criterion so that it can be used as a tendency criterion or a judgment criterion, thereby generating time-series examination data for each period for the persons examined.

[Pre-processing standard]

(1) Preprocessing with diabetes status data using fasting blood glucose

: Fasting blood sugar < 100 (normal),

100 ≤ fasting blood glucose <126 (impaired fasting blood sugar)

126 ≤ fasting blood sugar (diabetes)

(2) Preprocessing with hypertension data using blood pressure

: Systolic blood pressure < 120 or diastolic blood pressure < 80 (normal)

1 ≤ systolic blood pressure or 100 ≤ diastolic blood pressure (stage 2 hypertension)

(3) Preprocessing with dyslipidemia data using total cholesterol or LDL

(4) Pre-processing with obesity data using BMI

: BMI < 18.5 (underweight)

18.5 ≤ BMI < 25.0 (normal)

25.0 ≤ BMI < 30.0 (overweight)

30 ≤ BMI (obesity)

30.0 ≤ BMI < 35.0 (moderately obese)

35.0 ≤ BMI < 40.0 (severely obese)

40 ≤ BMI (extremely obese)
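The pre-processing standard above can be illustrated by the following minimal sketch, which converts raw examination values into status categories and lists them in time series. Only the fasting blood glucose and BMI criteria are shown. A fasting blood glucose value of exactly 126 is treated as diabetes here, which is an assumption, and the function names and visit data are hypothetical.

```python
def glucose_status(fasting_glucose):
    """Diabetes status from fasting blood glucose, per criterion (1)."""
    if fasting_glucose < 100:
        return "normal"
    if fasting_glucose < 126:
        return "impaired fasting glucose"
    return "diabetes"  # assumption: 126 itself counts as diabetes


def bmi_status(bmi):
    """Obesity status from BMI, per criterion (4), using the finer obesity bands."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    if bmi < 35.0:
        return "moderately obese"
    if bmi < 40.0:
        return "severely obese"
    return "extremely obese"


# Hypothetical yearly examinations of one person: (year, fasting glucose).
visits = [("2019", 95), ("2020", 110), ("2021", 131)]
series = [(year, glucose_status(value)) for year, value in visits]
print(series)
```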

According to an embodiment of the present invention, individual checkup result values for the disease-related factors included in the selected at least one disease-related factor are collected from the checkup result data, and pre-processing is performed to list the collected individual checkup result values in time series, so that time-series examination data for each period can be generated for all persons examined.

According to an embodiment of the present invention, the preset pre-processing criterion may include information on the types of disease-related factors for which the group trend model cannot produce a result value when individual checkup result values are input without pre-processing, and information on the pre-processing method for each such type of disease-related factor.

According to an embodiment of the present invention, based on changes in individual checkup result values included in checkup result data of a plurality of persons for at least one disease-related factor over time using a group trend model, the plurality of persons A plurality of groups can be created by grouping.

According to an embodiment of the present invention, the individual checkup result values included in the checkup result data of a plurality of persons for at least one disease-related factor are input into a group trend model to calculate the probability that each individual observation of the data belongs to each group. By assuming and estimating a different distribution at each time point according to the properties of the probability density function of the dependent variable, a plurality of groups, each including a plurality of people, can be generated according to changes in individual checkup result values, as shown in Equation 5.
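The membership probability described above can be sketched as follows, under simplifying assumptions that are not taken from Equation 5: the group trajectories and group proportions are treated as already known, and each observation is assumed to be normally distributed around its group trajectory with a common standard deviation. An actual group trend model would estimate these quantities from the data. All names and values are hypothetical.

```python
import math


def posterior_membership(observations, trajectories, priors, sigma=1.0):
    """Probability that one person's time series belongs to each group.

    observations: values at successive time points for one person.
    trajectories: one expected value series per group (same length).
    priors: group proportions. sigma: assumed common standard deviation.
    """
    likelihoods = []
    for trajectory, prior in zip(trajectories, priors):
        # Gaussian log-likelihood up to a constant shared by all groups.
        log_l = sum(-0.5 * ((y - m) / sigma) ** 2
                    for y, m in zip(observations, trajectory))
        likelihoods.append(prior * math.exp(log_l))
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Hypothetical trajectories: a stable group and a rising group.
trajectories = [[5.0, 5.5, 6.0], [7.0, 8.0, 9.0]]
posterior = posterior_membership([5.0, 5.5, 6.1], trajectories, priors=[0.5, 0.5])
assigned_group = posterior.index(max(posterior))
print(posterior, assigned_group)
```

Each person would then be assigned to the group with the highest membership probability.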

According to an embodiment of the present invention, the trajectory shape of the individual examination result values for each disease-related factor included in the examination result data of the persons in each group can be estimated, and the suitability of the group classification can be verified by comparing the differences in trajectory shape between groups.

Using the PRS model designed for each group, the multigene risk score for each genetic variant and the group multigene risk score are calculated for the plurality of genetic variants included in each of the plurality of groups classified for the genome data (S30).

According to an embodiment of the present invention, a PRS model is designed for each group for the plurality of genetic mutations included in each of the plurality of groups classified in the genomic data preprocessing unit, and, using the PRS model, the association with respect to the number of risk alleles of the genetic mutations in each group is calculated as a weight, so that the polygenic risk score for each genetic mutation and the group polygenic risk score can be calculated.

According to an embodiment of the present invention, the group polygenic risk score can be calculated as the weighted sum obtained by applying the association, as a weight, to the number of risk alleles of the genetic mutations in each group.
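The weighted sum described above can be sketched as follows, assuming the commonly used form in which each variant contributes its weight multiplied by the person's risk allele count. The weights, allele counts, group names, and function names are hypothetical.

```python
def polygenic_risk_score(risk_allele_counts, weights):
    """Weighted sum of risk allele counts (0, 1, or 2 per variant)."""
    return sum(w * risk_allele_counts.get(snp, 0) for snp, w in weights.items())


def group_scores(risk_allele_counts, weights_by_group):
    """One polygenic risk score per variant group."""
    return {group: polygenic_risk_score(risk_allele_counts, weights)
            for group, weights in weights_by_group.items()}


# Hypothetical person and per-group association weights.
person = {"rs1": 2, "rs2": 1, "rs3": 0}
weights_by_group = {
    "level1": {"rs1": 0.25},
    "level2": {"rs2": 0.5, "rs3": 0.125},
}
scores = group_scores(person, weights_by_group)
print(scores)
```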

According to one embodiment of the present invention, weights can be calculated through regression analysis between a genetic variation and a target disease (phenotype). According to another embodiment, because there is an association (LD) between genetic variations (SNPs), the estimate of the weights obtained from a general regression may need to be adjusted.

The calculated multigene risk score for each genetic mutation is applied as a covariate to the model for calculating time-dependent correlation, and the individual checkup result values for each disease-related factor included in the checkup result data of persons included in each group generated for the checkup result data are input into the time-dependent correlation calculation model to calculate the correlation size for each time point and select it as a time-series characteristic variable (S40).

According to an embodiment of the present invention, the time-dependent correlation calculation model may be expressed as Equation 2 in order to use time-dependent COX regression.

In Equation 2, one term means the COX regression analysis formula using the variables that are not affected by the flow of time and their association with the response variable, and the other term means the COX regression analysis formula using the plurality of variables that are affected by the flow of time and their association with the response variable.

The variables that are not affected by the flow of time may mean digitized values for variables that do not fluctuate over time, such as gender, genotype, disease history, and polygenic risk score. In the formula, the polygenic risk score for each genetic mutation can be input as one of these variables, and the association index between the polygenic risk score for each genetic mutation and the response variable can be input as the corresponding coefficient.
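As a non-limiting illustration, a time-dependent COX model is often written so that the hazard relative to the baseline equals the exponential of a linear predictor made up of a time-fixed part and a time-varying part. The sketch below assumes that form; it is not a reproduction of Equation 2, and the covariate names and coefficient values are hypothetical.

```python
import math


def relative_hazard(fixed, beta, varying_at_t, gamma):
    """exp(beta . fixed + gamma . varying(t)), the hazard relative to baseline.

    fixed: time-fixed covariates such as sex or polygenic risk score.
    varying_at_t: time-varying covariates observed at time t.
    """
    linear = sum(beta[name] * fixed[name] for name in beta)
    linear += sum(gamma[name] * varying_at_t[name] for name in gamma)
    return math.exp(linear)


# Hypothetical coefficients: the PRS doubles the hazard per unit.
beta = {"sex": 0.0, "prs": math.log(2.0)}
gamma = {"glucose_z": 0.5}
person = {"sex": 1, "prs": 1.0}

hazard_early = relative_hazard(person, beta, {"glucose_z": 0.0}, gamma)
hazard_late = relative_hazard(person, beta, {"glucose_z": 1.0}, gamma)
print(hazard_early, hazard_late)
```

The same person has a different relative hazard at different time points because only the time-varying covariate changes.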

At least one time-series characteristic variable calculated for the checkup result data for disease-related factors is applied, and COX regression analysis is performed for each group on the checkup result data for the disease-related factors of the large number of people to which it has been applied, so that the risk of disease occurrence is calculated for each group (S50).

According to an embodiment of the present invention, at least one time-series characteristic variable calculated for the checkup result data for disease-related factors is applied, and COX regression analysis can be performed for each group on the checkup result data for the disease-related factors of the plurality of people to which it has been applied, to calculate the disease risk for each group.

According to an embodiment of the present invention, the at least one calculated time-series characteristic variable may be applied to the examination result data for disease-related factors by multiplying each value of the disease-inducing factor at each time point of the examination result data by the time-series characteristic variable calculated for that time point.
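The multiplication described above can be sketched as follows. The examination values, the per-time-point characteristic variable, and the function name are hypothetical.

```python
def apply_time_series_variable(values_by_time, variable_by_time):
    """Multiply each examination value by the variable for the same time point."""
    return {t: value * variable_by_time[t] for t, value in values_by_time.items()}


# Hypothetical fasting glucose values and per-year association sizes.
glucose = {2019: 100.0, 2020: 120.0}
association = {2019: 0.5, 2020: 0.25}
weighted = apply_time_series_variable(glucose, association)
print(weighted)
```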

Risk change is calculated using the difference between the calculated risk of disease occurrence for each group, and outbreak prediction information is generated (S60).

According to an embodiment of the present invention, risk change may be calculated using a difference value between the calculated risk of disease occurrence for each group, and thus onset prediction information may be generated.

According to an embodiment of the present invention, the calculated risk of disease occurrence for each group can be compared for each time point to calculate the risk change at a specific time point for each group, and outbreak prediction information can be generated based on this.

According to an embodiment of the present invention, the average value of the risk change calculated at a specific time point for each group can be specified as the risk change amount at that time point, and, based on this, the expected incidence rate at a later time point can be calculated to generate onset prediction information.
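A minimal sketch of this calculation is given below. It assumes that the risk change of a group is the difference between its risk at consecutive time points, and that the expected value at the next time point is obtained by adding the averaged change to the last observed risk. The description does not fix either choice, so both are assumptions, and all names and values are hypothetical.

```python
def risk_changes(risk_by_group):
    """Per-group differences in risk between consecutive time points."""
    return {group: [later - earlier for earlier, later in zip(risks, risks[1:])]
            for group, risks in risk_by_group.items()}


def mean_change_at(changes, index):
    """Average of the groups' risk changes at one time step."""
    values = [group_changes[index] for group_changes in changes.values()]
    return sum(values) / len(values)


# Hypothetical disease risk per group at three successive time points.
risk_by_group = {"group_a": [0.125, 0.25, 0.5], "group_b": [0.0, 0.125, 0.125]}
changes = risk_changes(risk_by_group)
latest_change = mean_change_at(changes, 1)

# Assumed linear extrapolation to the next time point.
projected = {group: risks[-1] + latest_change
             for group, risks in risk_by_group.items()}
print(changes, latest_change, projected)
```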

Embodiments of the present invention are not implemented only through the devices and/or methods described above. Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also belong to the scope of the present invention.

Claims

A genomic data pre-processing unit that receives genomic data of a plurality of individuals or a plurality of prior literature, performs a plurality of analyses to generate a plurality of disease-causing factor candidate lists, and classifies genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups;

A checkup result data pre-processing unit that receives examination result data including examination results of a large number of people over time, or a plurality of disease-related data, performs a plurality of analyses to select at least one disease-related factor, and, using a group trend model, creates a plurality of groups by grouping the plurality of persons based on changes in individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor;

A multi-gene risk score calculation unit that designs a PRS model for each group for the plurality of genetic variants included in each of the plurality of groups classified in the genomic data pre-processing unit, and, using the PRS model, calculates the association with respect to the number of risk alleles of the genetic variants in each group as a weight, thereby calculating a multi-gene risk score for each genetic mutation and a multi-gene risk score for each group;

A time-series characteristic parameterization unit that applies the calculated multigene risk score for each genetic mutation as a covariate to a time-dependent correlation calculation model, inputs into the time-dependent correlation calculation model the individual examination result values for each disease-related factor included in the examination result data of the persons included in each group generated in the checkup result data pre-processing unit, calculates the magnitude of the correlation between the independent variable and the dependent variable to be input into the COX model, and selects it as a time-series characteristic variable;

A risk calculation unit that applies the time-series characteristic variable calculated for the checkup result data for the disease-related factors, and performs COX regression analysis for each group on the checkup result data for the disease-related factors of the large number of persons to which it has been applied, thereby calculating a risk of disease occurrence for each group; and

An apparatus for generating disease outbreak information based on time-dependent association using a multi-gene risk score, characterized in that it comprises an outbreak prediction information generation unit that generates outbreak prediction information by calculating the risk change using the difference value between the calculated risks of disease occurrence for each group.
The apparatus of claim 1, wherein the genomic data pre-processing unit,

a disease-inducing factor screening unit that performs a plurality of analyzes to select disease-causing factor candidates by receiving genomic data for a plurality of individuals or a plurality of prior literature;

a disease-causing factor candidate list generation unit generating a plurality of disease-causing factor candidate lists including a plurality of gene mutations selected as disease-causing factor candidates for each of the plurality of analyses;

a gene mutation group classification unit which classifies the gene mutations included in the plurality of disease-inducing factor candidate lists into a plurality of groups according to the degree of overlap among the gene mutations included in the plurality of disease-causing factor candidate lists;

A priority level classification unit that divides the classified plurality of groups into a plurality of priority levels, removes overlapping genetic variations among the plurality of genetic variations included in each priority level, and generates a genetic variation list for each level; an apparatus for generating disease onset information based on time-dependent association using a multi-gene risk score, further comprising the above units.
The apparatus of claim 2, wherein the genomic data pre-processing unit,

An apparatus for generating disease onset information based on time-dependent association using a multigene risk score, characterized in that at least one of GWAS analysis, AI analysis, and meta-analysis is performed on the target disease by receiving genomic data for a large number of people or a plurality of prior literature.
The apparatus of claim 3, wherein the disease-causing factor selection unit,

An apparatus for generating disease onset information based on time-dependent association using a multi-gene risk score, further comprising a GWAS analysis unit that receives genome data for a large number of individuals, performs genome-wide association analysis for the target disease, compares the P value calculated for each genetic variation as a result with a preset threshold, and selects a plurality of genetic variations below the threshold as disease-inducing factor candidates.
The apparatus of claim 3, wherein the disease-causing factor selection unit,

An apparatus for generating disease onset information based on time-dependent association using a multi-gene risk score, further comprising an AI analysis unit that inputs genome data for a large number of people labeled with the disease into an artificial neural network-based disease-inducing factor prediction model to output an importance score for each genetic mutation, and selects, among the output importance scores, a plurality of genetic mutations whose importance score exceeds a preset score as disease-causing factor candidates.
The apparatus of claim 3, wherein the disease-causing factor selection unit,

An apparatus for generating disease onset information based on time-dependent association using a multi-gene risk score, further comprising a meta-analysis performing unit that inputs a plurality of prior literature published on the subject of a genetic variation for the target disease into a meta-analysis model, calculates an effect size corresponding to the subject of the genetic variation for each of the plurality of prior literature, applies the reciprocal of the variance of the calculated effect size as a weight to the effect size of each prior literature to measure the target disease influence score for each genetic variant, and selects a plurality of genetic variants as disease-inducing factor candidates based on the target disease influence score for each genetic variant.
The apparatus of claim 4, wherein the GWAS analysis unit,

An apparatus for generating disease onset information based on time-dependent association using a multi-gene risk score, characterized in that, for the plurality of gene mutations selected as disease-inducing factor candidates, it is determined whether the loci of the genetic mutations are in linkage disequilibrium with one another, and, according to the determination result, only one representative genetic mutation is selected for each locus to generate final disease-inducing factor candidates.
The apparatus of claim 5, wherein, in the AI analysis unit,

the genome data for the plurality of persons labeled with the disease includes a genetic mutation identification code, covariate information, and target disease information.
The apparatus of claim 5, wherein, in the AI analysis unit,

the artificial neural network-based disease-causing factor prediction model is trained to receive the genetic mutation identification codes, covariate information, and target disease information included in the genome data for the plurality of persons and to output an importance score for each genetic mutation with respect to the target disease.
The apparatus of claim 5, wherein, in the AI analysis unit,

the importance score for each genetic mutation is calculated by randomly shuffling the values of the genetic mutation whose importance is to be determined, so that the genetic mutation is treated as noise, and quantifying the dependence of the model on that genetic mutation.
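The shuffling procedure in this claim resembles what is commonly called permutation importance. The following is a minimal sketch under that reading: one input column is shuffled so that it carries no information, and the resulting drop in accuracy is taken as the model's dependence on that column. The toy `model`, the data, and the use of accuracy as the metric are assumptions for illustration and are not taken from the specification.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn.

    rows must be tuples of equal length; one score is returned per column.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    scores = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # the column now acts as noise
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        scores.append(sum(drops) / n_repeats)
    return scores


# Toy data: risk allele counts at two mutations; only the first matters.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
labels = [int(r[0] > 0) for r in rows]
model = lambda r: int(r[0] > 0)  # hypothetical stand-in for a trained network
scores = permutation_importance(model, rows, labels)
```

Because the toy model ignores the second column, shuffling it changes nothing and its score is zero, while the first column receives a positive score.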
The apparatus of claim 6, wherein the meta-analysis performing unit

calculates an odds ratio and a confidence interval for each of the plurality of prior documents as the effect size for the genetic mutation, and estimates, based on the odds ratio and the confidence interval, the effect size of the genetic mutation on the target disease for each prior document.
The apparatus of claim 11, wherein the meta-analysis performing unit

calculates a weight for each prior document from the effect size through inverse variance estimation, applies the weight to the odds ratio calculated for each prior document, and sums the weighted odds ratios of the prior documents to calculate the target disease influence score.
The apparatus of claim 3, wherein the gene mutation group classification unit

classifies the genetic mutations into nine groups according to the degree of overlap among the genetic mutations included in the three disease-causing factor candidate lists generated by performing the GWAS analysis, the AI analysis, and the meta-analysis, respectively.
The apparatus of claim 13, wherein the priority class classification unit

classifies the nine groups into priority levels 1, 2, and 3, with one group at level 1, four groups at level 2, and four groups at level 3.
The apparatus of claim 4, wherein, in the multi-gene risk score calculation unit,

the association with the number of risk alleles of the genetic mutations in each group is the association derived from the result of the GWAS analysis.
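A polygenic risk score is commonly computed as a weighted sum of risk allele counts, with the per-mutation association from GWAS used as the weight. The sketch below assumes that additive form; the claims do not spell out the formula, and the function name, mutation IDs, and weights are hypothetical.

```python
def polygenic_risk_score(allele_counts, weights):
    """Additive PRS (assumed form): sum of weight * risk allele count.

    allele_counts: dict mapping mutation ID to risk allele count (0, 1, or 2)
    weights:       dict mapping mutation ID to its GWAS-derived association
    Mutations missing from allele_counts contribute zero.
    """
    return sum(w * allele_counts.get(variant, 0) for variant, w in weights.items())


# Illustrative values only: one person's genotype and one group's weights.
person = {"rs1": 2, "rs2": 1, "rs3": 0}
group_weights = {"rs1": 0.5, "rs2": 0.25, "rs3": 0.5}
prs = polygenic_risk_score(person, group_weights)
```

Computing the score once per group, with each group's own mutation list and weights, would give a group-level score of the kind the claims refer to.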
The apparatus of claim 1,

further comprising a PRS model verification unit that verifies the PRS model according to whether the PRS model is for a continuous target disease or a discrete target disease, and determines whether to use or redesign the PRS model.
The apparatus of claim 1, wherein the examination result data pre-processing unit comprises:

a correlation analysis performing unit that receives checkup result data including checkup results of a plurality of persons over time, or a plurality of disease-related data, and performs a plurality of analyses for selecting disease-related factor candidates;

a disease-related factor selection unit that selects at least one disease-related factor according to the degree of overlap among the plurality of disease-related factors selected as disease-related factor candidates in each of the plurality of analyses;

a pre-processing unit that processes, according to preset preprocessing criteria, the data for any disease-related factor requiring secondary processing among the checkup result data of the plurality of persons for the selected at least one disease-related factor; and

a data group classification unit that generates a plurality of groups by grouping the plurality of persons, using a group trend model, based on changes over time in the individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor.
The apparatus of claim 17, wherein the correlation analysis performing unit

receives the checkup result data including the checkup results of the plurality of persons over time, or the plurality of disease-related data, and performs at least one of disease correlation analysis, big data analysis, and meta-analysis for the target disease.
The apparatus of claim 18, wherein the correlation analysis performing unit

further comprises a disease association analysis unit that performs, on the checkup result data including the checkup results of the plurality of persons over time, a correlation analysis of a plurality of disease-related factors with the likelihood of onset of the target disease, and selects the disease-related factors found to be highly correlated as disease-related factor candidates.
The apparatus of claim 18, wherein the correlation analysis performing unit

further comprises a big data analysis unit that collects a plurality of data by crawling a database in which text-based disease-related data is stored, and selects disease-related factor candidates by performing text mining on the collected data.
The apparatus of claim 18, wherein the correlation analysis performing unit

further comprises a meta-analysis performing unit that inputs a plurality of disease-related data addressing the target disease and the effects of disease-related factors into a meta-analysis model, calculates an effect size for each disease-related factor from each of the plurality of disease-related data, and selects disease-related factor candidates according to the effect size.
The apparatus of claim 18, wherein the disease-related factor selection unit

compares the plurality of disease-related factor candidates generated by performing at least one of the disease correlation analysis, the big data analysis, and the meta-analysis, and selects as disease-related factors only those candidates that appear in every generated set of disease-related factor candidates.
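Selecting only the factors that appear in every candidate set amounts to a set intersection. A minimal sketch follows; the function name and the factor names are hypothetical placeholders rather than factors named in the specification.

```python
def select_common_factors(*candidate_lists):
    """Return the factors present in every candidate list."""
    if not candidate_lists:
        return set()
    common = set(candidate_lists[0])
    for candidates in candidate_lists[1:]:
        common &= set(candidates)
    return common


# Illustrative candidate lists from three hypothetical analyses.
from_correlation = ["bmi", "glucose", "sbp"]
from_text_mining = ["glucose", "bmi", "ldl"]
from_meta_analysis = ["bmi", "glucose"]
selected = select_common_factors(from_correlation, from_text_mining, from_meta_analysis)
```

Since the claim allows "at least one" analysis to be performed, the function accepts any number of lists; with a single list, every candidate in it is selected.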
The apparatus of claim 17, wherein the preprocessing unit

collects, from the checkup result data, the individual checkup result values for each disease-related factor included in the selected at least one disease-related factor, and performs preprocessing that arranges the collected individual checkup result values in time series to generate time-series checkup data for the entire checkup target period.
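Arranging individual checkup result values in time series can be sketched as a group-and-sort step. The record layout assumed here, `(person_id, date, factor, value)` with ISO-format date strings, is hypothetical; the specification does not define a data format.

```python
from collections import defaultdict


def build_time_series(records, factor):
    """Collect one factor's values per person, ordered by checkup date.

    records: iterable of (person_id, date, factor, value) tuples, where
             date is an ISO string such as "2020-03-01" (assumed layout)
    Returns a dict mapping person_id to a list of values in date order.
    """
    observations = defaultdict(list)
    for person_id, date, name, value in records:
        if name == factor:
            observations[person_id].append((date, value))
    # ISO date strings sort chronologically when sorted as text.
    return {pid: [v for _, v in sorted(obs)] for pid, obs in observations.items()}


# Illustrative records only.
records = [
    ("p1", "2021-03-01", "glucose", 5.9),
    ("p1", "2019-03-01", "glucose", 5.1),
    ("p2", "2020-06-15", "glucose", 4.8),
    ("p1", "2020-03-01", "glucose", 5.4),
    ("p1", "2020-03-01", "bmi", 24.0),
]
glucose_series = build_time_series(records, "glucose")
```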
The apparatus of claim 17, wherein the preprocessing unit,

when a disease-related factor included in the selected at least one disease-related factor is classified, under the preset preprocessing criteria, as one whose individual checkup result values cannot be used as a trend criterion or judgment criterion, performs preprocessing that calculates or reprocesses values from the checkup result data according to the preset preprocessing criteria so that they can be used as a trend criterion or judgment criterion, and thereby generates time-series checkup data for the entire checkup target period.
The apparatus of claim 24, wherein, in the preprocessing unit,

the preset preprocessing criteria include information on the types of disease-related factors whose individual checkup result values cannot yield results when input into the group trend model without preprocessing, and information on the preprocessing method for those disease-related factors.
The apparatus of claim 17, wherein the data group classification unit

verifies the classification suitability of the classified groups by estimating, for each group, the trajectory shape of the individual checkup result values for each disease-related factor included in the checkup result data of the persons in that group, and comparing the differences in trajectory shape between the groups.
A method for generating disease onset information based on time-dependent association using a polygenic risk score, performed by a disease onset information generating apparatus including at least one processor, the method comprising:

receiving genome data for a plurality of persons or a plurality of prior documents, performing a plurality of analyses to generate a plurality of disease-causing factor candidate lists, and classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups;

receiving checkup result data including checkup results of a plurality of persons over time, or a plurality of disease-related data, performing a plurality of analyses to select at least one disease-related factor, and generating a plurality of groups by grouping the plurality of persons, using a group trend model, based on changes over time in the individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor;

designing a PRS model for each group for the plurality of genetic mutations included in each of the plurality of groups classified for the genome data, and calculating, using the PRS model, a polygenic risk score for each genetic mutation and a group polygenic risk score for each group, with the association with the number of risk alleles of the genetic mutations in each group applied as a weight;

applying the calculated polygenic risk score for each genetic mutation as a covariate to a time-dependent association calculation model, inputting into the time-dependent association calculation model the individual checkup result values for each disease-related factor included in the checkup result data of the persons in each group generated for the checkup result data, calculating the magnitude of the association at each point in time between the independent variables and the dependent variables to be input into a Cox model, and selecting time-series characteristic variables accordingly;

applying at least one of the time-series characteristic variables calculated for the checkup result data for the disease-related factors, and performing Cox regression analysis for each group on the checkup result data for the disease-related factors of the plurality of persons to calculate the risk of disease onset for each group; and

generating disease onset prediction information by calculating a change in risk from the difference in the calculated risk of disease onset for each group.
The method of claim 27, wherein the step of classifying the genetic mutations into a plurality of groups comprises:

performing a plurality of analyses for selecting disease-causing factor candidates by receiving the genome data for the plurality of persons or the plurality of prior documents;

generating a plurality of disease-causing factor candidate lists, each including the plurality of genetic mutations selected as disease-causing factor candidates in the corresponding analysis;

classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups according to the degree of overlap among the genetic mutations included in the plurality of disease-causing factor candidate lists; and

dividing the classified plurality of groups into a plurality of priority levels, and removing duplicate genetic mutations among the genetic mutations included in each priority level so that only one of each remains, to generate a genetic mutation list for each of the plurality of priority levels.
The method of claim 28, wherein performing the plurality of analyses comprises

receiving the genome data for the plurality of persons or the plurality of prior documents, and performing at least one of GWAS analysis, AI analysis, and meta-analysis for the target disease.
The method of claim 29, wherein performing the plurality of analyses

further comprises receiving the genome data for the plurality of persons, performing a genome-wide association study (GWAS) analysis for the target disease, comparing the P value calculated for each genetic mutation with a preset threshold, and selecting a plurality of genetic mutations whose P values are below the threshold as disease-causing factor candidates.
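The threshold comparison in this claim can be sketched as a simple filter over per-mutation P values. The default of 5e-8 used below is the conventional genome-wide significance level and is an assumption; the claim only refers to a preset threshold. The function name and the values are hypothetical.

```python
def select_by_p_value(p_values, threshold=5e-8):
    """Return the IDs of mutations whose P value is below the threshold.

    p_values:  dict mapping mutation ID to GWAS P value
    threshold: preset cutoff (5e-8 is an assumed, conventional default)
    """
    return sorted(variant for variant, p in p_values.items() if p < threshold)


# Illustrative P values only.
gwas_p = {"rs1": 1e-9, "rs2": 3e-6, "rs3": 4e-8}
candidates = select_by_p_value(gwas_p)
```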
The method of claim 29, wherein performing the plurality of analyses

further comprises inputting genome data for a plurality of persons labeled with the disease into an artificial neural network-based disease-causing factor prediction model to output an importance score for each genetic mutation, and selecting, as disease-causing factor candidates, a plurality of genetic mutations whose output importance scores exceed a preset score.
The method of claim 29, wherein performing the plurality of analyses

further comprises inputting a plurality of prior documents addressing genetic mutations associated with the target disease into a meta-analysis model, calculating an effect size for the genetic mutation from each of the plurality of prior documents, applying the reciprocal of the variance of each calculated effect size as a weight to the effect size of the corresponding prior document to obtain a target disease influence score for each genetic mutation, and selecting a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic mutation.
The method of claim 30, wherein the step of selecting a plurality of genetic mutations below the threshold as disease-causing factor candidates comprises

determining, for the plurality of genetic mutations selected as disease-causing factor candidates, whether the loci of the genetic mutations are in linkage disequilibrium, and, according to the determination result, selecting only one representative genetic mutation for each locus to generate final disease-causing factor candidates.
The method of claim 31, wherein, in the step of selecting a plurality of genetic mutations having an importance score exceeding the preset score as disease-causing factor candidates,

the genome data for the plurality of persons labeled with the disease includes a genetic mutation identification code, covariate information, and target disease information.
The method of claim 31, wherein, in the step of selecting a plurality of genetic mutations having an importance score exceeding the preset score as disease-causing factor candidates,

the artificial neural network-based disease-causing factor prediction model is trained to receive the genetic mutation identification codes, covariate information, and target disease information included in the genome data for the plurality of persons and to output an importance score for each genetic mutation with respect to the target disease.
The method of claim 31, wherein, in the step of selecting a plurality of genetic mutations having an importance score exceeding the preset score as disease-causing factor candidates,

the importance score for each genetic mutation is calculated by randomly shuffling the values of the genetic mutation whose importance is to be determined, so that the genetic mutation is treated as noise, and quantifying the dependence of the model on that genetic mutation.
The method of claim 32, wherein the step of selecting a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic mutation comprises

calculating an odds ratio and a confidence interval for each of the plurality of prior documents as the effect size for the genetic mutation, and estimating, based on the odds ratio and the confidence interval, the effect size of the genetic mutation on the target disease for each prior document.
The method of claim 37, wherein the step of selecting a plurality of genetic mutations as disease-causing factor candidates based on the target disease influence score for each genetic mutation comprises

calculating a weight for each prior document from the effect size through inverse variance estimation, applying the weight to the odds ratio calculated for each prior document, and summing the weighted odds ratios of the prior documents to calculate the target disease influence score.
The method of claim 29, wherein the step of classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups comprises

classifying the genetic mutations into nine groups according to the degree of overlap among the genetic mutations included in the three disease-causing factor candidate lists generated by performing the GWAS analysis, the AI analysis, and the meta-analysis, respectively.
The method of claim 39, wherein the step of classifying the genetic mutations included in the plurality of disease-causing factor candidate lists into a plurality of groups comprises

classifying the nine groups into priority levels 1, 2, and 3, with one group at level 1, four groups at level 2, and four groups at level 3.
The method of claim 31, wherein, in the step of calculating the polygenic risk score,

the association with the number of risk alleles of the genetic mutations in each group is the association derived from the result of the GWAS analysis.
The method of claim 41,

further comprising determining whether to use or redesign the PRS model by verifying the PRS model according to whether the PRS model is for a continuous target disease or a discrete target disease.
The method of claim 27, wherein the step of generating the plurality of groups comprises:

performing a plurality of analyses for selecting disease-related factor candidates by receiving the checkup result data including the checkup results of the plurality of persons over time, or the plurality of disease-related data;

selecting at least one disease-related factor according to the degree of overlap among the plurality of disease-related factors selected as disease-related factor candidates in each of the plurality of analyses;

processing, according to preset preprocessing criteria, the data for any disease-related factor requiring secondary processing among the checkup result data of the plurality of persons for the selected at least one disease-related factor; and

generating a plurality of groups by grouping the plurality of persons, using a group trend model, based on changes over time in the individual checkup result values included in the checkup result data of the plurality of persons for the at least one disease-related factor.
The method of claim 43, wherein the step of performing a plurality of analyses for selecting the disease-related factor candidates comprises

receiving the checkup result data including the checkup results of the plurality of persons over time, or the plurality of disease-related data, and performing at least one of disease correlation analysis, big data analysis, and meta-analysis for the target disease.
The method of claim 44, wherein the step of performing a plurality of analyses for selecting the disease-related factor candidates

further comprises performing, on the checkup result data including the checkup results of the plurality of persons over time, a correlation analysis of a plurality of disease-related factors with the likelihood of onset of the target disease, and selecting the disease-related factors found to be highly correlated as disease-related factor candidates.
The method of claim 44, wherein the step of performing a plurality of analyses for selecting the disease-related factor candidates

further comprises collecting a plurality of data by crawling a database in which text-based disease-related data is stored, and selecting disease-related factor candidates by performing text mining on the collected data.
The method of claim 44, wherein the step of performing a plurality of analyses for selecting the disease-related factor candidates

further comprises inputting a plurality of disease-related data addressing the target disease and the effects of disease-related factors into a meta-analysis model, calculating an effect size for each disease-related factor from each of the plurality of disease-related data, and selecting disease-related factor candidates according to the effect size.
The method of claim 44, wherein the step of selecting the at least one disease-related factor comprises

comparing the plurality of disease-related factor candidates generated by performing at least one of the disease correlation analysis, the big data analysis, and the meta-analysis, and selecting as disease-related factors only those candidates that appear in every generated set of disease-related factor candidates.
The method of claim 43, wherein processing the data according to the preset preprocessing criteria comprises

collecting, from the checkup result data, the individual checkup result values for each disease-related factor included in the selected at least one disease-related factor, and performing preprocessing that arranges the collected individual checkup result values in time series to generate time-series checkup data for the entire checkup target period.
The method of claim 43, wherein processing the data according to the preset preprocessing criteria comprises,

when a disease-related factor included in the selected at least one disease-related factor is classified, under the preset preprocessing criteria, as one whose individual checkup result values cannot be used as a trend criterion or judgment criterion, performing preprocessing that calculates or reprocesses values from the checkup result data according to the preset preprocessing criteria so that they can be used as a trend criterion or judgment criterion, and thereby generating time-series checkup data for the entire checkup target period.
The method of claim 50, wherein, in processing the data according to the preset preprocessing criteria,

the preset preprocessing criteria include information on the types of disease-related factors whose individual checkup result values cannot yield results when input into the group trend model without preprocessing, and information on the preprocessing method for those disease-related factors.
The method of claim 43, wherein the step of generating the plurality of groups comprises

verifying the classification suitability of the classified groups by estimating, for each group, the trajectory shape of the individual checkup result values for each disease-related factor included in the checkup result data of the persons in that group, and comparing the differences in trajectory shape between the groups.