WO2021112593A1 - Non-negative matrix factorization-based metagene generation method and application thereof - Google Patents

Non-negative matrix factorization-based metagene generation method and application thereof

Info

Publication number
WO2021112593A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
data
matrix
disease
patient
Prior art date
Application number
PCT/KR2020/017561
Other languages
English (en)
Korean (ko)
Inventor
고영일
윤홍석
이성영
이찬섭
윤성수
Original Assignee
서울대학교병원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교병원
Publication of WO2021112593A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B10/00: ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to a non-negative matrix factorization (hereinafter referred to as 'NMF')-based metagene generation method and its application and, more particularly, to a method and a device for generating disease-related metagenes using NMF and domain knowledge, and to a method for predicting non-genetic data of a subject using the same.
  • 'NMF' non-negative matrix factorization
  • this method closely reflects the disease-related relationships of genes based on previously known facts, and therefore has limitations in finding new disease-related genes and cannot reflect the complex correlations of disease-related genes in biological systems.
  • the present inventors have conducted extensive research to consistently describe various biological phenomena and to develop a gene group selection method that can more accurately predict disease-related information compared to conventional genetic markers.
  • the NMF technique and domain knowledge are utilized.
  • a method for generating an extended metagene from a known marker gene was developed through a series of processes, and it was confirmed that the metagene generated by this method has higher predictive power for disease-related information than a known marker gene, thereby completing the present invention.
  • generating each second input matrix (p × q, (p-1) × q, (p-2) × q ... 1 × q) according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each of the second input matrices as missing values, and performing NMF of the rank value r to reduce the dimensions into a gene matrix and a patient matrix;
  • Another object of the present invention is to provide a method comprising: (a) receiving data of patients (q persons) including genetic data and non-genetic data, and genetic data of a subject (t); (b) selecting a disease-related gene whose association with the non-gene data is known; (c) filtering the genetic data of the patients (q persons) with the selected disease-related genes and the gene groups forming a gene network therewith; (d) generating a first input matrix (p × q) for non-negative matrix factorization (NMF) by transforming the filtered genetic data (p genes) and the patients (q persons) into a matrix form; (e) generating each second input matrix according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each of the second input matrices as missing values, and performing NMF of the rank value r to reduce the dimensions into a gene matrix and a patient matrix; (f) a gene (N) combination and rank value (r) showing the lowest error by comparing the restored value for the missing value
  • the present invention provides a method comprising: (a) receiving patient data including genetic data and non-genetic data;
  • generating each second input matrix (p × q, (p-1) × q, (p-2) × q ... 1 × q) according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each of the second input matrices as missing values, and performing NMF of the rank value r to reduce the dimensions into a gene matrix and a patient matrix;
  • (h) provides a method for generating a disease-related metagene, comprising the step of evaluating the performance of the generated predictive model.
  • the present invention provides a method comprising: (a) receiving data of patients (q persons) including genetic data and non-genetic data, and genetic data of a subject (t); (b) selecting a disease-related gene whose association with the non-gene data is known; (c) filtering the genetic data of the patients (q persons) with the selected disease-related genes and the gene groups forming a gene network therewith; (d) generating a first input matrix (p × q) for non-negative matrix factorization (NMF) by transforming the filtered genetic data (p genes) and the patients (q persons) into a matrix form; (e) generating each second input matrix according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each of the second input matrices as missing values, and performing NMF of the rank value r to reduce the dimensions into a gene matrix and a patient matrix; (f) a gene (N) combination and rank value (r) showing the lowest error by comparing
  • the present inventors applied the novel concept of a metagene, that is, a gene group that is highly likely to be directly or indirectly involved in a complex pathology together with a core biomarker gene among the wide range of human genome information, and confirmed that the metagene generated by the method provided by the present invention exhibits significantly improved predictive power for disease-related non-gene data compared to previously reported biomarker genes.
  • Step (a) is a step of securing a database for generating meta-genes with high predictive power of non-gene data based on clinically collected patient genetic data and non-gene data.
  • the 'patient' does not mean only a subject suffering from a specific disease, but may be understood to include a healthy subject (i.e., a control group).
  • the scope of the patient is not particularly limited and may refer to any subject who has provided genetic data and non-genetic data to a medical institution for any reason.
  • the patient may be understood to include a subject from which genetic data and non-genetic data are obtained after death due to a specific disease.
  • the patient may be a patient suffering from a specific disease whose association with a metagene is to be analyzed according to the method of the present invention, a patient suspected of having a specific disease, a patient suspected of having a specific disease but determined to be healthy, a patient suspected of having a specific disease and determined to have it, a patient who has been cured after being diagnosed with a specific disease, a patient who has died after being determined to have a specific disease, or a patient in whom a specific disease has recurred after being cured, but is not limited thereto.
  • the 'disease' refers to an abnormal pathological condition whose association with a metagene is to be established according to the method of the present invention, and its type is not particularly limited.
  • the disease may be, for example, cancer, an immune disease, an inflammatory disease, a viral disease, an infectious disease, a metabolic disease or a neurodegenerative disease.
  • the 'gene data' refers to genome information analyzed from a biological sample provided from a patient, and means that one or more selected from the group consisting of gene expression level, single nucleotide polymorphism and gene mutation is provided as a vector, preferably the gene expression level.
  • the genetic data includes data obtained from a biological sample provided from a patient through gene analysis methods such as whole genome sequencing (WGS), whole exome sequencing (WES), microarray, target sequencing, Sanger sequencing, electrophoresis, next-generation sequencing (NGS), RNA sequencing, and polymerase chain reaction (PCR).
  • WGS whole genome sequencing
  • WES whole exome sequencing
  • PCR polymerase chain reaction
  • the genetic data may be provided from a pre-established database, or may be data analyzed and processed according to a known genetic analysis method in a biological sample obtained from a patient, if necessary.
  • the 'non-gene' data may include disease diagnosis data, disease prognosis data, drug reactivity data, pathology data, biochemical data, or any combination thereof obtained from a patient, preferably disease diagnosis data, disease prognosis data, drug reactivity data, or a combination thereof.
  • the disease diagnosis data includes whether a patient is diagnosed with a specific disease, age, gender, and other clinical information at the time of diagnosis, and preferably may mean whether a specific disease is diagnosed.
  • the disease prognostic data refers to the progress after a patient is diagnosed with a specific disease, and includes mortality, relapse rate, cure rate, good or bad degree of disease course, and the like.
  • the drug reactivity data refers to the degree of drug efficacy in patients with a specific disease receiving a specific drug, and includes the treatment rate, recurrence rate, mortality rate, and degree of good or bad disease course after administration of the drug, the degree of disease progression of the patient at the time of drug administration and at the time of discontinuation, the dose concentration of the drug, and the like.
  • the genetic data and the non-gene data are obtained from the same patient, and it is desirable that the data of any patient for whom only one of the genetic data and the non-gene data is obtained be excluded from step (a) of the present invention.
  • the non-gene data may be converted into numerical data and provided.
  • when the non-gene data is disease diagnosis data of a patient, it may be represented as 1 if there is a history of diagnosis of a specific disease, and 0 if there is no such history.
  • when the non-gene data is the patient's disease prognosis data, the degree of good or bad prognosis of a specific disease may be expressed as a numerical value, such as 10 to -10.
  • when the non-gene data is drug reactivity data, the degree of high or low reactivity to a specific drug may be converted into a numerical value, such as 10 to -10.
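  • The numerical conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `encode_diagnosis` and `scale_score` and the linear rescaling onto the -10 to 10 range are hypothetical; the source only states that values such as 1/0 and 10 to -10 may be used.

```python
def encode_diagnosis(has_history: bool) -> int:
    """Return 1 if the patient has a diagnosis history, else 0."""
    return 1 if has_history else 0


def scale_score(value, lo, hi, out_lo=-10.0, out_hi=10.0):
    """Linearly map a raw score in [lo, hi] onto [out_lo, out_hi].

    The linear mapping is an assumption made for illustration only.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)


# Hypothetical raw prognosis scores measured on a 0..1 scale.
diagnosis_flags = [encode_diagnosis(h) for h in (True, False, True)]
prognosis_values = [scale_score(s, 0.0, 1.0) for s in (0.0, 0.5, 1.0)]
```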
  • step (b) is a step of selecting a disease-related gene with a known association with one or more of the patient's non-genetic data, preferably with the one in which the practitioner is interested.
  • the disease-related gene for which the association with the non-gene data is known can be easily understood as a 'biomarker' gene, generally called a 'diagnostic marker', 'prognostic marker', 'drug responsiveness marker', etc. in the art to which the present invention belongs, or as a gene encoding such a biomarker protein.
  • the disease-related gene may be a gene for which a positive or negative correlation with non-genetic data, for example, a diagnosis of a disease, a prognosis of a disease, or a degree of responsiveness to a drug, is known.
  • for example, the disease-related gene may be a gene whose expression tendency is known to correlate directly with the non-gene data, such that a patient in whom the expression of the gene is increased is more likely to be diagnosed with a specific disease, has a poor prognosis, or shows increased drug responsiveness, or vice versa.
  • the disease-related gene can be selected through a conventionally known database, and the known database is, for example, OMIM (Online Mendelian Inheritance in Man), Genetic Association Database, KEGG DISEASE, PharmGKB, Cancer Gene Census, HuGE Navigator, ClinVar or Leiden Open Variation Database, but is not limited thereto.
  • the disease-related gene may be selected through a known literature search or newly identified through an experiment.
  • step (c) is a step in which domain knowledge is used to select a candidate group of genes that are distinct from the one or more disease-related genes selected in step (b) and that, although no direct correlation with the non-gene data is known for them, are highly likely to be directly or indirectly related to the non-gene data.
  • the gene group forming a network with the disease-related gene may mean a group of genes related to the biological pathway of the disease-related gene within a network including functional links of genes.
  • 'gene network' is a term for a network of intricately connected genes, in which genes are expressed as nodes and connections between genes as edges, and it may mean the group of genes expressed in this way.
  • the gene network defined in the present invention may include, but is not limited to, expression, protein interaction, and transcriptional regulatory networks.
  • the expression network refers to a gene group in which co-expression relationships between genes are identified by large-scale mining of genes showing expression differences in a specific environment or trait using gene expression data.
  • the protein interaction network refers to a protein network exhibiting physical contact with each other, a protein network in which the function of a specific protein directly affects the function expression of another protein, or a group of genes encoding the same.
  • the transcriptional regulatory network is a network described by the relationship between regulators and targets. More specifically, when the expression of a group of proteins participating in a specific metabolic pathway is determined by transcriptional regulators having specificity in common to them, it is a network having a dependency relationship between these transcriptional regulators and their target genes.
  • the gene network is a concept that can be accessed through many papers and patents, and a person skilled in the art can clearly understand the scope and meaning of the gene network in addition to the above-exemplified gene network.
  • the gene group forming a gene network with the disease-related gene in the present invention is not limited to genes that form a direct network with the disease-related gene.
  • the gene group in step (c) defined in the present invention includes not only gene networks in the conventional sense, but also the gene group selected according to a method comprising the following steps:
  • (c1) selecting one or more gene groups selected from the group consisting of a gene group exhibiting the same molecular physiological function as the disease-related gene, and a gene group exhibiting the same association with the non-gene data whose association with the disease-related gene is known;
  • (c2) selecting the gene group selected in step (c1) and a gene group forming a gene network therewith.
  • the 'gene network' defined in step (c2) refers to the gene network in the conventional sense described above.
  • since step (c) of the present invention is a process of primarily selecting genes that are likely to affect the non-gene data by directly or indirectly interacting with a disease-related gene whose association with the non-gene data is known, it may be desirable to sufficiently expand the scope through literature research and analysis using various domain knowledge in addition to conventionally known gene networks.
  • the known gene network can be secured through a pre-established database, and the database may include, for example, HPRD, BioGrid, IntAct, MINT, DIP, iRefWeb data, pathway map, MsigDB, etc., but is not limited thereto.
  • the 'filtering' means that, among the genetic data of the patient received in step (a), only the data on the selected disease-related genes and the gene group forming a gene network with them is used in the subsequent procedure, and the remaining gene data is not used in subsequent procedures.
  • in the present invention, the genes are filtered after the gene data is received in step (a); alternatively, the order may be changed so that the disease-related genes and the gene group forming the gene network are first selected through steps (b) and (c), and only the patient's genetic data for the selected gene group is then received.
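  • The selection and filtering of steps (b) and (c) can be pictured as a neighbourhood search on a gene network followed by a row filter. The sketch below is illustrative only: the helper names `network_gene_group` and `filter_genes`, the `max_hops` parameter, and the toy edge list and expression values are assumptions, and a real workflow would draw edges from a database such as those listed above.

```python
from collections import deque


def network_gene_group(seeds, edges, max_hops=1):
    """Return the seed genes plus all genes within max_hops edges of a seed."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    selected = set(seeds)
    frontier = deque((gene, 0) for gene in seeds)
    while frontier:
        gene, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(gene, ()):
            if neighbour not in selected:
                selected.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return selected


def filter_genes(expression, keep):
    """Keep only rows (gene -> per-patient values) whose gene is in keep."""
    return {g: v for g, v in expression.items() if g in keep}


# Toy data, for illustration only.
edges = [("TP53", "MDM2"), ("MDM2", "MDM4"), ("BRCA1", "BARD1")]
expression = {"TP53": [5.1, 3.2], "MDM2": [1.0, 2.5],
              "MDM4": [0.7, 0.9], "BRCA1": [2.2, 2.0]}
group_1hop = network_gene_group(["TP53"], edges, max_hops=1)
group_2hop = network_gene_group(["TP53"], edges, max_hops=2)
filtered = filter_genes(expression, group_1hop)
```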
  • step (d) is a step of generating the original values of the first input matrix of p × q using the data on the genes (p genes) of the patients (q persons) filtered in step (c).
  • each element in the first input matrix of p × q is a vectorized value of the patient's gene data, and preferably may be the expression level of each filtered gene.
  • the vectorized value of the gene data refers to the expression level of the gene in the case of gene expression level, which can be expressed as a quantitative value; for gene data that is not expressed as a quantitative value, for example, gene mutation or single nucleotide polymorphism (SNP), it refers to a value obtained by dividing the data into expressed or non-expressed and representing it as 0 or 1.
  • arbitrary values in each of the second input matrices are treated as missing values, and NMF of the rank value r is performed to reduce the dimensions into a gene matrix and a patient matrix;
  • steps (e) and (f) are steps of reducing the dimension by performing NMF on the input matrix (p × q) secured in step (d).
  • in steps (e) and (f) of the present invention, the input matrix secured in step (d) is optimized to obtain the gene combination that can best be distinguished by a common characteristic among the p genes in the input matrix, together with its NMF data.
  • through steps (e) and (f), a final gene group and an optimal NMF rank value r for classifying metagenes can be obtained.
  • a second input matrix is generated based on the first input matrix (p × q) generated in step (d).
  • the second input matrix is a matrix that includes the same columns (patients, q) as the first input matrix and whose rows are composed of a combination of 1 to p of the genes included in the p rows of the first input matrix.
  • (p-1) refers to the gene combinations in all cases in which one arbitrary gene is excluded from the p genes, (p-2) refers to the gene combinations in all cases in which two arbitrary genes are excluded from the p genes, (p-3) refers to the gene combinations in all cases in which three arbitrary genes are excluded from the p genes, and 1 means each individual gene included in the p genes.
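  • The enumeration of second input matrices over all combinations of 1 to p genes can be sketched with the standard library. The names `gene_subsets` and `second_input_matrices` and the toy matrix are hypothetical. Note that the number of combinations is 2^p - 1, which grows quickly with p; this is presumably why the document also describes a greedy alternative further on.

```python
from itertools import combinations


def gene_subsets(genes):
    """Yield every non-empty combination of genes, largest combinations first."""
    for size in range(len(genes), 0, -1):
        for combo in combinations(genes, size):
            yield combo


def second_input_matrices(first_matrix, genes):
    """Yield (combo, rows) where rows are the first-matrix rows of that combo.

    first_matrix maps gene name -> list of q patient values, so every
    second input matrix keeps the same q columns as the first one.
    """
    for combo in gene_subsets(genes):
        yield combo, [first_matrix[g] for g in combo]


# Toy first input matrix with p = 3 genes and q = 2 patients.
first_matrix = {"g1": [1.0, 2.0], "g2": [0.5, 0.1], "g3": [3.0, 4.0]}
subsets = list(gene_subsets(list(first_matrix)))
matrices = list(second_input_matrices(first_matrix, list(first_matrix)))
```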
  • a random value in each second input matrix is treated as a missing value.
  • the number of values to be treated as missing values is not particularly limited.
  • the genetic data treated as a missing value in step (e) may be any genetic data, but it is preferable to treat as missing values the data of the disease-related genes selected in step (b), whose association with the non-gene data is known.
  • NMF is performed on each second input matrix in which arbitrary values are treated as missing values, and dimensionality reduction is performed into the genetic matrix and the patient matrix.
  • a rank value applied when performing NMF is 2 to r.
  • the rank value may be, for example, 10, 9, 8, 7, 6, 5, 4, 3 or 2, preferably 7, 6, 5, 4, 3, or 2 and more preferably 6, 5, 4, 3 or 2, and most preferably 5, 4 or 3.
  • the gene matrix means each of the p × r, (p-1) × r, (p-2) × r ... 1 × r matrices generated by performing NMF on the second input matrix.
  • the patient matrix means each r × q matrix generated by performing NMF on the second input matrix.
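  • NMF with values held out as missing can be sketched as below. The source does not state which NMF algorithm is used; this example assumes a mask-weighted variant of the multiplicative update rules, in which held-out entries do not contribute to the fit and are then read back from the product W × H. The function name `masked_nmf`, its parameters, and the toy data are hypothetical.

```python
import numpy as np


def masked_nmf(V, mask, r, n_iter=500, seed=0, eps=1e-9):
    """Factor V (p x q) into W (p x r) and H (r x q), ignoring masked entries.

    mask is a boolean array: True = observed, False = treated as missing.
    """
    rng = np.random.default_rng(seed)
    p, q = V.shape
    W = rng.random((p, r)) + eps
    H = rng.random((r, q)) + eps
    Vm = np.where(mask, V, 0.0)  # missing entries contribute nothing
    for _ in range(n_iter):
        WH = (W @ H) * mask
        H *= (W.T @ Vm) / (W.T @ WH + eps)
        WH = (W @ H) * mask
        W *= (Vm @ H.T) / (WH @ H.T + eps)
    return W, H


# Toy rank-2 matrix: 6 genes x 8 patients, two entries held out.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
mask = np.ones_like(V, dtype=bool)
mask[0, 3] = mask[4, 6] = False
W, H = masked_nmf(V, mask, r=2)
restored = (W @ H)[~mask]          # restored values for the missing entries
held_out_mae = float(np.mean(np.abs(restored - V[~mask])))
```

  • Comparing `restored` with the original values at the held-out positions gives the restoration error that the following steps use to rank gene combinations and rank values.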
  • the error is quantified by a method selected from the group consisting of AE (average error), MAE (mean absolute error), MAPE (mean absolute percentage error), MSE (mean squared error) and RMSE (root MSE), and the performance of each second input matrix is evaluated accordingly.
  • AE Average Error
  • MAE Mean absolute error
  • MAPE Mean absolute percentage error
  • MSE Mean squared error
  • RMSE root MSE
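  • The listed error measures have standard definitions, sketched below for a list of original values and the corresponding restored values. AE is left out because the source does not define it. The function names are illustrative, and `mape` assumes that no original value is zero.

```python
import math


def mae(true, pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def mse(true, pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def rmse(true, pred):
    """Root mean squared error."""
    return math.sqrt(mse(true, pred))


def mape(true, pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(true, pred)) / len(true)


# Toy held-out values and their restored counterparts.
original = [2.0, 4.0]
restored_values = [1.0, 5.0]
errors = {
    "MAE": mae(original, restored_values),
    "MSE": mse(original, restored_values),
    "RMSE": rmse(original, restored_values),
    "MAPE": mape(original, restored_values),
}
```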
  • the performance of the NMF is evaluated for each second input matrix and rank value r, the combination of N genes and the rank value (r) showing the lowest error (that is, the best restoration performance) are selected, and the corresponding NMF data (N × r, r × q) are calculated.
  • alternatively, the performance may be evaluated while NMF is performed sequentially from p genes down to one-gene combinations, and a combination of genes (N) and a rank value (r) exhibiting optimal performance may be selected, and the NMF data thereof calculated, according to a greedy method that removes the gene showing the worst performance (that is, the highest error) one by one.
  • alternatively, excluding the k disease-related genes selected in step (b), whose association with non-gene data is known, from the p selected genes, the performance is evaluated while NMF is performed sequentially from (p-k) genes down to one-gene combinations, and the gene (N) combination and rank value (r) showing the best performance can be selected, and the NMF data thereof calculated, according to the greedy method that removes the genes with the worst performance (that is, the highest error) one by one.
  • in this case, the possibility that a disease-related gene selected in step (b), whose association with the non-gene data is known, is eliminated in the NMF performance evaluation process is excluded.
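  • The greedy procedure can be sketched as a backward elimination loop. This is one reading of the text, not a confirmed implementation: at each round, the candidate gene whose removal gives the lowest error is dropped, protected (known disease-related) genes are never removed, and the best combination seen is returned. `error_of` is a hypothetical callable that, in practice, would run NMF with held-out values on the given genes and return the restoration error; here it is replaced by a toy scoring function.

```python
def greedy_backward_selection(genes, protected, error_of):
    """Remove one gene per round, never touching protected genes.

    Returns the gene list with the lowest error seen and that error.
    """
    current = list(genes)
    best_genes, best_err = list(current), error_of(current)
    while len(current) > 1:
        candidates = [g for g in current if g not in protected]
        if not candidates:
            break
        trials = []
        for gene in candidates:
            remaining = [x for x in current if x != gene]
            trials.append((error_of(remaining), gene, remaining))
        err, _, current = min(trials, key=lambda t: t[0])
        if err < best_err:
            best_genes, best_err = list(current), err
    return best_genes, best_err


# Toy error: per-gene "noise" plus a penalty for using very few genes.
noise = {"A": 0.0, "B": 0.0, "C": 0.5, "D": 0.3}


def toy_error(gene_list):
    return sum(noise[g] for g in gene_list) + 1.0 / len(gene_list)


selected_genes, selected_error = greedy_backward_selection(
    ["A", "B", "C", "D"], protected={"A"}, error_of=toy_error)
```

  • With p genes this evaluates on the order of p² combinations instead of 2^p - 1, at the cost of possibly missing the globally best combination.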
  • the 'metagene' as defined in the present invention means each column of the N × r matrix in the NMF data calculated in step (f), and the value (or expression value) of the metagenes for each of the q patients corresponds to the r × q matrix values in the calculated NMF data. That is, in the calculated NMF data, the columns of the N × r matrix can be expressed as metagene 1, metagene 2, metagene 3 ... metagene r, and the value (or expression value) of metagene 1 to metagene r for each patient corresponds to the r × q matrix values.
  • for example, when 30 genes (N) and a rank value (r) of 5 are selected in step (f), and the number of patients providing quantified drug reactivity data is 100, a 30 × 5 matrix (W matrix) and a 5 × 100 matrix (H matrix) are calculated as NMF data in step (f).
  • W matrix: the 30 × 5 matrix
  • H matrix: the 5 × 100 matrix
  • Each column in the W matrix is defined as one of metagenes 1 to 5, and the values (or expression values) of metagenes 1 to 5 of each of the 100 patients correspond to the values of the H matrix.
  • for example, the value (or expression value) of metagene 3 of patient 5 out of the 100 patients is the value in row 3, column 5 of the H matrix.
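  • The indexing in this example can be made concrete with placeholder matrices of the stated shapes. The values below are dummies chosen so that each entry of H encodes its own position; `metagene_value` is a hypothetical helper that converts the 1-based metagene and patient numbers used in the text into 0-based list indices.

```python
N, r, q = 30, 5, 100

# Placeholder W (N x r) and H (r x q); H[k][j] encodes (metagene k+1, patient j+1).
W = [[0.0] * r for _ in range(N)]
H = [[(k + 1) * 1000 + (j + 1) for j in range(q)] for k in range(r)]


def metagene_value(H, metagene, patient):
    """Value of a metagene for a patient, both given as 1-based numbers."""
    return H[metagene - 1][patient - 1]


def metagene_column(W, metagene):
    """Gene weights (length N) that define one metagene, 1-based."""
    return [row[metagene - 1] for row in W]


value = metagene_value(H, metagene=3, patient=5)  # row 3, column 5 of H
```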
  • Step (g) is a step of generating a predictive model using the r × q matrix from the calculated NMF data and the non-gene data of the patients (q persons). That is, it generates a predictive model capable of explaining the relationship between the values (or expression values) of metagenes 1 to r of each patient in the r × q matrix and the patient's non-gene data.
  • a predictive model can be generated by machine learning in which the values (or expression values) of metagenes 1 to r of each patient in the r × q matrix are the input data for learning and the non-gene data of the patients are the output data for learning.
  • the "prediction model" refers to a non-genetic data prediction model of a patient. More specifically, it means an input/output function that, by analyzing the correlation between the input data for learning and the output data for learning, predicts the patient's non-gene data from the values (or expression values) of metagenes 1 to r of each patient in the r × q matrix.
  • in addition, through machine learning in which the values (or expression values) of metagenes 1 to r of each patient in the r × q matrix are used as input data and the non-gene data of the patients are used as output data, the importance of each of metagenes 1 to r can be evaluated.
  • the "importance” may be understood as “accuracy” or “contribution” as a predictor variable for predicting non-genetic data of a patient. Specifically, among the metagenes 1 to r, the metagene with the lowest importance rank in predicting the patient's non-gene data is sequentially excluded, and the patient's non-gene data is predicted by using the remaining metagenes as input data for learning.
  • each predictive model generated in this way is evaluated at a later stage, so that a predictive model that can most accurately predict the patient's non-genetic data, that is, exhibit the maximum performance, can be selected.
  • the predictive model of step (g) may be any one or more machine learning models selected from the group consisting of a logistic regression algorithm, a deep learning algorithm, a decision tree algorithm, a random forest algorithm, a naive Bayes algorithm, a support vector machine algorithm, a K-Nearest Neighbor algorithm, a Gradient Boosting Machine algorithm, a Neural Network algorithm, and an extra tree algorithm.
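  • As an illustration of step (g), the sketch below fits a logistic regression, one of the listed model families, on metagene values. The columns of the r × q matrix H become per-patient feature vectors and the non-gene data become binary labels. The plain gradient-descent training loop, the function names, and the toy numbers are assumptions made for the example; an established machine learning library would normally be used instead.

```python
import math


def train_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad_w[j] += (p - yi) * xi[j] / n
            grad_b += (p - yi) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Predict 1 when the linear score is positive, else 0."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]


# Toy H with r = 2 metagenes and q = 8 patients; labels are diagnosis flags.
H = [[0.10, 0.20, 0.15, 0.05, 0.90, 0.80, 0.95, 0.85],
     [0.50, 0.40, 0.60, 0.50, 0.50, 0.60, 0.40, 0.50]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [list(col) for col in zip(*H)]   # one row of r metagene values per patient
w, b = train_logistic(X, y)
predictions = predict(X, w, b)
```

  • The relative size of the fitted weights gives a rough indication of how much each metagene contributes to the prediction, which relates to the importance evaluation described above.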
  • the step (h) of the present invention is a step of evaluating the prediction accuracy and the prediction precision of the prediction model generated in the step (g).
  • the method used to evaluate the performance of the predictive model in step (h) of the present invention is not particularly limited, and statistical or computational methods commonly used to confirm the correlation between the independent variable (x) and the dependent variable (y) can be used.
  • correlation analysis, regression analysis, etc. can be used when the non-gene data corresponding to the dependent variable is of a continuous type, and a t-test, chi-square test, logistic regression analysis, etc. can be used when the non-gene data is of a discrete type.
  • a K-nearest neighbor algorithm, decision tree, etc. can also be used.
  • the present invention is not limited thereto.
  • the performance of the predictive model may be assessed by one or more results selected from the group consisting of the area under the ROC curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR) and F1 score.
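  • Most of the listed measures follow directly from the confusion-matrix counts of a binary prediction, as sketched below with their standard definitions. AUC is omitted because it requires ranked scores rather than hard labels. The function names and the toy labels are illustrative, and the sketch assumes that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Standard metrics; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "FPR": fp / (fp + tn),
        "FDR": fp / (fp + tp),
        "BA": (sensitivity + specificity) / 2,
        "F1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Toy true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```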
  • the present invention also provides a computer-readable recording medium in which a program for performing the method including each of the above steps is recorded.
  • the present invention also provides an apparatus for generating disease-related metagenes in which each of the above-described steps is driven by a series of processors.
  • the apparatus includes: (a) a data receiving unit for receiving patient data including genetic data and non-genetic data;
  • a filtering unit for filtering the patient's genetic data with the selected disease-related genes and a gene group forming a gene network with the selected disease-related genes;
  • an input matrix generator for generating a first input matrix (p × q) for non-negative matrix factorization (NMF) by converting the filtered genetic data (p genes) and the patients (q persons) into a matrix form;
  • an NMF operation unit that generates each second input matrix (p × q, (p-1) × q, (p-2) × q ... 1 × q) according to all combinations of 1 to p genes from the first input matrix (p × q), treats arbitrary values in each of the second input matrices as missing values, and performs NMF of a rank value r to reduce the dimensions into a gene matrix and a patient matrix;
  • a predictive model generator for generating a predictive model by using the r × q matrix from the calculated NMF data and the non-gene data of the patients (q persons);
  • the data receiving unit, the input unit, the filtering unit, the input matrix generation unit, the NMF calculation unit, the output unit, the metagene selection unit and the verification unit are given separate and independent names only according to their respective functions, and may be implemented with one processor.
  • each of the data receiving unit, the input unit, the filtering unit, the input matrix generation unit, the NMF calculation unit, the output unit, the predictive model generation unit, and the verification unit may correspond to one or more processing modules in the processor.
  • the data receiving unit, the input unit, the filtering unit, the input matrix generation unit, the NMF calculation unit, the output unit, the predictive model generation unit, and the verification unit may correspond to separate software algorithm units divided according to their respective functions. That is, the implementation form of these units in the processor is not limited to any one form.
  • the filtering unit of the device provided by the present invention includes a database storage unit for storing a gene network database; and a search unit for searching domain knowledge related to the disease-related gene.
  • the database storage unit may include a plurality of processors for providing gene network information, such as a gene information processor, a protein interaction processor, a signal transduction pathway processor, and the like.
  • the database storage unit stores gene information and gene network information related to a biological pathway of the disease-related gene in a network including a functional link of the disease-related gene.
  • the database storage unit searches for and provides information about a gene network sharing a biological path directly or indirectly with the disease-related gene.
  • the search unit searches not only the gene network information stored in the database storage unit, but also the established online database to search for network information sharing a biological path directly or indirectly with the disease-related gene.
  • the pre-established online database may include, for example, HPRD, BioGrid, IntAct, MINT, DIP, iRefWeb data, pathway map, MsigDB, and the like.
  • the search unit may search articles, patents, reports, etc. in addition to the established online database to find additional genes that are known to be related to the disease-related genes but are not reflected in the online database. More specifically, the search unit searches domain knowledge and may find and provide information on: a gene group exhibiting the same molecular physiological function as the disease-related gene; a gene group known to have protein-protein interactions with the disease-related gene; and a gene group showing the same association with the non-gene data whose association with the disease-related gene is known.
  • the (c) filtering unit may further include a gene network selection unit that, according to the searched domain knowledge, selects one or more gene groups as the gene group forming a gene network with the disease-related gene, the gene groups being selected from the group consisting of: a gene group exhibiting the same molecular physiological function as the disease-related gene; a gene group known to have protein-protein interactions with the disease-related gene; and a gene group showing the same association with the non-gene data whose association with the disease-related gene is known.
  • the (c) filtering unit may further include a constraint input unit for limiting the disease-related gene and the gene group forming the gene network according to the restriction condition set by the user.
  • in the constraint input unit, for example, the gene group exhibiting the same molecular physiological function as the disease-related gene, the gene group having protein-protein interactions with the disease-related gene, and the gene group associated with non-gene data known to be related to the disease-related gene, as provided by the database storage unit and the search unit, can be set to be limited to a certain range or group.
  • the present invention also provides a method comprising: (a) receiving data of patients (q persons), including genetic data and non-genetic data, and genetic data of subjects (t persons); (b) selecting a disease-related gene whose association with the non-gene data is known; (c) filtering the genetic data of the patients (q persons) with the selected disease-related gene and the gene groups forming a gene network therewith; (d) generating a non-negative matrix factorization (NMF) first input matrix (p × q) by transforming the filtered genetic data (p genes) and the patients (q persons) into a matrix form; (e) generating each second input matrix according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each second input matrix as missing values, and performing NMF with rank value r to reduce the dimensions into a gene matrix and a patient matrix; (f) a gene (N) combination and rank value (r) showing the lowest error by comparing the restored value for the missing value generated by
  • Each step of (a) to (j) in the method of the present invention may refer to the previous description of the method for generating a disease-related metagene.
  • the subject means any patient for whom only genetic data is provided and non-genetic data is not provided. More specifically, the subject includes a patient who wants to be diagnosed for a specific disease, a patient diagnosed with a specific disease whose prognosis is to be predicted, and a patient diagnosed with a specific disease for whom an appropriate therapeutic drug is to be selected.
  • the rank value r in step (h) is the same as the rank value r selected in step (f).
  • in step (j), the non-gene data of the subjects (t persons) are output by using the r × t matrix generated by NMF in step (h) as an input value of the predictive model.
  • for example, if the predictive model is a model for predicting reactivity (non-gene data) to a specific drug, using the values of metagenes 1 to r of each subject in the r × t matrix as input values of the predictive model makes it possible to predict the reactivity of the subjects (t persons) to the specific drug.
  • a step of evaluating the performance of the predictive model may be further performed after step (i).
  • the method of predicting disease-related non-genetic data of a subject may be performed for the purpose of providing information necessary for predicting disease-related non-genetic data of a subject.
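The text does not state how the NMF for new subjects is carried out beyond reusing the rank value r. One plausible reading, offered only as a sketch, keeps the gene matrix W learned from the patients fixed and solves for a non-negative r × t subject matrix, which is then passed to an already-fitted model. The function name `project_onto_metagenes`, the multiplicative update rule, the toy matrices, and the linear coefficients in `coef` are all assumptions made for illustration.

```python
import numpy as np

def project_onto_metagenes(V_new, W, n_iter=300, seed=0, eps=1e-9):
    """Solve V_new ~ W @ H for H >= 0 while W stays fixed (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_new.shape[1])) + eps
    WtV, WtW = W.T @ V_new, W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H

# Hypothetical toy data: 6 genes, r = 2 metagenes, t = 4 subjects.
rng = np.random.default_rng(0)
W = rng.random((6, 2))              # gene matrix assumed to come from the patient NMF
V_new = W @ rng.random((2, 4))      # subjects' expression, genes x subjects

H_subjects = project_onto_metagenes(V_new, W)   # r x t metagene weights

# Hypothetical fitted linear model: intercept followed by one weight per metagene.
coef = np.array([0.3, -1.2, 0.8])
predicted = coef[0] + coef[1:] @ H_subjects      # one prediction per subject
```

Holding W fixed keeps the meaning of each metagene identical between patients and subjects, which is what allows the model trained on the r × q matrix to be reused.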
  • the present invention also provides a method comprising: (a) receiving data of patients (q persons), including genetic data and non-genetic data, and genetic data of subjects (t persons); (b) selecting a disease-related gene whose association with the non-gene data is known; (c) filtering the genetic data of the patients (q persons) with the selected disease-related gene and the gene groups forming a gene network therewith; (d) generating a non-negative matrix factorization (NMF) first input matrix (p × q) by transforming the filtered genetic data (p genes) and the patients (q persons) into a matrix form; (e) generating each second input matrix according to all combinations of 1 to p genes from the first input matrix (p × q), treating arbitrary values in each second input matrix as missing values, and performing NMF with rank value r to reduce the dimensions into a gene matrix and a patient matrix; (f) a gene (N) combination and rank value (r) showing the lowest error by comparing the restored value for the missing value generated by
  • a method of predicting non-gene data of an unknown subject through a metagene generated according to the method of the present invention and a predictive model using the same can provide improved predictive power compared with non-genetic data prediction methods that use biomarkers selected by conventional methods, such as a single biomarker gene or a biomarker gene group.
  • the metagene generated according to the method of the present invention using domain knowledge and NMF technique can provide accurate information related to disease diagnosis, prognosis prediction, drug reactivity prediction, etc., and thus has a very high potential for use as a new biomarker.
  • FIG. 1 is a result showing the mean-square error between a restored value of a matrix for a missing value and a corresponding original value as a result of performing NMF of the rank value r according to a gene combination selected according to domain knowledge.
  • FIG. 2 shows the cell line matrix (A) and the gene matrix (B) of the NMF result having the gene combination and rank value selected as having the lowest error through the matrix optimization process.
  • FIG. 3 shows the correlation between drug reactivity (IC50) to Venetoclax and the gene expression level of BCL2 (A) or the weight of metagene 2 (B) selected in the Examples of the present invention.
  • FIG. 6 shows the error (mean absolute percentage error) between the restored value of the matrix for the missing value and the corresponding original value as a result of performing NMF with rank value 3 according to the gene combination selected according to domain knowledge.
  • Example 1 Generation of a metagene using an AML cell line and verification of its usefulness
  • Drug responsiveness information ie, non-genetic data
  • the drug reactivity to Venetoclax of each obtained AML cell line is shown in FIG. 2 .
  • the total gene expression level (ie, genetic data) of each AML cell line in which reactivity to Venetoclax was secured was obtained according to the following method:
  • Venetoclax is a BCL2 selective inhibitor.
  • the gene group forming a gene network with BCL2 was selected as follows:
  • BCL2 family: The key genes in the intrinsic apoptosis process are the BCL2 family, of which a total of 15 genes are known for their pro-apoptotic and anti-apoptotic functions (Cell Death Differ. 2018 Jan;25(1):56-64.). These 15 BCL2 family genes were selected first, and the list of selected BCL2 family genes is as follows:
  • genes related to intrinsic apoptosis were collected from MSigDB, a public database, to select genes limited to intrinsic apoptosis, like BCL2, as follows.
  • (Some BCL2 family genes are transcriptionally regulated through the TNFs/NF-kB pathway, so the related genes were selected): 38
  • the gene data of the AML cell lines were filtered with the selected gene group, and a first input matrix (p × q) of the usable genes among the selected genes (391) × the AML cell lines (21) was generated.
  • the usable genes, excluding genes expressed in fewer than 90% of all samples, were quantile-normalized to unify the scale of the genes and then used as the values of the first input matrix.
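The preprocessing described above can be sketched as follows. This is a minimal illustration, not the applicants' code: the function name `filter_and_quantile_normalize`, the rule that a gene counts as "expressed" when its value is greater than zero, and the toy matrix are all assumptions.

```python
import numpy as np

def filter_and_quantile_normalize(expr, min_fraction=0.9):
    """expr: genes x samples matrix of non-negative expression values."""
    # Assumption: a gene is "expressed" in a sample when its value is > 0.
    expressed = (expr > 0).mean(axis=1) >= min_fraction
    kept = expr[expressed]
    # Rank of every gene within its own sample (ties get distinct ranks here).
    ranks = np.argsort(np.argsort(kept, axis=0), axis=0)
    # Reference distribution: mean of the sorted columns, one value per rank.
    reference = np.sort(kept, axis=0).mean(axis=1)
    return reference[ranks], expressed

expr = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
    [0.0, 0.0, 7.0],   # expressed in only 1 of 3 samples, so it is dropped
])
normalized, expressed = filter_and_quantile_normalize(expr)
```

After normalization every sample shares the same set of values, so no single sample dominates the factorization because of its overall scale, and the matrix stays non-negative as NMF requires.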
  • missing value target genes were selected as follows.
  • BCL2 which is the target of Venetoclax drug
  • NMF is performed with rank (r) on the missing value-processed matrix to generate the result matrices W ((p-1) × r) and H (r × 21)
  • the second input matrix is restored by multiplying the generated result matrices (W matrix, H matrix)
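The missing-value procedure above can be sketched in a few lines. The example hides some entries of a target gene row, fits an NMF on the remaining entries only, restores the matrix as W × H, and compares restored and original values at the hidden positions for several rank values. The weighted multiplicative update rule, the function names `masked_nmf` and `held_out_mse`, and the synthetic matrix are assumptions; the text does not say which NMF solver was used.

```python
import numpy as np

def masked_nmf(V, mask, rank, n_iter=500, seed=0, eps=1e-9):
    """Fit V ~ W @ H using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    M = mask.astype(float)
    Vm = M * V  # hidden entries contribute nothing to the fit
    for _ in range(n_iter):
        H *= (W.T @ Vm) / (W.T @ (M * (W @ H)) + eps)
        W *= (Vm @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H

def held_out_mse(V, mask, W, H):
    """Mean squared error between restored and original values at hidden entries."""
    hidden = ~mask
    return float(np.mean((V[hidden] - (W @ H)[hidden]) ** 2))

# Synthetic non-negative matrix: 8 genes x 21 cell lines (toy values only).
rng = np.random.default_rng(1)
V = rng.random((8, 3)) @ rng.random((3, 21))

mask = np.ones(V.shape, dtype=bool)
mask[0, ::4] = False   # treat 6 values of the target gene (row 0) as missing

errors = {}
for r in (2, 3, 4):
    W, H = masked_nmf(V, mask, rank=r)
    errors[r] = held_out_mse(V, mask, W, H)
best_rank = min(errors, key=errors.get)   # rank with the lowest restoration error
```

Repeating the same loop over candidate gene combinations as well as rank values would give the joint selection the text describes.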
  • a gene matrix (W) and a cell line matrix (H) having a combination of 64 genes and a rank value (5) were output through the gene group selection of (1) and the matrix optimization process of (2).
  • in the gene matrix (W), the columns are designated as metagenes 1 to 5 from the left. Metagene 2 is the metagene in which BCL2, known to correlate positively with Venetoclax drug reactivity, has the highest weight, whereas metagenes 1, 3, 4, and 5 are those in which BCL2L2, BCL2L1, BCL2A1, or MCL1, known to correlate negatively with Venetoclax drug reactivity, has the highest weight. Metagene 2 was selected first, and its usefulness as a biomarker was verified below.
  • the reactivity to Venetoclax of each AML cell line was listed in order of IC50. A positive correlation was confirmed between responsiveness to Venetoclax and the metagene 2 weight of each AML cell line in the H matrix of the figure. Specifically, in the H matrix, cell lines with good reactivity to Venetoclax have metagene 2 as their highest-weighted metagene among metagenes 1 to 5, whereas cell lines with poor reactivity to Venetoclax have a low metagene 2 weight.
  • the metagene 2 (BCL2-related metagene) selected according to the method of the present invention shows a higher correlation with Venetoclax drug reactivity than the expression level of BCL2 alone.
  • the selected metagene 2 (BCL2-related) had a strong negative association with the IC50 of Venetoclax in each cell line (a positive association with reactivity). That is, the higher the weight of metagene 2, the lower the IC50 value of Venetoclax and therefore the higher the drug reactivity.
  • metagenes 1, 4 and 5 were found to have a positive association with IC50 (a negative association with reactivity). That is, the higher the weight of metagene 1, 4, or 5, the higher the IC50 value of Venetoclax and therefore the lower the drug reactivity.
  • metagene 3 was excluded from the biomarker for predicting Venetoclax drug reactivity because it did not show either a positive or a negative association with the IC50 of Venetoclax.
  • metagenes 1,2,4,5 were selected as biomarkers for predicting Venetoclax reactivity.
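The selection logic above, keeping metagenes whose weights track IC50 in either direction and dropping those that do not, can be illustrated as follows. The use of the Pearson coefficient, the cutoff `min_abs_r`, the function name `correlate_metagenes`, and the toy values are assumptions; the text does not specify the association measure or any threshold.

```python
import numpy as np

def correlate_metagenes(H, ic50, min_abs_r=0.3):
    """Pearson r between each metagene row of H and IC50, plus indices passing the cutoff."""
    r = np.array([np.corrcoef(row, ic50)[0, 1] for row in H])
    selected = [i for i, value in enumerate(r) if abs(value) >= min_abs_r]
    return r, selected

# Toy data: 10 cell lines with increasing IC50 (lower IC50 means better response).
ic50 = np.arange(1.0, 11.0)
H = np.vstack([
    11.0 - ic50,              # weight falls as IC50 rises: negative association
    ic50 ** 2,                # weight rises with IC50: positive association
    np.tile([1.0, 2.0], 5),   # weight unrelated to IC50: weak association
])
r, selected = correlate_metagenes(H, ic50)
```

Here the first two toy metagenes pass the cutoff and the third is dropped, mirroring how metagene 3 was excluded in the example.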
  • the predictive power of Venetoclax drug reactivity (y value) of the selected metagenes 1,2,4,5 (x value) was evaluated using a linear regression model.
  • Mean square error (MSE) obtained by 100-repeated 5-fold cross-validation was used as an indicator of predictive power evaluation, and the details are as follows:
  • the genetic data of 21 cell lines in which the above genetic data were secured were randomly divided into 5 folds. After training the model with 4 folds, the MSE value is obtained by evaluating the model with the remaining 1 fold. After repeating this for each fold, the average of 5 MSEs obtained is called 5-fold cross-validation MSE (CV-MSE). The average of 100 CV-MSEs obtained by repeating this process 100 times was used for model evaluation.
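The evaluation loop described above can be sketched as follows. This is an illustration under assumptions: ordinary least squares with an intercept stands in for the linear regression model, and the function name `repeated_kfold_mse` and the synthetic data (21 samples, 4 metagene features) are hypothetical.

```python
import numpy as np

def repeated_kfold_mse(X, y, n_splits=5, n_repeats=100, seed=0):
    """Average of n_repeats CV-MSE values, each the mean MSE over n_splits folds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add an intercept column
    cv_mses = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), n_splits)
        fold_mses = []
        for k in range(n_splits):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
            coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
            fold_mses.append(np.mean((y[test] - Xb[test] @ coef) ** 2))
        cv_mses.append(np.mean(fold_mses))   # one CV-MSE per repetition
    return float(np.mean(cv_mses))

# Synthetic stand-in for metagene weights (x values) and IC50 (y values).
rng = np.random.default_rng(42)
X = rng.random((21, 4))
true_w = np.array([-2.0, 1.0, 0.5, 1.5])
y = X @ true_w + 0.1 * rng.standard_normal(21)

score = repeated_kfold_mse(X, y)
```

With only 21 samples a single 5-fold split is noisy, which is presumably why the split is repeated 100 times and averaged.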
  • CV-MSE: 5-fold cross-validation MSE.
  • the brown and yellow bars represent the results of training the IC50 (y value) prediction model using BCL2 or BCL2 family gene expression information (x value), respectively.
  • the orange bar represents the result of training the IC50 (y value) prediction model using metagenes (x value) discovered by performing NMF after randomly extracting, from all gene information before filtering, the same number of genes as the optimized gene set.
  • the purple bar represents the result of training the IC50 (y value) prediction model using metagenes (x value) discovered through NMF from all gene information before filtering.
  • the green bar represents the result of training the IC50 (y value) prediction model using metagenes (x value) discovered through NMF from the genetic data reduced by domain knowledge (apoptosis genes).
  • the blue bar represents the result of training the IC50 (y value) prediction model using all metagenes (metagenes 1 to 5) (x value) discovered from the gene information that was reduced to gene network data using domain knowledge and then matrix-optimized.
  • the red bar represents the result of training the IC50 (y value) prediction model using the finally selected metagenes 1, 2, 4, and 5 (x value).
  • Example 2 Generation of a metagene using patient data and verification of its usefulness
  • Venetoclax is a BCL2 selective inhibitor.
  • the gene group forming a gene network with BCL2 was selected as follows:
  • BCL2 family: The key genes in the intrinsic apoptosis process are the BCL2 family, of which a total of 15 genes are known for their pro-apoptotic and anti-apoptotic functions (Cell Death Differ. 2018 Jan;25(1):56-64.). These 15 BCL2 family genes were selected first, and the list of selected BCL2 family genes is as follows:
  • genes related to intrinsic apoptosis were collected from MSigDB, a public database, to select genes limited to intrinsic apoptosis, like BCL2, as follows.
  • (Some BCL2 family genes are transcriptionally regulated through the TNFs/NF-kB pathway, so the related genes were selected): 38
  • the genetic data of the AML patients were filtered with the selected gene group, and a first input matrix (p × q) of the usable genes among the selected genes (228) × the AML patients (451 cases) was generated.
  • a second input matrix was generated according to a method including the following steps, and then arbitrary values were treated as missing values and NMF was performed to select the NMF data representing the optimal gene combination.
  • missing value target genes were selected as follows.
  • BCL2 which is the target of Venetoclax drug
  • NMF is performed with rank (r) on the missing value-processed matrix to generate the result matrices W ((p-1) × r) and H (r × 451)
  • the second input matrix is restored by multiplying the generated result matrices (W matrix, H matrix)
  • a gene matrix (W) and a patient matrix (H) having a combination of 97 genes and a rank value of 3 were output through the gene group selection of (1) and the matrix optimization process of (2) above.
  • the columns were designated, from left to right, as the BCL2 metagene, the MCL1/BCL2 metagene, and the BFL1/MCL1 metagene.
  • the data of the 153 patients for whom both the above genetic data and drug reactivity information were available were randomly divided 70%:30%. The model was trained with the 70% portion and evaluated with the remaining 30% to obtain the AUROC value. The average of the 10 AUROC values obtained by repeating the random division 10 times was used for model evaluation.
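The repeated 70%:30% evaluation can be sketched as below. The text does not name the classifier behind the drug response model, so a least-squares linear scorer is used purely as a stand-in; the function names `auroc` and `repeated_holdout_auroc` and the synthetic data (153 samples, 3 metagene features) are likewise assumptions.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def repeated_holdout_auroc(X, y, train_frac=0.7, n_repeats=10, seed=0):
    """Mean test AUROC over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add an intercept column
    n_train = int(round(train_frac * n))
    aucs = []
    for _ in range(n_repeats):
        order = rng.permutation(n)
        train, test = order[:n_train], order[n_train:]
        if y[test].min() == y[test].max():
            continue   # AUROC is undefined when the test set has one class
        # Stand-in model: least-squares fit to the 0/1 labels, used as a score.
        coef, *_ = np.linalg.lstsq(Xb[train], y[train].astype(float), rcond=None)
        aucs.append(auroc(y[test], Xb[test] @ coef))
    return float(np.mean(aucs))

# Synthetic stand-in: responder status driven by the first metagene.
rng = np.random.default_rng(7)
X = rng.random((153, 3))
y = (X[:, 0] > 0.5).astype(int)
mean_auc = repeated_holdout_auroc(X, y)
```

AUROC depends only on how the scores rank responders against non-responders, so any model that outputs a continuous score can be compared on the same footing.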
  • the red bar represents the result of the drug response model using the generated metagenes.
  • the blue and light blue bars represent the results using BCL2 family gene expression information, (BCL2+MCL1+BFL1) and (BCL2+MCL1+BFL1+BCLXL+BCLW), respectively.
  • a bar marked with DEG indicates the result of a prediction model using the expression information of the top-ranked genes among the differentially expressed genes (DEG).
  • bars marked with total and BCL2 family-related genes represent the results of models trained with other machine learning methods (Lasso, Random Forest, Support Vector Machine) using total gene expression information and gene expression information reduced by domain knowledge, respectively.
  • the metagene selected according to the method of the present invention is more useful as a biomarker than a single gene or a metagene discovered by applying domain knowledge alone.
  • the metagene generated according to the method of the present invention using domain knowledge and the NMF technique can provide accurate information related to disease diagnosis, prognosis prediction, drug reactivity prediction, etc., and is thus very likely to be used as a new biomarker.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a metagene selection method based on non-negative matrix factorization (NMF) and its application and, more particularly, to a method and a device for generating disease-related metagenes using NMF and domain knowledge, and to a method for predicting non-genetic data of a subject using the same. Metagenes generated using domain knowledge and NMF techniques according to the method of the present invention can provide accurate information related to disease diagnosis, prognosis prediction, drug reactivity prediction, and the like, and are thus highly likely to be used as new biomarkers.
PCT/KR2020/017561 2019-12-03 2020-12-03 Metagene generation method based on non-negative matrix factorization and application thereof WO2021112593A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2019-0159376 2019-12-03
KR20190159376 2019-12-03

Publications (1)

Publication Number Publication Date
WO2021112593A1 true WO2021112593A1 (fr) 2021-06-10

Family

ID=76221834

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/017561 WO2021112593A1 (fr) 2019-12-03 2020-12-03 Metagene generation method based on non-negative matrix factorization and application thereof

Country Status (2)

Country Link
KR (1) KR102659917B1 (fr)
WO (1) WO2021112593A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
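KR20100118644A (ko) * 2009-04-29 2010-11-08 충북대학교 산학협력단 Method for discovering disease proteomes from a disease network
CN104462817A (zh) * 2014-12-09 2015-03-25 西北师范大学 Gene selection and cancer classification method based on Monte Carlo and non-negative matrix factorization
KR20180118984A (ko) * 2017-04-24 2018-11-01 (주) 노보믹스 System for group classification and prognosis prediction based on the biological characteristics of gastric cancer
KR20190000168A (ko) * 2017-06-22 2019-01-02 한국과학기술원 System and method for selecting a multi-marker panel based on markers linked to disease-associated cellular functions
CN109797221A (zh) * 2019-03-13 2019-05-24 上海市第十人民医院 Biomarker combination for molecular typing and/or prognosis prediction of muscle-invasive bladder cancer and application thereof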
US20190348147A1 (en) * 2017-01-31 2019-11-14 Myriad Women's Health, Inc. Systems and methods for inferring genetic ancestry from low-coverage genomic data

Also Published As

Publication number Publication date
KR102659917B1 (ko) 2024-04-23
KR20210069599A (ko) 2021-06-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20895997

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20895997

Country of ref document: EP

Kind code of ref document: A1