CN105701365B - Method for discovering cancer-related genes, related system, and drug preparation method - Google Patents

Info

Publication number
CN105701365B
Authority
CN
China
Prior art keywords
mirna
gene
sample
data
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610019087.6A
Other languages
Chinese (zh)
Other versions
CN105701365A (en
Inventor
杨利英
曹阳
袁细国
张军英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201610019087.6A
Publication of CN105701365A
Application granted
Publication of CN105701365B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for discovering cancer-related genes using miRNA expression data. Based on the PanCancer project of The Cancer Genome Atlas (TCGA), it applies statistical analysis and machine learning algorithms to gene expression data in order to identify genes related to complex diseases. The method comprises: organizing the sample data; statistically analyzing the miRNA data; ranking the miRNAs by rate of change of the mean; selecting target genes; extracting the corresponding disease samples and normal samples; and ranking the genes in the extracted mRNA samples using the Relief algorithm. The invention can find multiple risk genes related to complex diseases such as cancer, and is significant for targeted biological therapy, biopharmaceutical development, elucidation of pathogenesis, and risk prediction for complex diseases.

Description

Method for discovering cancer-related genes, related system, and drug preparation method
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a method for discovering cancer-related genes using miRNA expression data.
Background art
Bioinformatics is a new discipline that combines life science and computer science. It studies the collection, processing, storage, dissemination, analysis and interpretation of biological information, and uses biology, computer science and information technology together to reveal the biological secrets contained in complex biological data. Genes are the carriers of genetic information, and studying them deepens our understanding of disease. More than 20,000 human genes are currently known, so the mRNA gene expression data obtained from the corresponding sequencing data has more than 20,000 dimensions. The genes related to each disease differ; some disease-related genes have already been found, but most remain to be studied. Directly analyzing mRNA gene expression data therefore means processing high-dimensional data, and the computational complexity is very high.
miRNAs are a class of endogenous small RNAs about 20-24 nucleotides long. More than 1,000 human miRNAs are known, and they play a variety of important regulatory roles in the cell. A miRNA can regulate many genes in the human body: each miRNA can have multiple target genes, and multiple miRNAs can regulate the same gene. There are three modes by which miRNAs regulate genes. The first is cleavage of the target molecule. In this case the miRNA and the target are fully complementary in structure, and the miRNA functions much like an siRNA; most plant miRNAs act this way. The second is inhibition of target gene translation. In this case the two are not fully complementary in structure, and this incomplete complementarity blocks translation of the target gene and in turn affects the stability of gene expression. Most miRNAs found so far in non-plant organisms act this way; for example, lin-4 of Caenorhabditis elegans affects the growth and development of the nematode through this mode, which is rare in plants. The third mode combines the first two: where the miRNA binds its target gene complementarily, the target is cleaved, and where binding is not fully complementary, translation of the target is inhibited. Because miRNA expression data has a small number of dimensions, processing miRNA expression data to obtain the risk miRNAs of a disease, and then analyzing the mRNA data of those miRNAs' target genes, makes it possible to predict disease-related genes while reducing the data dimension.
In the prior art, complex-disease mRNA gene expression data is processed directly, which involves high dimensionality and a heavy computational load.
Summary of the invention
The object of the present invention is to provide a method for discovering cancer-related genes using miRNA expression data, aiming to solve the problem that the prior art, by directly processing complex-disease mRNA gene expression data, faces high dimensionality and a heavy computational load.
The invention is realized as follows. A method for discovering cancer-related genes using miRNA expression data is based on the PanCancer project of The Cancer Genome Atlas (TCGA) and applies statistical analysis and machine learning algorithms to gene expression data to identify genes related to complex diseases. The method comprises:
Organizing the sample data: obtaining the miRNA expression data and mRNA expression data of a given disease, both of which include disease samples and corresponding normal samples;
Statistically analyzing the miRNA data: computing the mean expression value of the normal samples and of the disease samples separately, excluding the influence of zero values;
Ranking the miRNAs by rate of change of the mean, with a larger rate of change ranking higher, and selecting the 10 top-ranked miRNAs as the relevant miRNAs;
Using five target gene prediction tools (miRanda, miRDB, miRWalk, RNA22 and TargetScan) to predict mRNAs and obtain the target genes of the corresponding miRNAs. Target genes are selected under the following condition: for the five prediction tools used, let K denote the maximum number of tools that simultaneously predict the same target gene, and let N_K denote the number of genes simultaneously predicted by K tools; a preselected gene must be predicted simultaneously by at least R (0≤R≤K) prediction tools;
According to the selected mRNAs, extracting the corresponding disease samples and normal samples from the original mRNA expression data;
Ranking the genes in the extracted mRNA samples using the Relief algorithm, in descending order of importance, and taking the top 45 genes as the predicted disease-related genes.
Further, when the miRNA data is analyzed, the influence of zero values is excluded. To compute the mean of a miRNA, the number m of nonzero values among the samples is found first, then the sum Sum of the miRNA's sample expression values is computed, and the sample mean is Sum/m. If the mean expression value of the normal samples is n and that of the disease samples is c, the corresponding rate of change of expression is |n-c|/n. According to the rate of change of the mean, the 10 miRNAs with the highest rate of change are taken as the relevant miRNAs.
Further, a preselected gene must be predicted simultaneously by at least R (0≤R≤K) target gene prediction tools: when N_K > 10, R = K; when N_K < 10 and N_{K-1} > 10, R = K-1; similarly, if N_{K-1} < 10 and N_{K-2} > 10, then R = K-2, and so on.
Further, the feature selection method used is the Relief algorithm, whose feature weights are computed by the following formula:
where s_i (i = 1, ..., p) denotes the i-th sample and p is the number of samples; Same_j denotes the j-th same-class neighbor of s_i and Miss_j the j-th different-class neighbor of s_i; k is the number of neighbors; w_f (f = 1, ..., q) denotes the weight of feature f, i.e. the importance of the f-th preselected gene, and q is the number of preselected genes; r is the number of sampling rounds.
The function diff is defined as follows:
where s_if denotes the value of feature f on the i-th sample, s_jf the value of feature f on the j-th sample, Max_f the maximum value of feature f over the samples, and Min_f its minimum. With r = 10 sampling rounds, k = 20 neighbors and 30 iterations of the Relief algorithm, the weight W = {w_1, w_2, ..., w_q} of each feature is computed and the mRNAs are ranked by W.
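The two formulas that the symbol definitions above refer to are not present in this text. A form consistent with those definitions, following the usual Relief-style weight update, would be the following. This is a reconstruction under that assumption, not the patent's own typesetting; in particular the normalization by r·k is assumed from the standard algorithm.

\[
w_f \leftarrow w_f \;-\; \sum_{j=1}^{k} \frac{\mathrm{diff}(f, s_i, Same_j)}{r \cdot k} \;+\; \sum_{j=1}^{k} \frac{\mathrm{diff}(f, s_i, Miss_j)}{r \cdot k}
\]

\[
\mathrm{diff}(f, s_i, s_j) = \frac{\lvert s_{if} - s_{jf} \rvert}{Max_f - Min_f}
\]

Read this way, a feature gains weight when it separates a sampled s_i from its different-class neighbors and loses weight when it differs among same-class neighbors.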
Another object of the present invention is to provide a system implementing the method for discovering cancer-related genes using miRNA expression data, the system comprising:
A sample data organizing module, for obtaining the miRNA expression data and mRNA expression data of a given disease, both of which include disease samples and corresponding normal samples;
A statistical analysis module, for statistically analyzing the miRNA data and computing the mean expression value of the normal samples and of the disease samples separately, excluding the influence of zero values;
A ranking and screening module, for ranking the miRNAs by rate of change of the mean, with a larger rate of change ranking higher, and selecting the 10 top-ranked miRNAs as the relevant miRNAs;
A target gene selection module, for using five target gene prediction tools (miRanda, miRDB, miRWalk, RNA22 and TargetScan) to predict mRNAs and obtain the target genes of the corresponding miRNAs, target genes being selected under the following condition: for the five prediction tools used, let K denote the maximum number of tools that simultaneously predict the same target gene, and let N_K denote the number of genes simultaneously predicted by K tools;
An extraction module, for extracting, according to the selected mRNAs, the corresponding disease samples and normal samples from the original mRNA expression data;
A ranking module, for ranking the genes in the extracted mRNA samples using the Relief algorithm in descending order of importance and taking the top 45 genes as the predicted disease-related genes.
Further, the statistical analysis module comprises:
A nonzero-value unit, for first finding the number m of nonzero values among the samples when computing a miRNA mean;
A sample mean computing unit, for computing the sum Sum of the miRNA's sample expression values and then the sample mean Sum/m;
An expression change rate computing unit, for computing, given a normal-sample mean expression value n and a disease-sample mean expression value c, the corresponding rate of change |n-c|/n;
A ranking unit, for determining, according to the rate of change of the mean, the 10 miRNAs with the highest rate of change as the relevant miRNAs.
Another object of the present invention is to provide a targeted biological therapy system that applies the method for discovering cancer-related genes using miRNA expression data.
Another object of the present invention is to provide a biopharmaceutical development technique that applies the method for discovering cancer-related genes using miRNA expression data.
Another object of the present invention is to provide a pathogenesis elucidation system that applies the method for discovering cancer-related genes using miRNA expression data.
Another object of the present invention is to provide a disease risk prediction system that applies the method for discovering cancer-related genes using miRNA expression data.
The method provided by the invention for discovering cancer-related genes using miRNA expression data is based on the PanCancer project of The Cancer Genome Atlas (TCGA) and applies statistical analysis and machine learning algorithms to gene expression data to identify genes related to complex diseases. The invention can find multiple risk genes related to complex diseases such as cancer, and is significant for targeted biological therapy, biopharmaceutical development, elucidation of pathogenesis, and risk prediction for complex diseases: gene-targeted therapies can be designed for the risk genes found; highly sensitive drugs can be selected, or new drugs developed, according to gene markers; the discovered related genes can be used to analyze the course of a complex disease and so determine its mechanism of formation; and susceptibility gene testing can be performed on the predicted risk genes to reduce risk. Because the large volume of mRNA expression data makes the sample data hard to process, the invention takes miRNAs, whose data volume is small and which regulate mRNAs, as the starting point of the analysis. The prior art processes mRNA expression data of more than 20,000 dimensions, whereas this method analyzes miRNA expression data of just over 1,000 dimensions. This is a 20-fold reduction in dimension, so the computational complexity is lower, computation time is shorter, and drawbacks such as the excessive computation time caused by large data volumes are avoided. The invention uses miRNAs to locate disease-causing mRNAs quickly. It is not limited to a particular complex disease or cancer; the method can be used to analyze related genes for any complex disease. The invention determines the target genes by analyzing the miRNA expression data of a given disease and screens out risk genes from the target genes' mRNA expression data, without needing any other disease-related information. Therefore, as long as the miRNA and mRNA expression data of a disease are provided, the method can be applied, and its applicability is broad.
Description of the drawings
Fig. 1 is a flow chart of the method for discovering cancer-related genes using miRNA expression data provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The invention exploits the smaller volume of miRNA expression data and the targeting relationship between miRNAs and mRNAs: it obtains cancer-related genes by analyzing miRNA expression data, thereby solving the prior art's problem of excessive data volume when mRNA expression data is analyzed directly.
The application principle of the invention is explained in detail below with reference to the accompanying drawing.
As shown in Fig. 1, the method of the embodiment for discovering cancer-related genes using miRNA expression data comprises the following steps:
S101: Screen out cancer-related miRNAs based on the differential expression of miRNAs between normal samples and disease samples;
S102: Using the mapping relationship between miRNAs and mRNAs, target the mRNAs on which those miRNAs act;
S103: Find the cancer-related genes by analyzing the expression data of the targeted mRNAs.
The data used by the invention, including the normal and disease samples for both miRNA and mRNA, all come from the TCGA PanCancer research project.
The specific implementation steps of the invention are as follows.
Step 1: Data processing
The sample data is divided into the following four groups: normal-sample miRNA expression data, disease-sample miRNA expression data, normal-sample mRNA expression data, and disease-sample mRNA expression data. The miRNA and mRNA names must correspond consistently across the samples.
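The grouping in Step 1 can be sketched as follows. This is a minimal illustration only: the data layout (a dictionary from sample ID to per-feature values), the label strings, and the function names `split_by_label` and `check_feature_names` are assumptions, and the consistency check reflects one reading of the naming requirement (the same miRNA or mRNA names in the normal and disease groups).

```python
def split_by_label(expr, labels):
    """Split {sample_id: {feature: value}} into (normal, disease) groups.

    `labels` maps sample_id to "normal" or "disease" (hypothetical labels).
    """
    normal = {s: v for s, v in expr.items() if labels[s] == "normal"}
    disease = {s: v for s, v in expr.items() if labels[s] == "disease"}
    return normal, disease


def check_feature_names(normal, disease):
    """Return the shared feature names, or raise if the two groups differ."""
    names_n = {f for values in normal.values() for f in values}
    names_d = {f for values in disease.values() for f in values}
    if names_n != names_d:
        raise ValueError(f"feature names differ: {sorted(names_n ^ names_d)}")
    return sorted(names_n)
```

Calling `split_by_label` once on the miRNA data and once on the mRNA data yields the four groups listed above.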
Step 2: Screen out the miRNAs with a high rate of change of expression
1. For the miRNA data, compute the mean expression value of each miRNA in the normal and the disease sample data. Because some samples have a miRNA expression value of 0, zero values must be accounted for: if a miRNA has m nonzero values among its samples and the sum of its sample values is Sum, the mean for that miRNA is Sum/m. The normal-sample and disease-sample means of each miRNA are computed this way.
2. From the sample means, compute the rate of change of each miRNA: if a miRNA's normal-sample mean is n and its disease-sample mean is c, the rate of change is |n-c|/n.
3. Rank all miRNAs by rate of change and select the 10 miRNAs with the largest rate of change.
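The three sub-steps of Step 2 can be sketched in a few lines. This is an illustrative sketch: the function names and the layout (a dictionary from miRNA name to a list of expression values) are assumptions, and skipping a miRNA whose normal-sample mean is zero is a choice made here because |n-c|/n is undefined in that case.

```python
def nonzero_mean(values):
    """Mean over nonzero values only: Sum / m, with m the nonzero count."""
    nonzero = [v for v in values if v != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


def change_rate(normal_values, disease_values):
    """Rate of change |n - c| / n; None when the normal mean n is zero."""
    n = nonzero_mean(normal_values)
    c = nonzero_mean(disease_values)
    if n == 0:
        return None
    return abs(n - c) / n


def top_mirnas(normal, disease, top=10):
    """Rank miRNAs by rate of change, largest first, and keep the top ones."""
    rates = {}
    for name in normal:
        rate = change_rate(normal[name], disease[name])
        if rate is not None:
            rates[name] = rate
    return sorted(rates, key=rates.get, reverse=True)[:top]
```

For example, a miRNA with normal values [2, 0, 4] and disease values [6, 6, 0] has n = 3 and c = 6, giving a rate of change of 1.0.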
Step 3: Obtain the target genes of the selected miRNAs
Five target gene prediction tools (miRanda, miRDB, miRWalk, RNA22 and TargetScan) are used to obtain the target genes of the selected miRNAs. For the five prediction tools used, let K denote the maximum number of tools that simultaneously predict the same target gene, and let N_K denote the number of genes simultaneously predicted by K tools. The method requires that a preselected gene be predicted simultaneously by at least R (0≤R≤K) prediction tools: when N_K > 10, R = K; when N_K < 10 and N_{K-1} > 10, R = K-1; similarly, if N_{K-1} < 10 and N_{K-2} > 10, then R = K-2, and so on.
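The rule for choosing R can be sketched as a vote count over the tools' predictions. Several points are assumptions made for this sketch and are not stated in the text: N_k is read as the number of genes predicted by at least k tools, the comparison is strictly "> 10" (so a count of exactly 10 moves on to the next lower k), and when no threshold yields more than 10 genes the sketch falls back to R = 1. The function name `select_targets` and the input layout are hypothetical.

```python
from collections import Counter


def select_targets(predictions, min_count=10):
    """Choose R and the preselected genes from per-tool predictions.

    `predictions` maps a tool name to the set of target genes it predicts.
    Returns (R, genes predicted by at least R tools).
    """
    votes = Counter(g for genes in predictions.values() for g in set(genes))
    if not votes:
        return 0, set()
    K = max(votes.values())  # most tools agreeing on any single gene
    for R in range(K, 0, -1):
        chosen = {g for g, v in votes.items() if v >= R}
        if len(chosen) > min_count:  # N_R > 10 in the text's notation
            return R, chosen
    # Assumed fallback: keep every predicted gene.
    return 1, set(votes)
```

Lowering R one step at a time trades agreement between tools for a candidate set large enough to rank in the next step.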
Step 4: Screen related genes using the Relief algorithm
According to the selected target-gene mRNAs, the normal samples and disease samples are extracted from the mRNA expression data and the two classes of sample data are combined. The Relief algorithm then ranks the mRNAs in descending order of importance, with r = 10 sampling rounds, k = 20 neighbors and 30 iterations. According to the Relief ranking, the top 45 are selected as the predicted related genes.
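A compact Relief-style ranking along the lines of Step 4 is sketched below in plain Python. It is a sketch under assumptions, not the patent's implementation: "30 iterations" is interpreted as 30 independent runs whose weights are summed, neighbor distance is the sum of range-normalized feature differences, and when fewer than k neighbors of a class exist the available ones are used. All function names are illustrative.

```python
import random


def relief_weights(X, y, r=10, k=20, seed=0):
    """One Relief run: X is a list of samples (lists of floats), y the labels."""
    rng = random.Random(seed)
    q = len(X[0])
    lo = [min(row[f] for row in X) for f in range(q)]
    hi = [max(row[f] for row in X) for f in range(q)]
    span = [(hi[f] - lo[f]) or 1.0 for f in range(q)]  # avoid division by zero

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(q))

    w = [0.0] * q
    for _ in range(r):
        i = rng.randrange(len(X))
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: dist(X[i], X[j]))
        hits = [j for j in others if y[j] == y[i]][:k]     # same-class neighbors
        misses = [j for j in others if y[j] != y[i]][:k]   # other-class neighbors
        for f in range(q):
            if hits:
                w[f] -= sum(diff(f, X[i], X[j]) for j in hits) / (r * len(hits))
            if misses:
                w[f] += sum(diff(f, X[i], X[j]) for j in misses) / (r * len(misses))
    return w


def rank_genes(X, y, genes, r=10, k=20, iterations=30, top=45):
    """Sum weights over several runs and return the top-ranked gene names."""
    total = [0.0] * len(genes)
    for it in range(iterations):
        w = relief_weights(X, y, r=r, k=k, seed=it)
        total = [t + v for t, v in zip(total, w)]
    order = sorted(range(len(genes)), key=lambda f: total[f], reverse=True)
    return [genes[f] for f in order[:top]]
```

Here each row of `X` is one combined normal or disease sample restricted to the preselected target genes, and `y` records which class the sample belongs to.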
The application effect of the invention is explained in detail below with reference to experiments.
Experiment 1: The breast cancer expression data (BRCA) from the TCGA PanCancer project is chosen as the experimental subject; the data contains 1,045 miRNAs and 20,530 mRNAs. The breast cancer expression data is processed according to the steps above:
1. Import the miRNA sample data and first screen all normal-sample data to ensure that no miRNA is all zero in the normal samples. If the normal-sample data of some miRNA is all zero, delete that miRNA's data from the normal samples and also delete the corresponding miRNA's data from the disease samples.
2. For the screened miRNA data, compute the means of the normal and the disease samples separately and compute the rate of change.
3. Rank the miRNAs by rate of change and select the 10 with the largest rate of change. The 10 miRNAs finally selected in this experiment are: hsa-mir-133b, hsa-mir-133a, hsa-mir-208b, hsa-mir-206, hsa-mir-551b, hsa-mir-145, hsa-mir-378, hsa-mir-451, hsa-mir-144, hsa-mir-1.
4. For the selected miRNAs, predict the target-gene mRNAs using the five target gene prediction tools described above, obtaining 725 target genes.
5. Select the data corresponding to the 725 target-gene mRNAs from the mRNA data, then rank the mRNAs by importance with the Relief algorithm, setting r = 10 sampling rounds, k = 20 neighbors and 30 iterations, and take the top 45 mRNAs as the related genes. The 45 mRNAs are: RXFP2, GYPA, OTX2, PRDM9, CYP11B1, MMD2, CHRNA4, NEUROD1, PABPC1L2B, RIT2, CNTN5, NEUROD4, SLC4A1, PRDM7, FBXO40, GABRG2, GPR6, ZIC3, SPINLW1, DMRT1, CYP3A4, DPCR1, LHX9, ISL2, LIPI, SOST, HHLA2, S100A7, RIPPLY1, TRHDE, BMP3, KCNMB2, PAX5, PAX3, ANGPT4, DSCAM, EREG, OR7D2, DRD1, GFRA3, LEP, GPR26, LIX1, ZIC1, GDAP1L1.
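The first item of the experimental procedure above, removing any miRNA whose normal-sample values are all zero from both groups, can be sketched as follows. The function name `drop_all_zero_normal` and the layout (a dictionary from miRNA name to a list of expression values) are assumptions for illustration.

```python
def drop_all_zero_normal(normal, disease):
    """Drop miRNAs that are all zero in the normal samples from both groups."""
    keep = [name for name, values in normal.items()
            if any(v != 0 for v in values)]
    filtered_normal = {name: normal[name] for name in keep}
    filtered_disease = {name: disease[name] for name in keep if name in disease}
    return filtered_normal, filtered_disease
```

Filtering first keeps the rate of change |n-c|/n well defined, since every remaining miRNA has a nonzero normal-sample mean.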
The following analyzes the role of the genes found in breast cancer and their functional connections with known important breast cancer genes, in order to show the relevance of these genes to breast cancer and so verify the effectiveness of the proposed method.
NEUROD1 is a basic helix-loop-helix (bHLH) transcription factor of the NeuroD family. It forms heterodimers with other bHLH transcription factors and activates transcription of a specific DNA sequence known as the E-box; it also contributes to the regulation of several cell differentiation pathways. Heidi Fiegl found that NEUROD1 is aberrantly methylated in breast tissue tumors and lung tumor samples, and that the methylation level is higher in samples of higher tumor grade. The protein encoded by SLC4A1 is a member of the AE protein family. It plays a major role in red blood cells, where it acts as a transporter that helps the corresponding substances cross the cell membrane. Research by A Gorbatenko shows that SLC4A1 is downregulated in all breast cancer subtypes, which suggests that SLC4A1 may have some influence on breast cancer pathology. CYP3A4 encodes a cytochrome P450 enzyme that takes part in the metabolism of about half of the drugs in current use, such as paracetamol, codeine, cyclosporin A, diazepam and erythromycin, and also in the metabolism of some steroids and carcinogens. C Keshava found that variation in CYP3A4 may dysregulate hormone metabolism in breast cancer and may also activate exogenous compounds that lead to cancer, making it a significant breast cancer-related gene. The protein encoded by HHLA2 is present on the surface of monocytes. It binds receptors on lymphocytes, thereby regulating cell-mediated immunity and inhibiting the proliferation of monocytes. M Janakiram, by analyzing TCGA expression data, found that HHLA2 copy number is increased by 29% in breast cancer, leading to overexpression of HHLA2 in breast cancer, which indirectly indicates that HHLA2 has some influence on breast cancer pathology. The protein encoded by S100A7 is a member of the S100 protein family. S100 proteins are widely present in the cytoplasm and nucleus and take part in many cellular processes, such as regulation of the cell cycle and differentiation. Emberley described in detail the state of research on S100A7 in breast cancer and its specific mode of action and expression in breast cancer, and Haddadd likewise described the relationship between S100A7 and breast cancer. PAX3 is a member of the PAX transcription factor family. It contains a paired box domain and a paired-type homeodomain, and these genes play a very important role in fetal development. WJ Tan described the expression of PAX3 in clinical breast cancer in detail and analyzed the effect of PAX3 on breast cancer. LEP encodes a protein secreted by white adipocytes. LEP plays an important role mainly in regulating body weight: it inhibits food intake and regulates energy expenditure. Cleveland described the correlation between LEP gene mutations and breast cancer incidence, which indicates that abnormal LEP expression may lead not only to weight imbalance but also to breast cancer lesions.
The analysis above shows that the predicted genes can influence breast cancer pathology, but the specific pathogenic mechanisms of these genes still require in-depth analysis by specialists in the field.
Next, the pathway analysis tool of the DAVID database and the STRING-DB database are used to analyze the predicted genes as a whole. Both analyses can show indirectly that the predicted genes may cause disease by acting on important cancer genes, confirming the association between the predicted genes and cancer genes. The important breast cancer genes chosen for this experiment are PIK3CA, TP53, PTEN, AKT1 and SF3B1.
The DAVID database shows that risk pathways exist between the predicted genes and the important genes, and that related risk pathways also exist among the predicted genes themselves. EREG shares pathways with the important genes PIK3CA and AKT1, and LEP shares risk pathways with PIK3CA and AKT1. Notably, LEP also shares pathways with the screened genes DRD1, GABRG2 and RXFP2, as shown in Table 1. Pathway analysis also found that LEP and EREG are linked to the important breast cancer genes PIK3CA and AKT1 in terms of biological metabolism, and that pathways exist between DRD1, GABRG2, RXFP2 and LEP; these interconnections may be a source of the disease.
Table 1: Pathways in which breast cancer-related genes participate
The important breast cancer genes were put together with the 45 screened genes and their interactions were examined on STRING-DB; the results are shown in Table 2. Some genes, such as OR7D2, HHLA2 and DPCR1, have no connection with any other gene. This does not mean they have no effect on breast cancer: these genes may act on breast cancer on their own (HHLA2, for example, whose influence on breast cancer lesions was analyzed above), or they may interact with other important breast cancer genes. The remaining genes have many interactions among themselves and form a relational network; the predicted genes may influence the important breast cancer genes in the network positively or negatively through some biological function, and so give rise to breast cancer lesions.
Table 2: Associations between breast cancer-related genes and important genes
Experiment 2: The kidney cancer expression data (KIRC) from the TCGA PanCancer project is chosen as the experimental subject; the data contains 1,045 miRNAs and 20,530 mRNAs.
Using the same method as in Experiment 1, the miRNAs are ranked by rate of change and the top 10 are selected as the target miRNAs. The selected miRNAs are: hsa-mir-200c, hsa-mir-514b, hsa-mir-506, hsa-mir-508, hsa-mir-514-2, hsa-mir-141, hsa-mir-514-3, hsa-mir-514-1, hsa-mir-184, hsa-mir-934. The target genes of the selected miRNAs are predicted using the five target gene prediction tools described above, giving 504 mRNAs. The sample data for these mRNAs is selected from the original mRNA expression data, weights are computed with the Relief algorithm, and the top 45 mRNAs are chosen as the target mRNAs. The 45 selected mRNAs are: ODAM, KLHL1, TAC1, NPY2R, HYAL4, FOXE1, TTR, SLC6A14, GLRA3, FUT9, GRIA2, KCNA1, CXorf41, TFAP2B, SFTPB, CRISP1, PDE6H, AGXT2L1, LHFPL4, SLC30A8, STXBP5L, TMEM196, IL1F5, ASTN1, CRISP3, HTR2C, LIN28B, TRIM42, KIAA1486, COL9A1, GCM1, TNNI1, SCG3, ANXA10, BTC, SORCS1, KCND2, LRRN1, MSTN, ERBB4, PRG4, NAPB, ARHGAP12, C12orf53, RAD52.
The following analyzes the role of the genes found in the kidney, in order to show the relevance of these genes to kidney cancer and so verify the effectiveness of the proposed method.
KLHL1 is a protein-coding gene belonging to the muscle tissue protein family. It is expressed in the tissue cells of the kidney and also in many brain tissues. This suggests that mutation of KLHL1 may cause functional problems in certain kidney cells and so influence carcinogenesis in renal tissue. The protein encoded by NPY2R is the Y2 receptor of neuropeptide Y (NPY). NPY receptors take part in a variety of biological behaviors, including food intake, anxiolysis, circadian rhythm, pain modulation and transmission, and control of pituitary hormone release. In the human kidney, 293 cells are regulated by genes including NPY2R, which shows that NPY2R plays an important role in kidney function; whether the NPY2R gene is expressed normally may play a role in kidney disease. KCNA1 belongs to the 6-TM family of potassium channels, which includes the Ca2+-activated channels. It contributes to formation of the tetramer and also takes part in forming proteins of subfamilies such as KV1.1, KV1.2, KCNQ2 and KCNQ3. Mutation of KCNA1 affects the function of renal tissue: analysis of human kidney cells found that KCNA1 mutations in non-functional regions lead to overexpression and also have a negative impact on the functional regions of KV1.1. FOXE1 belongs to a transcription factor family. The gene may have a related influence in the mutations of thyroid disease, and thereby play a role in kidney lesions. It is reported that genes including FOXE1 can affect the kidney by affecting thyroid function, leading to renal malformation and lesions. The protein encoded by SLC6A14 is a member of solute carrier family 6. This family mainly facilitates the transport of sodium and chloride in human neural tissue; the encoded protein also takes part in the transport of neutral and cationic amino acids and serves as a carrier of β-alanine. SLC6A14 has been found to be overexpressed in diseased kidney tissue, and this overexpression suggests that variation in SLC6A14 may affect kidney function. The Simon Rex oligosaccharide fucosyltransferase encoded by FUT9 belongs to the glycosyltransferase family, is found mainly in the Golgi apparatus, and also plays an important role in embryonic organ development; FUT9 is also responsible for regulating the expression of CD15 in mature granulocytes. A 1.8-fold reduction of FUT9 expression in the kidney can cause a 1.8-fold increase of CD24A expression in the kidney, which can directly and seriously impair normal renal function. Although FUT9 does not directly cause kidney lesions, it can affect kidney function indirectly by influencing CD24A expression, so its effect on the kidney should not be overlooked. STXBP5L is an important paralog gene, and the protein it encodes can bind syntaxin. STXBP5, a protein that interacts with neurons and synapses, is abundant in renal tissue and has a large effect on kidney function, which suggests that variation in STXBP5L has a significant impact on kidney function.
The kidney cancer related genes found are analysed as a whole below, using the KEGG pathway tool of the DAVID database and the STRING-DB database. The important KIRC genes selected here are TP53, CDH1, VEGFA, MUC1 and EGFR. The analysis shows that associated pathways exist among the predicted genes, and also between the predicted genes and the important genes. BTC and ERBB share a risk pathway with the important kidney gene EGFR; VEGFA and HTR2C also share associated pathways with EGFR; the predicted kidney-related gene HTR2C shares pathways with EGFR, VEGFA, GRIA2 and TP53; in addition, risk pathways exist among GRIA2, GLRA3 and NPY2R, as shown in Table 3.
Table 3. Pathways in which the kidney-related genes participate
The interactions among the predicted kidney-related genes and the important genes were checked with the STRING-DB database; the results are shown in Table 4. Table 4 shows many links among the predicted genes and between them and the important genes, which suggests that the predicted genes may act on the important kidney genes in certain ways, affecting their normal expression and leading to kidney lesions. The four genes TTR, KCNA1, FOXE1 and ODAM deserve particular attention: they are related to many important kidney genes and may act on several of them at the same time.
Table 4. Associations between the kidney-related genes and the important genes
Working principle of the present invention:
By analysing the lower-dimensional miRNA expression data and exploiting the regulatory effect of miRNA on genes, a low-dimensional subset is targeted within the high-dimensional mRNA expression data. The Relief algorithm is then used to determine the importance of the gene in each dimension, thereby screening out the genes related to a complex disease. Relief is a feature weighting algorithm proposed by Kira and Rendell in 1992. It is trained on samples and derives a classification weight for each sample feature; a larger weight means the feature matters more for classification. The feature weight in the Relief algorithm is calculated as follows:
where s_i (i = 1, …, p) denotes the i-th sample and p is the number of samples; Same_j denotes the j-th same-class sample of s_i and Miss_j the j-th different-class sample of s_i; k is the number of neighbours; w_f (f = 1, …, q) is the weight of feature f, i.e. the importance of the f-th preselected gene, and q is the number of preselected genes; r is the number of sampling iterations. The function diff is defined as follows:
where s_if denotes the value of feature f on the i-th sample, s_jf the value of feature f on the j-th sample, Max_f the maximum value of feature f over the samples, and Min_f the minimum value of feature f over the samples. Given the specified number of sampling iterations and number of neighbours, the Relief algorithm obtains the weight of each feature, W = {w_1, w_2, …, w_q}, through repeated iteration, and the features are then ranked according to W.
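The weight update described above can be sketched in Python as follows. The equations themselves are not reproduced in this text, so the sketch assumes the usual Relief form, which is consistent with the symbol definitions: w_f ← w_f − Σ_j diff(f, s_i, Same_j)/(r·k) + Σ_j diff(f, s_i, Miss_j)/(r·k), with diff(f, s_i, s_j) = |s_if − s_jf| / (Max_f − Min_f). The function name `relief_weights`, the summed per-feature diff used to find neighbours, and the random seed are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def relief_weights(X, y, r=10, k=20, seed=0):
    """Relief-style feature weights (illustrative sketch, not the patent's code).

    X: (p, q) array, p samples by q preselected genes.
    y: (p,) class labels (e.g. 0 = normal, 1 = disease).
    r: number of sampling iterations, k: number of neighbours.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p, q = X.shape

    # Max_f - Min_f for each feature; constant features get a span of 1
    # so that their diff is simply 0 instead of a division by zero.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = X / span  # |Xn[i, f] - Xn[j, f]| equals diff(f, s_i, s_j)

    rng = np.random.default_rng(seed)
    w = np.zeros(q)
    for _ in range(r):
        i = rng.integers(p)            # randomly sampled s_i
        d = np.abs(Xn - Xn[i])         # per-feature diff to every sample
        dist = d.sum(axis=1)           # assumed neighbour distance (sum of diffs)

        same = np.flatnonzero(y == y[i])
        same = same[same != i]         # same-class samples, excluding s_i
        other = np.flatnonzero(y != y[i])

        hits = same[np.argsort(dist[same])[:k]]      # Same_1 .. Same_k
        misses = other[np.argsort(dist[other])[:k]]  # Miss_1 .. Miss_k

        # Penalise features that differ among same-class neighbours,
        # reward features that differ among different-class neighbours.
        w -= d[hits].sum(axis=0) / (r * k)
        w += d[misses].sum(axis=0) / (r * k)
    return w

def rank_features(w):
    """Indices of features sorted by descending weight."""
    return np.argsort(-w, kind="stable")
```

Calling `rank_features(relief_weights(X, y))` gives the gene order from most to least important; if a class has fewer than k samples, the sketch simply uses the neighbours that exist.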
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for discovering cancer-related genes using miRNA expression data, characterized in that the method is based on the pan-cancer project PanCancer under The Cancer Genome Atlas (TCGA) and uses statistical analysis and machine learning algorithms to analyse and process gene expression data and to identify genes related to a complex disease; the method comprising:
sample data preparation: obtaining miRNA expression data and mRNA expression data for a given disease, both kinds of data comprising disease samples and corresponding normal samples;
statistically analysing the miRNA data to obtain the mean expression values of the normal samples and of the disease samples respectively, the influence of zero values being excluded in this process;
ranking the miRNAs by rate of change of the mean, a larger rate of change giving a higher rank, and selecting the 10 top-ranked miRNAs as related miRNAs;
using the five target gene prediction programs miRanda, miRDB, miRWalk, RNA22 and TargetScan as tools for predicting mRNA, and obtaining the target genes of the corresponding miRNAs, the target genes being selected under the following condition: for the five target gene prediction programs used, let K denote the maximum number of prediction programs that simultaneously predict the same target gene, and N_k the number of genes predicted simultaneously by K target gene prediction programs; a preselected gene must be predicted simultaneously by at least R target gene prediction programs, where 0 ≤ R ≤ K;
according to the selected mRNAs, extracting the corresponding disease samples and normal samples from the original mRNA expression data;
ranking the genes in the extracted mRNA samples with the Relief algorithm in descending order of importance, and taking the top 45 genes as the predicted disease-related genes.
2. The method for discovering cancer-related genes using miRNA expression data according to claim 1, characterized in that, when the miRNA data are analysed, the influence of zero values is excluded: when computing the miRNA mean, the number m of non-zero values across the samples is first determined, then the sum Sum of the miRNA sample expression values is obtained, and the sample mean is computed as Sum/m; with the mean expression value of the normal samples denoted n and that of the disease samples denoted c, the corresponding rate of change of expression is |n-c|/n; according to the rate of change of the miRNA means, the 10 miRNAs with the highest rate of change of expression are determined to be the related miRNAs.
3. The method for discovering cancer-related genes using miRNA expression data according to claim 1, characterized in that a preselected gene must be predicted simultaneously by at least R target gene prediction programs, 0 ≤ R ≤ K; when N_k > 10, R = K; when N_k < 10 and N_{k-1} > 10, R = K-1; similarly, if N_{k-1} < 10 and N_{k-2} > 10, then R = K-2.
4. The method for discovering cancer-related genes using miRNA expression data according to claim 1, characterized in that the feature selection method used is the Relief algorithm, whose feature weight is calculated as follows:
where s_i denotes the i-th sample, i = 1, …, p; p is the number of samples; Same_j denotes the j-th same-class sample of s_i and Miss_j the j-th different-class sample of s_i; k is the number of neighbours; w_f (f = 1, …, q) is the weight of feature f, i.e. the importance of the f-th preselected gene, and q is the number of preselected genes; r is the number of sampling iterations;
the function diff is defined as follows:
where s_if denotes the value of feature f on the i-th sample, s_jf the value of feature f on the j-th sample, Max_f the maximum value of feature f over the samples, and Min_f the minimum value of feature f over the samples; the number of sampling iterations is r = 10, the number of neighbours is k = 20, and the Relief algorithm is iterated 30 times to calculate the weight of each feature, W = {w_1, w_2, …, w_q}, and the mRNAs are ranked according to the weights W.
5. A system for the method for discovering cancer-related genes using miRNA expression data according to claim 1, characterized in that the system comprises:
a sample data preparation module, configured to obtain miRNA expression data and mRNA expression data for a given disease, both kinds of data comprising disease samples and corresponding normal samples;
a statistical analysis module, configured to statistically analyse the miRNA data and obtain the mean expression values of the normal samples and of the disease samples respectively, the influence of zero values being excluded in this process;
a screening and ranking module, configured to rank the miRNAs by rate of change of the mean, a larger rate of change giving a higher rank, and to select the 10 top-ranked miRNAs as related miRNAs;
a target gene selection module, configured to use the five target gene prediction programs miRanda, miRDB, miRWalk, RNA22 and TargetScan as tools for predicting mRNA and to obtain the target genes of the corresponding miRNAs, the target genes being selected under the following condition: for the five target gene prediction programs used, let K denote the maximum number of prediction programs that simultaneously predict the same target gene, and N_k the number of genes predicted simultaneously by K target gene prediction programs; a preselected gene must be predicted simultaneously by at least R target gene prediction programs, 0 ≤ R ≤ K; when N_k > 10, R = K; when N_k < 10 and N_{k-1} > 10, R = K-1; similarly, if N_{k-1} < 10 and N_{k-2} > 10, then R = K-2;
an extraction module, configured to extract, according to the selected mRNAs, the corresponding disease samples and normal samples from the original mRNA expression data;
a ranking module, configured to rank the genes in the extracted mRNA samples with the Relief algorithm in descending order of importance and to take the top 45 genes as the predicted disease-related genes.
6. The system according to claim 5, characterized in that the statistical analysis module further comprises:
a non-zero value counting unit, configured to determine, when the miRNA mean is computed, the number m of non-zero values across the samples;
a sample mean calculation unit, configured to obtain the sum Sum of the miRNA sample expression values and then to compute the sample mean as Sum/m;
an expression change rate calculation unit, wherein, with the mean expression value of the normal samples denoted n and that of the disease samples denoted c, the corresponding rate of change of expression is |n-c|/n;
a ranking unit, configured to determine, according to the rate of change of the miRNA means, the 10 miRNAs with the highest rate of change of expression to be the related miRNAs.
7. A biological targeted therapy system using the method for discovering cancer-related genes using miRNA expression data according to any one of claims 1-4.
8. A biopharmaceutical preparation method using the method for discovering cancer-related genes using miRNA expression data according to any one of claims 1-4.
9. A pathogenesis system using the method for discovering cancer-related genes using miRNA expression data according to any one of claims 1-4.
10. A disease risk prediction system using the method for discovering cancer-related genes using miRNA expression data according to any one of claims 1-4.
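The statistical steps recited in claims 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each miRNA is taken to be one row of an expression matrix, N is read as the number of genes supported by at least that many prediction programs, and the case where no level exceeds the threshold (which the claims do not cover) falls back to R = 1.

```python
import numpy as np

def nonzero_mean(expr):
    """Per-miRNA mean over non-zero values only, i.e. Sum/m (rows = miRNAs)."""
    expr = np.asarray(expr, dtype=float)
    m = np.count_nonzero(expr, axis=1)   # number of non-zero values
    total = expr.sum(axis=1)             # Sum of expression values
    # Rows with no non-zero value get a mean of 0 (assumption).
    return np.divide(total, m, out=np.zeros_like(total), where=m > 0)

def change_rate(normal, disease):
    """Rate of change |n - c| / n between normal and disease means."""
    n = nonzero_mean(normal)
    c = nonzero_mean(disease)
    # A zero normal mean gives a rate of 0 here (assumption).
    return np.divide(np.abs(n - c), n, out=np.zeros_like(n), where=n > 0)

def top_mirnas(names, normal, disease, top=10):
    """Names of the miRNAs with the largest rate of change."""
    order = np.argsort(-change_rate(normal, disease), kind="stable")[:top]
    return [names[i] for i in order]

def choose_r(support, threshold=10):
    """Pick R from a dict mapping gene -> number of programs predicting it.

    Starts at K (the largest support seen) and steps down until more than
    `threshold` genes are supported by at least R programs.
    """
    counts = list(support.values())
    K = max(counts)
    for R in range(K, 0, -1):
        if sum(c >= R for c in counts) > threshold:
            return R
    return 1  # fallback not specified in the claims (assumption)

def preselect(support, threshold=10):
    """Genes predicted by at least R programs."""
    R = choose_r(support, threshold)
    return sorted(g for g, c in support.items() if c >= R)
```

The genes returned by `preselect` would then define which columns are extracted from the mRNA expression data before the Relief ranking.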
CN201610019087.6A 2016-01-12 2016-01-12 Method and related system for discovering cancer-related genes, and medicine preparation method Active CN105701365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610019087.6A CN105701365B (en) 2016-01-12 2016-01-12 It was found that the method and related system of cancer related gene, process for preparing medicine

Publications (2)

Publication Number Publication Date
CN105701365A CN105701365A (en) 2016-06-22
CN105701365B true CN105701365B (en) 2018-09-07

Family

ID=56226286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610019087.6A Active CN105701365B (en) 2016-01-12 2016-01-12 Method and related system for discovering cancer-related genes, and medicine preparation method

Country Status (1)

Country Link
CN (1) CN105701365B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102321735A (en) * 2010-11-25 2012-01-18 上海聚类生物科技有限公司 Method for searching target gene of reverse miRNA
CN105063209A (en) * 2015-08-10 2015-11-18 北京吉因加科技有限公司 Quantitative detection method of exosome miRNA (micro ribonucleic acid)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3074530A1 (en) * 2013-11-26 2016-10-05 Integragen A method for predicting responsiveness to a treatment with an egfr inhibitor
EP3140422A1 (en) * 2014-05-03 2017-03-15 The Regents of The University of California Methods of identifying biomarkers associated with or causative of the progression of disease, in particular for use in prognosticating primary open angle glaucoma

Also Published As

Publication number Publication date
CN105701365A (en) 2016-06-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant