CN106021983A - DNA and protein level mutation analysis method - Google Patents
DNA and protein level mutation analysis method
- Publication number
- CN106021983A (application number CN201610319389.5A)
- Authority
- CN
- China
- Prior art keywords
- mutation
- gene
- cds
- dna
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a DNA and protein level mutation analysis method. The method comprises the steps of: 1, reading a gene mutation file and formatting the names it contains into standard names; 2, indexing transcript sequences, gene information and gene transcript annotation information, and building an amino acid-codon correspondence table; 3, judging the level and type of each mutation, that is, whether the mutation name denotes a protein-level mutation, a genomic DNA-level mutation or a CDS coding-region mutation; and 4, entering the mapping process for the corresponding level according to the result of step 3, thereby obtaining the mapping relationships among the three mutation names. The method accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so that pathogenic variants mined from the literature can be matched to the gene mutations and polymorphic sites identified by sequencing.
Description
Technical field
The invention belongs to the field of genetic information data processing, and in particular relates to a DNA and protein level mutation analysis method.
Background art
In the more than 50 years since Watson and Crick discovered the double-helix structure of DNA, the study of genetic variation has played a key role in research on the onset, progression, prevention and treatment of human disease, and the completion of the Human Genome Project opened up wide scope for identifying genetic variants associated with more diseases and phenotypes. In recent years, as technology has progressed from microarray chips and Sanger sequencing to today's high-throughput sequencing, more and more genetic variants and polymorphic sites have been detected. They reveal the mechanisms of diseases and numerous phenotypes at the molecular level, and bring new hope for unlocking the secrets of life and conquering intractable diseases.
However, the gene mutations and polymorphic sites identified by different researchers lack a unified naming convention. For example, when the tumor suppressor gene TP53 carries a T-to-A base change at genomic position 7579553, some researchers name the variant directly by genomic position (TP53:g.7579553T>A), some name it by the change in the gene coding region (TP53:c.134T>A), and others name it by the resulting protein-level change (TP53:p.L45Q). Even when the description is at the same protein level, differences in the reference gene sequence used when the mutation or polymorphism was identified lead to different final names, which can be so confusing that the names become unusable. For the L45Q mutation of TP53 alone, the reference transcripts used by different studies include NM_001126112, NM_000546, NM_001126113 and NM_001126114. Naming at these different levels ultimately makes it very difficult for later researchers to analyze and annotate variants in a unified, efficient and accurate way on the basis of earlier work. For example, a reported literature-mining study of gene mutations and polymorphic sites associated with human breast cancer covered more than 4,000 PubMed documents and extracted more than 3,600 gene mutations and polymorphic sites, but because a consistent naming scheme is lacking, these results are difficult to apply in downstream analysis.
In recent years next-generation sequencing has become increasingly widespread, and a large amount of bioinformatics analysis software has emerged with it. Against this background, researchers can quickly use mature bioinformatics software and pipelines to analyze massive genome sequencing data, for example to identify gene mutations and polymorphic sites. These mutations become useful only once they have been further understood and annotated in the light of earlier research, for example in precision medicine for personalized medication, diagnosis and treatment of disease. Because much of the earlier research follows no unified standard for naming gene mutations, it is difficult to annotate and interpret the analysis results further.
Summary of the invention
In view of this, the present invention proposes a DNA and protein level mutation analysis method that accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so as to annotate the correspondence between pathogenic variants mined from the literature and the gene mutations and polymorphic sites identified by sequencing.
To achieve the above purpose, the technical solution of the present invention is realized as follows. A DNA and protein level mutation analysis method comprises the following steps:
1) read the gene mutation file and format the names it contains into standard names;
2) index the transcript sequences, gene information and gene transcript annotation information, and build the amino acid-codon correspondence table;
3) judge the level at which each mutation occurs and the type of mutation, that is, whether the mutation name denotes a protein-level mutation, a genomic DNA-level mutation or a CDS coding-region mutation;
4) according to the result of step 3), enter the mutation mapping process for the corresponding level and obtain the mapping relationships among the three mutation names.
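The level judgment in step 3) can be illustrated with a minimal Python sketch. It assumes the mutation names follow the `g.` / `c.` / `p.` prefix style used in the TP53 examples of the background section and handles single-base or single-residue substitutions only; the function name `classify_mutation` and the regular expressions are hypothetical and are not taken from the patent.

```python
import re

# Hypothetical patterns for the three naming levels (substitutions only):
# g. = genomic DNA, c. = CDS coding region, p. = protein.
PATTERNS = {
    "genomic": re.compile(
        r"^(?P<ref>[^:]+):g\.(?P<pos>\d+)(?P<old>[ACGT])\s*>\s*(?P<new>[ACGT])$"),
    "cds": re.compile(
        r"^(?P<ref>[^:]+):c\.(?P<pos>\d+)(?P<old>[ACGT])\s*>\s*(?P<new>[ACGT])$"),
    "protein": re.compile(
        r"^(?P<ref>[^:]+):p\.(?P<old>[A-Z])(?P<pos>\d+)(?P<new>[A-Z*])$"),
}


def classify_mutation(name):
    """Return (level, fields) for a mutation name, or raise ValueError."""
    text = name.strip()
    for level, pattern in PATTERNS.items():
        match = pattern.match(text)
        if match:
            fields = match.groupdict()
            fields["pos"] = int(fields["pos"])  # positions are 1-based integers
            return level, fields
    raise ValueError(f"unrecognised mutation name: {name!r}")
```

Under these assumptions, `classify_mutation("TP53:p.L45Q")` reports the protein level, and the returned level decides which of the three mapping processes in step 4) is entered.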
Further, the formatting into standard names in step 1) is performed as follows:
101) judge whether the gene mutation file contains gene names or transcript names;
102) if it contains gene names, proceed to step 2);
103) if it contains transcript names, remove the version number that follows the transcript name, then proceed to step 2).
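Steps 101) to 103) can be sketched in Python as follows. The sketch assumes that transcript names look like RefSeq accessions with an optional version suffix (for example `NM_000546.5`) and that anything else is a gene name; the function name `standardize_reference` and the pattern are illustrative assumptions, not part of the patent.

```python
import re

# Assumption: a transcript name is a RefSeq-style accession, optionally
# followed by ".<version>"; every other string is treated as a gene name.
TRANSCRIPT_RE = re.compile(r"^(?P<accession>[A-Z]{2}_\d+)(?:\.\d+)?$")


def standardize_reference(ref):
    """Return (kind, standard_name), where kind is 'transcript' or 'gene'."""
    ref = ref.strip()
    match = TRANSCRIPT_RE.match(ref)
    if match:
        # Step 103: drop the version number after the transcript name.
        return "transcript", match.group("accession")
    # Step 102: gene names are passed on unchanged.
    return "gene", ref
```

Stripping the version lets the same accession be used as a key in the indexes built in step 2), whichever version an article happened to cite.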
Further, the amino acid-codon correspondence table in step 2) is built as follows:
201) build the mapping relationship between gene names and transcript names;
202) extract the CDS coding-region positions and base sequence of each transcript and map them to the corresponding amino acid codon sequence.
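A minimal Python sketch of the lookup tables in steps 201) and 202) is given here. It uses the standard genetic code and plain dictionaries as hash tables; the helper names `build_gene_index` and `translate`, and the shape of the input records, are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse table: amino acid -> every codon that encodes it.
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    AA_TO_CODONS[aa].append(codon)


def build_gene_index(records):
    """Map each gene name to its transcript names.

    records: iterable of (gene_name, transcript_name) pairs (assumed layout).
    """
    index = defaultdict(list)
    for gene, transcript in records:
        index[gene].append(transcript)
    return dict(index)


def translate(cds):
    """Translate a coding sequence (5'->3') codon by codon."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))
```

Keeping both directions of the codon table in memory means that amino acid to codon lookups and codon to amino acid lookups each cost a single dictionary access.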
Further, in step 4), the mapping process for a mutation named at the protein level is:
401) after reading in the protein-level mutation, calculate the position in the corresponding CDS coding region at which the mutation occurs from the position of the mutated amino acid;
402) the previous step lists all possible CDS coding-region mutations; compare these candidates with the bases at the corresponding positions of the reference sequence and remove the candidates that do not match, obtaining the filtered CDS coding-region mutation position;
403) from the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.
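Steps 401) and 402) can be sketched in Python as an enumerate-then-filter routine. The sketch is restricted to single-base substitutions, uses the standard genetic code, and is tested on a synthetic coding sequence rather than a real transcript; the function name `protein_to_cds_candidates` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
AA_TO_CODONS = {}
for _codon, _aa in CODON_TABLE.items():
    AA_TO_CODONS.setdefault(_aa, []).append(_codon)


def protein_to_cds_candidates(cds, ref_aa, aa_pos, alt_aa):
    """Return [(cds_pos, ref_base, alt_base)] for a change such as L45Q.

    cds is the reference coding sequence; all positions are 1-based.
    """
    start = (aa_pos - 1) * 3  # 0-based offset of the affected codon
    # Enumerate every single-base codon change that turns ref_aa into alt_aa.
    enumerated = []
    for ref_codon, alt_codon in product(AA_TO_CODONS.get(ref_aa, []),
                                        AA_TO_CODONS.get(alt_aa, [])):
        diffs = [i for i in range(3) if ref_codon[i] != alt_codon[i]]
        if len(diffs) == 1:
            enumerated.append((ref_codon, diffs[0], alt_codon[diffs[0]]))
    # Filter: keep only candidates whose reference codon matches the sequence.
    observed = cds[start:start + 3]
    return [
        (start + offset + 1, ref_codon[offset], alt_base)
        for ref_codon, offset, alt_base in enumerated
        if ref_codon == observed
    ]
```

An empty result means that the reference sequence does not encode the stated amino acid at that position, which is one way a mismatch between the article's reference transcript and the indexed one would show up.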
Further, in step 4), the mapping process for a mutation named at the genomic DNA level is:
411) for a genomic DNA-level mutation, calculate the position of the corresponding CDS coding-region mutation from the description of the gene's CDS regions in the gene structure annotation file;
412) extract the DNA sequence of that CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the corresponding protein-level change.
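The coordinate calculation in step 411) can be sketched in Python under some stated assumptions: the CDS regions are given as 1-based inclusive `(start, end)` genomic intervals, one per coding segment, and the strand is known. The function name `genomic_to_cds_pos` and the interval layout are illustrative, not the patent's file format.

```python
def genomic_to_cds_pos(genomic_pos, cds_intervals, strand="+"):
    """Convert a genomic coordinate into a 1-based CDS position.

    cds_intervals: list of (start, end) genomic intervals, 1-based inclusive.
    Returns None when the coordinate lies outside every CDS segment.
    """
    # Walk the segments in transcription order: ascending on the plus
    # strand, descending on the minus strand.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    consumed = 0  # coding bases in the segments already passed
    for start, end in ordered:
        if start <= genomic_pos <= end:
            offset = genomic_pos - start if strand == "+" else end - genomic_pos
            return consumed + offset + 1
        consumed += end - start + 1
    return None
```

For a gene on the minus strand the reported bases would also have to be complemented before they are compared with the coding sequence; that detail is left out of this sketch.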
Further, in step 4), the mapping process for a mutation named at the CDS coding-region level is:
421) given the position of the CDS mutation and the base change, use the CDS mutation position to obtain the DNA sequence of the CDS region from the index of the mRNA sequence file for the transcript;
422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, and compare the amino acid sequences before and after the mutation to locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation;
423) traverse the CDS regions in the gene's structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA-level mutation.
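Step 422) amounts to translating the coding sequence before and after the substitution and reporting the first residue that differs. The Python sketch here assumes the standard genetic code and a single-base substitution, and it is exercised on a synthetic sequence; the function name `cds_to_protein_change` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, ignoring a partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def cds_to_protein_change(cds, cds_pos, ref_base, alt_base):
    """Return a name such as 'p.L45Q', or None for a synonymous change."""
    if cds[cds_pos - 1] != ref_base:
        raise ValueError("reference base does not match the CDS sequence")
    mutated = cds[:cds_pos - 1] + alt_base + cds[cds_pos:]
    before, after = translate(cds), translate(mutated)
    # Compare the two protein sequences residue by residue.
    for index, (old, new) in enumerate(zip(before, after), start=1):
        if old != new:
            return f"p.{old}{index}{new}"
    return None
```

Checking the reference base first catches the case in which the coding-region position was reported against a different transcript from the one that was indexed.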
Compared with the prior art, the DNA and protein level mutation analysis method of the present invention has the following advantages:
The present invention takes a gene mutation file as input, automatically identifies whether each mutation name is at the DNA, RNA or protein level, and then uses the REFSEQ gene transcript annotation file and sequence annotation file to determine the position at which the mutation occurs at each level and the corresponding base and amino acid changes. The present invention accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so as to annotate the correspondence between pathogenic variants mined from the literature and the gene mutations and polymorphic sites identified by sequencing.
Brief description of the drawings
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is the file of risk mutation sites for genetic diseases used in the embodiment of the present invention.
Fig. 3 is the mapping result file of the embodiment of the present invention.
Detailed description
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Principle of the present invention:
Mapping mutations between levels is essentially a matter of locating positions and calculating mutation results at the different levels; mutations at different levels require different mapping methods and steps. The present invention is mainly aimed at the situation in which disordered mutation names at different levels cannot be used together directly: it constructs the relationships among the mutation results at every level so that the mutation results are easier to use further.
As shown in Fig. 1, the specific steps are as follows:
First, the gene transcript structures and sequences and the amino acid-base relationships are indexed. REFSEQ is a stable, persistent gene annotation database. Its gene structure annotation file and sequence file are used to build hash tables so that a gene can be mapped quickly to its transcripts and then to the transcript structure, such as intron regions and exon regions. The correspondence between amino acids and bases (codons) is also stored in a hash table so that amino acid sequences and base sequences can be converted quickly.
Next, the data type of the file to be mapped is judged. Researchers usually do not give standard gene names or transcript names, so the submitted file must first be standardized into the standard annotation format before the next mapping step can be carried out.
Finally, the mapping relationships are calculated:
For a protein-level mutation, after the mutation is read in, the position at which the corresponding CDS coding region is mutated is calculated from the position of the mutated amino acid. Because of codon degeneracy, this step lists all possible CDS coding-region mutations; the candidates are then compared with the bases at the corresponding positions of the reference sequence, and those that do not match are removed. The CDS mutation is obtained after this filtering. Next, from the position of the CDS mutation, the transcript structure annotation information is used to find the site of the mutation on the genome and the sequence change.
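The last part of this route, locating a CDS mutation on the genome from the transcript structure annotation, can be sketched in Python. The sketch assumes that the annotation supplies the CDS segments as 1-based inclusive `(start, end)` genomic intervals together with the strand, and that genomic-level names are written on the plus strand; the function name `cds_to_genomic` is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def cds_to_genomic(cds_pos, ref_base, alt_base, cds_intervals, strand="+"):
    """Return (genomic_pos, ref_base, alt_base) for a 1-based CDS position.

    cds_intervals: list of (start, end) genomic intervals, 1-based inclusive.
    """
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    remaining = cds_pos - 1  # coding bases still to walk past
    for start, end in ordered:
        length = end - start + 1
        if remaining < length:
            if strand == "+":
                return start + remaining, ref_base, alt_base
            # Minus strand: count back from the segment end and report
            # the complementary bases.
            return (end - remaining,
                    ref_base.translate(COMPLEMENT),
                    alt_base.translate(COMPLEMENT))
        remaining -= length
    raise ValueError("CDS position lies beyond the annotated coding region")
```

With the toy intervals `[(101, 106), (201, 206)]`, CDS position 7 is the first base of the second segment on the plus strand and the last base of the first segment on the minus strand.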
For a mutation at the CDS coding-region level, the CDS mutation position is used to obtain the DNA sequence of the CDS region from the index of the mRNA sequence file for the transcript. The DNA sequence is then converted into the corresponding amino acid sequence using the base-amino acid relation table, and the amino acid sequences before and after the mutation are compared to locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation. Further, the CDS regions in the gene's structure annotation information are traversed to calculate the genomic position and sequence change, thereby mapping out the genomic DNA-level mutation.
For a mutation at the genomic DNA level, the position of the corresponding CDS coding-region mutation is calculated from the description of the gene's CDS regions in the gene structure annotation file. The DNA sequence of that CDS is then extracted according to the region annotation and converted into the corresponding amino acid sequence, finally giving the corresponding protein-level change.
The mapping result file contains the correspondence among the genomic DNA, CDS coding-region (RNA) and protein-level mutations. Users can unify the mutation results at whichever level they need and apply them to the next step of their research.
A specific example of implementing the above method is as follows:
Risk mutation sites for common genetic diseases were manually mined from PubMed articles, as shown in Fig. 2. Using a patient's whole-exome sequencing results together with bioinformatics mutation-mining tools and pipelines (for single-base mutations and small insertions and deletions), the corresponding mutation annotation results can be obtained; these are usually mutations at the genomic DNA level. The mutation descriptions commonly used by authors in the literature, however, are CDS coding-region mutations and protein-level mutations. To apply the literature-mining results here, the collected mutations must therefore first be mapped to genomic DNA-level mutations.
Following the above method, the resulting mapping file contains the correspondence among the genomic DNA, CDS coding-region (RNA) and protein-level mutations, as shown in Fig. 3.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A DNA and protein level mutation analysis method, characterized in that it comprises the following steps:
1) read the gene mutation file and format the names it contains into standard names;
2) index the transcript sequences, gene information and gene transcript annotation information, and build the amino acid-codon correspondence table;
3) judge the level at which each mutation occurs and the type of mutation, that is, whether the mutation name denotes a protein-level mutation, a genomic DNA-level mutation or a CDS coding-region mutation;
4) according to the result of step 3), enter the mutation mapping process for the corresponding level and obtain the mapping relationships among the three mutation names.
2. The DNA and protein level mutation analysis method according to claim 1, characterized in that the formatting into standard names in step 1) is performed as follows:
101) judge whether the gene mutation file contains gene names or transcript names;
102) if it contains gene names, proceed to step 2);
103) if it contains transcript names, remove the version number that follows the transcript name, then proceed to step 2).
3. The DNA and protein level mutation analysis method according to claim 1, characterized in that the amino acid-codon correspondence table in step 2) is built as follows:
201) build the mapping relationship between gene names and transcript names;
202) extract the CDS coding-region positions and base sequence of each transcript and map them to the corresponding amino acid codon sequence.
4. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the protein level is:
401) after reading in the protein-level mutation, calculate the position in the corresponding CDS coding region at which the mutation occurs from the position of the mutated amino acid;
402) the previous step lists all possible CDS coding-region mutations; compare these candidates with the bases at the corresponding positions of the reference sequence and remove the candidates that do not match, obtaining the filtered CDS coding-region mutation position;
403) from the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.
5. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the genomic DNA level is:
411) for a genomic DNA-level mutation, calculate the position of the corresponding CDS coding-region mutation from the description of the gene's CDS regions in the gene structure annotation file;
412) extract the DNA sequence of that CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the corresponding protein-level change.
6. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the CDS coding-region level is:
421) given the position of the CDS mutation and the base change, use the CDS mutation position to obtain the DNA sequence of the CDS region from the index of the mRNA sequence file for the transcript;
422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, and compare the amino acid sequences before and after the mutation to locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation;
423) traverse the CDS regions in the gene's structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA-level mutation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610319389.5A CN106021983B (en) | 2016-05-13 | 2016-05-13 | A kind of DNA and protein level mutation analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021983A (en) | 2016-10-12 |
CN106021983B (en) | 2018-07-24 |
Family
ID=57100773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610319389.5A Expired - Fee Related CN106021983B (en) | 2016-05-13 | 2016-05-13 | A kind of DNA and protein level mutation analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021983B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030190649A1 (en) * | 2002-03-25 | 2003-10-09 | Jan Aerts | Data mining of SNP databases for the selection of intragenic SNPs |
US20050246106A1 (en) * | 2003-09-18 | 2005-11-03 | Applera Corporation | Methods and systems for identifying genes, splice variants, and transcripts using an evidence mapping approach |
CN105420351A (en) * | 2015-10-16 | 2016-03-23 | 深圳华大基因研究院 | Method and system for determining individual gene mutation |
CN105420374A (en) * | 2015-12-22 | 2016-03-23 | 武汉菲沙基因信息有限公司 | Induced totipotential stem cell early-stage application mutation detection method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108363904A (en) * | 2018-02-07 | 2018-08-03 | 南京林业大学 | A kind of CodonNX systems and its optimization method for the optimization of xylophyta genetic codon |
CN108363904B (en) * | 2018-02-07 | 2019-06-28 | 南京林业大学 | A kind of CodonNX system and its optimization method for the optimization of xylophyta genetic codon |
CN111073998A (en) * | 2018-10-19 | 2020-04-28 | 深圳华大生命科学研究院 | Virus genome mutation detection method, device and storage medium |
CN110060742A (en) * | 2019-03-15 | 2019-07-26 | 南京派森诺基因科技有限公司 | A kind of gtf document analysis method and tool |
CN109961825A (en) * | 2019-03-29 | 2019-07-02 | 郑州大学 | A method of the protein structure partial 3 d modeling based on gene SNP site mutation |
CN109961825B (en) * | 2019-03-29 | 2022-12-02 | 郑州大学 | Protein structure local three-dimensional modeling method based on gene SNP site mutation |
CN111128300A (en) * | 2019-12-26 | 2020-05-08 | 上海市精神卫生中心(上海市心理咨询培训中心) | Protein interaction influence judgment method based on mutation information |
CN111128300B (en) * | 2019-12-26 | 2023-03-24 | 上海市精神卫生中心(上海市心理咨询培训中心) | Protein interaction influence judgment method based on mutation information |
CN113380325A (en) * | 2021-05-26 | 2021-09-10 | 杭州电子科技大学 | Method for detecting amino acid mutation based on codon mutation site |
Also Published As
Publication number | Publication date |
---|---|
CN106021983B (en) | 2018-07-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20180724 |