CN106021983A - DNA and protein level mutation analysis method - Google Patents

Info

Publication number
CN106021983A
Authority
CN
China
Prior art keywords
sudden change
mutation
gene
cds
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610319389.5A
Other languages
Chinese (zh)
Other versions
CN106021983B (en)
Inventor
薛成海
龚永辉
王晓君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wankangyuan (tianjin) Gene Technology Co Ltd
Original Assignee
Wankangyuan (tianjin) Gene Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wankangyuan (tianjin) Gene Technology Co Ltd
Priority to CN201610319389.5A
Publication of CN106021983A
Application granted
Publication of CN106021983B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a DNA and protein level mutation analysis method. The method comprises: step 1, reading a gene mutation file and formatting its entries into standard names; step 2, indexing transcript sequences, gene information and gene transcript annotation information, and building an amino acid-codon correspondence table; step 3, judging the level at which each mutation occurs and its mutation pattern, that is, whether the mutation name describes a protein level mutation, a genomic DNA level mutation or a CDS coding region mutation; and step 4, entering the mapping process for the corresponding level according to the result of step 3, thereby obtaining the mapping relationships among the three kinds of mutation names. The method takes phenotype-associated gene mutations and polymorphic sites mined from the literature as input and outputs the mapping relationships among the different mutation names, so that pathogenic variants mined from the literature can be matched to, and used to annotate, the gene mutations and polymorphic sites identified by sequencing.

Description

DNA and protein level mutation analysis method
Technical field
The invention belongs to the field of gene information data processing, and in particular relates to a DNA and protein level mutation analysis method.
Background art
In the more than 50 years since Watson and Crick discovered the double-helix structure of DNA, the study of genetic variation has played a key role in research on the onset, progression, prevention and treatment of human disease. The completion of the Human Genome Project opened up broad scope for identifying genetic variation associated with further diseases and phenotypes. In recent years, from microarray chips and Sanger sequencing to today's high-throughput sequencing, technical progress has allowed more and more genetic variants and polymorphic sites to be detected. They reveal the mechanisms of disease and of numerous phenotypes at the molecular level, and bring new hope for unlocking the secrets of life and conquering intractable diseases.
However, the gene mutations and polymorphic sites identified by different researchers lack a unified naming convention. For example, the tumor suppressor gene TP53 has a T-to-A base mutation at genomic position 7579553. Some researchers name it directly by genomic position (TP53:g.7579553T>A), some by the change in the gene coding region (TP53:c.134T>A), and others by the resulting change at the protein level (TP53:p.L45Q). Even for descriptions at the same protein level, differences in the reference gene sequence used when identifying the mutation or polymorphism lead to different final names, or even to confusion that makes the names unusable. For the L45Q mutation of TP53 alone, the reference transcripts used by different studies include NM_001126112, NM_000546, NM_001126113 and NM_001126114. Naming at these different levels makes it very difficult for later researchers to carry out unified analysis and annotation efficiently and accurately on the basis of earlier results. For example, text mining of the reported literature for gene mutations and polymorphic sites related to human breast cancer found more than 4,000 PubMed documents and extracted more than 3,600 gene mutations and polymorphic sites in total, but because no consistent naming scheme is used, these text-mining results are difficult to apply in subsequent analysis.
In recent years, next-generation sequencing has become increasingly widespread, and a large amount of bioinformatics analysis software has emerged with it. Against this background, researchers can quickly use existing, mature bioinformatics software and pipelines to analyze massive genome sequencing data, for example to identify gene mutations and polymorphic sites. These mutations can only be put to use, for example in precision medicine for personalized treatment, diagnosis and therapy of disease, once they have been further understood and annotated in the light of previous research. Because a large body of previous research follows no unified standard for naming gene mutations, the analysis results are difficult to annotate and interpret further.
Summary of the invention
In view of this, the present invention proposes a DNA and protein level mutation analysis method that accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so that pathogenic variants mined from the literature can be matched to, and used to annotate, the gene mutations and polymorphic sites identified by sequencing.
To achieve the above purpose, the technical solution of the present invention is realized as follows: a DNA and protein level mutation analysis method, comprising the following steps:
1) read the gene mutation file and format its entries into standard names;
2) index the transcript sequences, gene information and gene transcript annotation information, and build the amino acid-codon correspondence table;
3) judge the level at which the mutation occurs and the pattern of the mutation; judge whether the mutation name describes a protein level mutation, a genomic DNA level mutation or a CDS coding region mutation;
4) according to the judgment result of step 3), enter the mapping process for the corresponding mutation level and obtain the mapping relationships among the three kinds of mutation names.
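The level judgment in step 3) can be illustrated with a minimal sketch. It assumes that names follow the `reference:level.change` shape of the TP53 examples in the background section; the regular expression, the function name `classify` and the level labels are illustrative assumptions, not part of the patent.

```python
import re

# Hypothetical pattern for names such as "TP53:g.7579553T>A" (assumed input shape).
NAME_RE = re.compile(r"^(?P<ref>[^:]+):(?P<level>[gcp])\.(?P<change>.+)$")

# Prefix letter -> mutation level used to pick the mapping process in step 4).
LEVELS = {"g": "genomic DNA", "c": "CDS coding region", "p": "protein"}


def classify(name):
    """Split a mutation name into (reference, level, change)."""
    m = NAME_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognized mutation name: {name!r}")
    return m.group("ref"), LEVELS[m.group("level")], m.group("change")


print(classify("TP53:p.L45Q"))
```

Dispatching on the returned level is then enough to choose which of the three mapping processes a record enters.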
Further, the formatting into standard names in step 1) is performed as follows:
101) judge whether the gene mutation file contains gene names or transcript names;
102) if it contains gene names, proceed to step 2);
103) if it contains transcript names, remove the version number that follows the transcript name, then proceed to step 2).
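Steps 101) to 103) can be sketched as follows. The sketch assumes RefSeq-style transcript accessions of the form `NM_...` or `NR_...` with an optional `.N` version suffix; the pattern and the function name `standardize` are assumptions made for illustration.

```python
import re

# Assumed shape of a transcript accession: "NM_000546" or "NM_000546.5".
TRANSCRIPT_RE = re.compile(r"^(N[MR]_\d+)(\.\d+)?$")


def standardize(identifier):
    """Return (kind, standard_name) for a gene name or a transcript name."""
    ident = identifier.strip()
    m = TRANSCRIPT_RE.match(ident)
    if m:
        # Step 103): drop the version number after the transcript name.
        return "transcript", m.group(1)
    # Step 102): gene names pass through unchanged.
    return "gene", ident


print(standardize("NM_000546.5"))
print(standardize("TP53"))
```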
Further, the steps of building the amino acid-codon correspondence table in step 2) are:
201) build the mapping relationships between gene names and transcript names;
202) extract the CDS coding region positions and base sequence of each transcript and map them to the corresponding amino acid codon sequence.
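A minimal sketch of the two lookup tables in step 2), using plain Python dictionaries as hash tables. The codon table is the standard genetic code written in TCAG order; the function name `index_transcripts` and the `(gene, transcript)` record shape are assumptions, since the patent does not specify the file layout.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon -> amino acid, and the reverse amino acid -> list of codons.
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def index_transcripts(records):
    """Build gene name -> transcript names from (gene, transcript) pairs."""
    gene_to_tx = {}
    for gene, tx in records:
        gene_to_tx.setdefault(gene, []).append(tx)
    return gene_to_tx


print(CODON_TO_AA["CTG"], sorted(AA_TO_CODONS["Q"]))
```

The reverse table is what makes the protein level mapping possible, because one amino acid generally corresponds to several codons.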
Further, in step 4), the mapping process for a mutation named at the protein level is:
401) after reading in the protein level mutation, calculate from the position of the amino acid mutation the positions in the corresponding CDS coding region at which the mutation may occur;
402) the previous step lists all possible CDS coding region mutations; compare each against the base at the corresponding position of the reference sequence and remove the results that do not match, obtaining the CDS coding region mutation positions after filtering;
403) from the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.
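Steps 401) and 402) can be sketched as below. The sketch makes two assumptions that the patent does not state: only single-base substitutions are kept as candidates, and a name whose reference residue does not match the transcript yields no result. The function name `protein_to_cds` and the toy sequence are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def protein_to_cds(cds, ref_aa, aa_pos, alt_aa):
    """List (cds_position, ref_base, alt_base) changes turning ref_aa into alt_aa.

    cds is the coding sequence of the transcript; positions are 1-based.
    """
    start = 3 * (aa_pos - 1)  # 0-based offset of the codon's first base
    ref_codon = cds[start:start + 3]
    if CODON_TO_AA.get(ref_codon) != ref_aa:
        return []  # the named reference residue does not match this transcript
    hits = []
    for alt_codon, aa in CODON_TO_AA.items():
        if aa != alt_aa:
            continue
        diffs = [i for i in range(3) if ref_codon[i] != alt_codon[i]]
        if len(diffs) == 1:  # assumption: keep single-base substitutions only
            i = diffs[0]
            hits.append((start + i + 1, ref_codon[i], alt_codon[i]))
    return hits


# Toy CDS "ATG CTG TAA" (Met-Leu-stop): L2Q can only arise from CTG -> CAG.
print(protein_to_cds("ATGCTGTAA", "L", 2, "Q"))
```

With the same arithmetic, amino acid 45 covers CDS positions 133 to 135, which is consistent with the c.134 position quoted for TP53 L45Q in the background section.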
Further, in step 4), the mapping process for a mutation named at the genomic DNA level is:
411) for a genomic DNA level mutation, calculate the position of the corresponding CDS coding region mutation from the description of the gene's CDS regions in the gene structure annotation file;
412) extract the DNA sequence of that CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the change at the corresponding protein level.
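The coordinate arithmetic behind step 411) can be sketched as follows. The interval representation (1-based, inclusive `(start, end)` pairs), the `strand` parameter and the function names are assumptions for illustration; the patent only says that the CDS regions come from the gene structure annotation file.

```python
def genome_to_cds(pos, cds_regions, strand="+"):
    """Map a 1-based genomic position to a 1-based CDS position.

    cds_regions: list of (start, end) genomic intervals, 1-based inclusive.
    Returns None when the position lies outside every CDS interval.
    """
    # Walk the intervals in transcription order (reversed on the minus strand).
    ordered = sorted(cds_regions, reverse=(strand == "-"))
    offset = 0  # number of coding bases in the intervals already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


def codon_index(cds_pos):
    """1-based amino acid position of the codon containing a 1-based CDS position."""
    return (cds_pos - 1) // 3 + 1


regions = [(101, 106), (201, 209)]  # toy CDS intervals, not real annotation
print(genome_to_cds(201, regions), codon_index(134))
```

Once the CDS position is known, the affected codon can be translated before and after the change to obtain the protein level result described in step 412).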
Further, in step 4), the mapping process for a mutation named at the CDS coding region level is:
421) given the position of the CDS mutation and the base change, use the CDS mutation position and the index of the mRNA sequence file for the transcript to obtain the DNA sequence of the CDS region;
422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, compare the amino acid sequences before and after the mutation, and locate the position at which the amino acid changes and the change itself, thereby mapping out the protein level mutation;
423) traverse the CDS regions in the gene's structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA level mutation.
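Steps 422) and 423) can be sketched as below for a single-base substitution. The function names, the interval representation and the toy sequence are assumptions; complementing the bases for genes on the minus strand is deliberately left out of this sketch.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS codon by codon, ignoring an incomplete trailing codon."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def cds_to_protein(cds, pos, ref, alt):
    """Step 422): compare translations before and after a 1-based substitution."""
    if cds[pos - 1] != ref:
        raise ValueError("reference base does not match the CDS")
    mutated = cds[:pos - 1] + alt + cds[pos:]
    for i, (a, b) in enumerate(zip(translate(cds), translate(mutated)), start=1):
        if a != b:
            return f"p.{a}{i}{b}"
    return None  # synonymous change: the amino acid sequence is unchanged


def cds_to_genome(cds_pos, cds_regions, strand="+"):
    """Step 423): walk the CDS intervals to find the genomic position."""
    remaining = cds_pos - 1
    for start, end in sorted(cds_regions, reverse=(strand == "-")):
        length = end - start + 1
        if remaining < length:
            return start + remaining if strand == "+" else end - remaining
        remaining -= length
    return None


print(cds_to_protein("ATGCTGTAA", 5, "T", "A"))
print(cds_to_genome(7, [(101, 106), (201, 209)]))
```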
Compared with the prior art, the DNA and protein level mutation analysis method of the present invention has the following advantages:
The present invention takes a gene mutation file as input, automatically recognizes whether each mutation name is at the DNA, RNA or protein level, and then uses the RefSeq gene transcript annotation file and sequence annotation file to determine the position at which the mutation occurs at each level and the corresponding base and amino acid changes. The present invention accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so that pathogenic variants mined from the literature can be matched to, and used to annotate, the gene mutations and polymorphic sites identified by sequencing.
Brief description of the drawings
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention; the illustrative embodiments and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is the risk mutation site file for hereditary diseases in the embodiment of the present invention.
Fig. 3 is the mapping result file of the embodiment of the present invention.
Detailed description of the invention
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Principle of the present invention:
Mapping between mutations at different levels is in essence a matter of locating positions and computing mutation results at each level; mutations at different levels require different mapping methods and steps. The present invention mainly addresses the situation in which disordered mutation names at different levels cannot be applied uniformly, by constructing the mapping relationships among the mutation results at every level so that the results can conveniently be used further.
As shown in Fig. 1, the specific steps are as follows:
First, the gene transcript structures and sequences and the amino acid-base relationships are indexed. RefSeq is a stable, long-term gene annotation database. The gene structure annotation file and sequence file it provides are used to build hash tables, which allow fast lookup from a gene to its transcripts and then to the transcript structure, such as intron and exon regions. The correspondence between amino acids and bases (codons) is also stored in a hash table so that amino acid sequences and base sequences can be converted quickly.
Next, the data type of the file to be mapped is judged. Researchers generally do not provide standard gene names or transcript names, so the submitted file must be standardized into the standard annotation format before the next mapping step.
Finally, the mapping relationships are calculated:
For a protein level mutation, after the mutation is read in, the position of the mutation in the corresponding CDS coding region is calculated from the position of the amino acid mutation. Because the genetic code is degenerate, this step lists all possible CDS coding region mutations; each is then compared against the base at the corresponding position of the reference sequence, and the results that do not match are removed. The CDS mutation is obtained after filtering. Then, from the position of the CDS mutation, the transcript structure annotation information is used to find the site of the mutation on the genome and the sequence change.
For a CDS coding region level mutation, the DNA sequence of the CDS region is obtained from the CDS mutation position and the index of the mRNA sequence file for the transcript. The DNA sequence is then converted into the corresponding amino acid sequence using the base-amino acid relation table, and the amino acid sequences before and after the mutation are compared to locate the position at which the amino acid changes and the change itself, thereby mapping out the protein level mutation. Further, the CDS regions in the gene's structure annotation information are traversed to calculate the genomic position and sequence change, thereby mapping out the genomic DNA level mutation.
For a genomic DNA level mutation, the position of the corresponding CDS coding region mutation is calculated from the description of the gene's CDS regions in the gene structure annotation file. The DNA sequence of that CDS is then extracted according to the region annotation and converted into the corresponding amino acid sequence, finally giving the change at the corresponding protein level.
The mapping result file contains the correspondence among the genomic DNA, CDS coding region (RNA) and protein level mutations. Users can unify on the mutation results of whichever level they need and apply them in subsequent research.
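One possible shape for such a result file is a tab-separated table with one row per mutation and one column per naming level. This is only a sketch: the column names and the function name `write_mapping` are assumptions, and the example row simply reuses the TP53 names quoted in the background section.

```python
import csv
import io


def write_mapping(rows):
    """Serialize (gene, transcript, genomic, cds, protein) rows as TSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Hypothetical header; the patent does not give the column layout.
    writer.writerow(["gene", "transcript", "genomic", "cds", "protein"])
    writer.writerows(rows)
    return buf.getvalue()


rows = [("TP53", "NM_000546", "g.7579553T>A", "c.134T>A", "p.L45Q")]
print(write_mapping(rows), end="")
```

Keeping all three names on one row lets a user filter or join on whichever level their downstream data uses.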
A specific example of implementing the above method is as follows:
Risk mutation sites for common hereditary diseases were manually mined from PubMed articles, as shown in Fig. 2. Applying bioinformatics mutation-mining tools and pipelines (for single-base mutations and small insertions and deletions) to a patient's whole-exome sequencing results yields the corresponding mutation annotation results, which are generally genomic DNA level mutations. Literature researchers, however, usually describe mutations as CDS coding region mutations or protein level mutations. Therefore, to apply the text-mining results here, the collected mutations must first be mapped to genomic DNA level mutations.
Following the above method, the resulting mapping file contains the correspondence among the genomic DNA, CDS coding region (RNA) and protein level mutations, as shown in Fig. 3.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (6)

1. A DNA and protein level mutation analysis method, characterized by comprising the following steps:
1) read the gene mutation file and format its entries into standard names;
2) index the transcript sequences, gene information and gene transcript annotation information, and build the amino acid-codon correspondence table;
3) judge the level at which the mutation occurs and the pattern of the mutation; judge whether the mutation name describes a protein level mutation, a genomic DNA level mutation or a CDS coding region mutation;
4) according to the judgment result of step 3), enter the mapping process for the corresponding mutation level and obtain the mapping relationships among the three kinds of mutation names.
2. The DNA and protein level mutation analysis method according to claim 1, characterized in that the formatting into standard names in step 1) is performed as follows:
101) judge whether the gene mutation file contains gene names or transcript names;
102) if it contains gene names, proceed to step 2);
103) if it contains transcript names, remove the version number that follows the transcript name, then proceed to step 2).
3. The DNA and protein level mutation analysis method according to claim 1, characterized in that the steps of building the amino acid-codon correspondence table in step 2) are:
201) build the mapping relationships between gene names and transcript names;
202) extract the CDS coding region positions and base sequence of each transcript and map them to the corresponding amino acid codon sequence.
4. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the protein level is:
401) after reading in the protein level mutation, calculate from the position of the amino acid mutation the positions in the corresponding CDS coding region at which the mutation may occur;
402) the previous step lists all possible CDS coding region mutations; compare each against the base at the corresponding position of the reference sequence and remove the results that do not match, obtaining the CDS coding region mutation positions after filtering;
403) from the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.
5. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the genomic DNA level is:
411) for a genomic DNA level mutation, calculate the position of the corresponding CDS coding region mutation from the description of the gene's CDS regions in the gene structure annotation file;
412) extract the DNA sequence of that CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the change at the corresponding protein level.
6. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process for a mutation named at the CDS coding region level is:
421) given the position of the CDS mutation and the base change, use the CDS mutation position and the index of the mRNA sequence file for the transcript to obtain the DNA sequence of the CDS region;
422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, compare the amino acid sequences before and after the mutation, and locate the position at which the amino acid changes and the change itself, thereby mapping out the protein level mutation;
423) traverse the CDS regions in the gene's structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA level mutation.
CN201610319389.5A 2016-05-13 2016-05-13 A kind of DNA and protein level mutation analysis method Expired - Fee Related CN106021983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610319389.5A CN106021983B (en) 2016-05-13 2016-05-13 A kind of DNA and protein level mutation analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610319389.5A CN106021983B (en) 2016-05-13 2016-05-13 A kind of DNA and protein level mutation analysis method

Publications (2)

Publication Number Publication Date
CN106021983A true CN106021983A (en) 2016-10-12
CN106021983B CN106021983B (en) 2018-07-24

Family

ID=57100773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610319389.5A Expired - Fee Related CN106021983B (en) 2016-05-13 2016-05-13 A kind of DNA and protein level mutation analysis method

Country Status (1)

Country Link
CN (1) CN106021983B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030190649A1 (en) * 2002-03-25 2003-10-09 Jan Aerts Data mining of SNP databases for the selection of intragenic SNPs
US20050246106A1 (en) * 2003-09-18 2005-11-03 Applera Corporation Methods and systems for identifying genes, splice variants, and transcripts using an evidence mapping approach
CN105420351A (en) * 2015-10-16 2016-03-23 深圳华大基因研究院 Method and system for determining individual gene mutation
CN105420374A (en) * 2015-12-22 2016-03-23 武汉菲沙基因信息有限公司 Induced totipotential stem cell early-stage application mutation detection method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363904A (en) * 2018-02-07 2018-08-03 南京林业大学 A kind of CodonNX systems and its optimization method for the optimization of xylophyta genetic codon
CN108363904B (en) * 2018-02-07 2019-06-28 南京林业大学 A kind of CodonNX system and its optimization method for the optimization of xylophyta genetic codon
CN111073998A (en) * 2018-10-19 2020-04-28 深圳华大生命科学研究院 Virus genome mutation detection method, device and storage medium
CN110060742A (en) * 2019-03-15 2019-07-26 南京派森诺基因科技有限公司 A kind of gtf document analysis method and tool
CN109961825A (en) * 2019-03-29 2019-07-02 郑州大学 A method of the protein structure partial 3 d modeling based on gene SNP site mutation
CN109961825B (en) * 2019-03-29 2022-12-02 郑州大学 Protein structure local three-dimensional modeling method based on gene SNP site mutation
CN111128300A (en) * 2019-12-26 2020-05-08 上海市精神卫生中心(上海市心理咨询培训中心) Protein interaction influence judgment method based on mutation information
CN111128300B (en) * 2019-12-26 2023-03-24 上海市精神卫生中心(上海市心理咨询培训中心) Protein interaction influence judgment method based on mutation information
CN113380325A (en) * 2021-05-26 2021-09-10 杭州电子科技大学 Method for detecting amino acid mutation based on codon mutation site

Also Published As

Publication number Publication date
CN106021983B (en) 2018-07-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180724