CN106021983A - DNA and protein level mutation analysis method

Info

Publication number: CN106021983A
Application number: CN201610319389.5A
Authority: CN
Inventors: 薛成海; 龚永辉; 王晓君
Original assignee: Wankangyuan (tianjin) Gene Technology Co Ltd
Current assignee: Wankangyuan (tianjin) Gene Technology Co Ltd
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2016-10-12
Anticipated expiration: 2036-05-13
Also published as: CN106021983B

Abstract

The invention discloses a DNA and protein level mutation analysis method. The method comprises the steps of: 1, reading a gene mutation file and formatting its entries into standard names; 2, indexing transcript sequences, gene information and gene transcript annotation information, and building an amino acid-codon correspondence table; 3, determining the level and pattern of each mutation, that is, whether a mutation name denotes a protein-level mutation, a genomic DNA level mutation or a CDS coding region mutation; and 4, entering the mutation mapping process for the corresponding level according to the result of step 3, thereby obtaining the mapping relationships among the three kinds of mutation names. The method accepts phenotype-associated gene mutations and polymorphic sites mined from the literature and outputs the mapping relationships among the various mutation names, so that pathogenic variants mined from the literature can be matched to the gene mutations and polymorphic sites identified by sequencing.

Description

A DNA and protein level mutation analysis method

Technical field

The invention belongs to the field of genetic information data processing, and in particular relates to a DNA and protein level mutation analysis method.

Background art

In the more than 50 years since Watson and Crick discovered the double-helix structure of DNA, the study of genetic variation has played a key role in research on the occurrence, development, prevention and treatment of human disease, and the completion of the Human Genome Project opened up broad scope for identifying more disease- and phenotype-associated genetic variants. In recent years, from microarray chips and Sanger sequencing to today's high-throughput sequencing, advances in technology have allowed more and more genetic variants and polymorphic sites to be detected. They reveal the mechanisms of disease and of numerous phenotypes at the molecular level, and bring new hope for unlocking the secrets of life and overcoming intractable diseases.

However, the gene mutations and polymorphic sites identified by different researchers lack a unified naming convention. For example, the tumor suppressor gene TP53 carries a T-to-A base mutation at genomic position 7579553. Some researchers name it directly by its genomic position (TP53:g.7579553T>A), some name it by the variation in the gene coding region (TP53:c.134T>A), and others name it by the resulting protein-level variation (TP53:p.L45Q). Even when mutations are described at the same protein level, differences in the reference gene sequence used to identify the mutation or polymorphism lead to different final names, or even to confusion that makes the names unusable. For the L45Q mutation of TP53 alone, the reference transcripts used by different studies include NM_001126112, NM_000546, NM_001126113 and NM_001126114. Naming at these different levels ultimately makes it very difficult for later researchers to carry out unified analysis and annotation efficiently and accurately on the basis of earlier results. For example, a previously reported literature-mining effort on gene mutations and polymorphic sites associated with human breast cancer found more than 4000 PubMed articles and extracted more than 3600 gene mutations and polymorphic sites in total, but because a consistent naming scheme is lacking, it is difficult to use these literature-mining results in subsequent analysis.

In recent years, next-generation sequencing has become more and more widely used, and a large amount of bioinformatics analysis software has emerged with it. Against this background, researchers can quickly use existing, mature bioinformatics software and pipelines to analyze massive genome sequencing data, for example to identify gene mutations and polymorphic sites. These mutations can only be put to use, for example in precision medicine for personalized medicine, diagnosis and treatment of disease, once they have been further understood and annotated in light of earlier research. Because a large body of earlier research follows no unified standard for naming gene mutations, it is difficult to further annotate and understand the analysis results.

Summary of the invention

In view of this, the present invention proposes a DNA and protein level mutation analysis method that accepts phenotype-associated gene mutations and polymorphic sites obtained by literature mining and outputs the mapping relationships among the various mutation names, in order to annotate the correspondence between pathogenic variants mined from the literature and the gene mutations and polymorphic sites identified by sequencing.

To achieve the above object, the technical solution of the present invention is realized as follows: a DNA and protein level mutation analysis method, comprising the following steps:

1) read the gene mutation file and format its entries into standard names;

2) index the transcript sequences, gene information and gene transcript annotation information, and build an amino acid-codon correspondence table;

3) determine the level at which the mutation occurs and the mutation pattern, that is, determine whether the mutation name denotes a protein-level mutation, a genomic DNA level mutation or a CDS coding region mutation;

4) according to the result of step 3), enter the mutation mapping process for the corresponding level, and obtain the mapping relationships among the three kinds of mutation names.
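The level decision in step 3) can be illustrated with a minimal sketch. It assumes the `g.` / `c.` / `p.` prefixes seen in the TP53 examples of the background section and handles single-base or single-residue substitutions only; the regular expressions and the function name `classify_mutation` are illustrative assumptions, not part of the disclosed method.

```python
import re

# Hypothetical patterns for the three naming levels used in the examples
# of this document (g. = genomic DNA, c. = CDS, p. = protein).
PATTERNS = {
    "genomic": re.compile(r"^g\.(\d+)([ACGT])>([ACGT])$"),
    "cds": re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$"),
    "protein": re.compile(r"^p\.([A-Z])(\d+)([A-Z])$"),
}


def classify_mutation(name):
    """Return (identifier, level, parsed fields) for a name such as 'TP53:c.134T>A'."""
    identifier, _, change = name.partition(":")
    change = change.replace(" ", "")  # tolerate spacing such as 'c.134T > A'
    for level, pattern in PATTERNS.items():
        match = pattern.match(change)
        if match:
            return identifier, level, match.groups()
    raise ValueError(f"unrecognized mutation name: {name!r}")
```

With these assumptions, `classify_mutation("TP53:p.L45Q")` returns `("TP53", "protein", ("L", "45", "Q"))`, and the returned level selects which of the three mapping processes of step 4) is entered.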

Further, the formatting into standard names in step 1) is performed as follows:

101) determine whether the gene mutation file contains gene names or transcript names;

102) if it contains gene names, proceed to step 2);

103) if it contains transcript names, remove the transcript version number that follows each transcript name, then proceed to step 2).
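Steps 101) to 103) can be sketched as follows. The sketch assumes that transcript names are RefSeq-style accessions such as the NM_ identifiers cited in the background section, optionally followed by a `.version` suffix; the accepted prefixes and the function name `normalize_identifier` are assumptions made for illustration.

```python
import re

# RefSeq-style transcript accession with an optional ".version" suffix,
# e.g. "NM_000546.5"; the accepted prefixes are an assumption of this sketch.
TRANSCRIPT_RE = re.compile(r"^([NX][MR]_\d+)(?:\.\d+)?$")


def normalize_identifier(identifier):
    """Return (kind, standard_name), where kind is 'transcript' or 'gene'."""
    identifier = identifier.strip()
    match = TRANSCRIPT_RE.match(identifier)
    if match:
        return "transcript", match.group(1)  # version number removed (step 103)
    return "gene", identifier  # gene names pass through unchanged (step 102)
```

Anything that does not look like a transcript accession is treated as a gene name here; a production pipeline would presumably validate gene names against the annotation index built in step 2).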

Further, the step of building the amino acid-codon correspondence table in step 2) is:

201) build the mapping relationship between gene names and transcript names;

202) extract the CDS coding region positions and base sequence of each transcript and map them to the corresponding amino acid codon sequence.
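A minimal sketch of step 202) follows. It assumes the standard genetic code, a transcript mRNA held as a plain string, and 1-based inclusive CDS coordinates on that mRNA; the helper names `extract_cds` and `translate` are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon -> amino acid table, plus the reverse table (amino acid -> codons).
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def extract_cds(mrna, cds_start, cds_end):
    """Slice the CDS out of an mRNA using 1-based inclusive coordinates."""
    return mrna[cds_start - 1:cds_end]


def translate(cds):
    """Translate a CDS codon by codon; a trailing partial codon is ignored."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))
```

For the toy mRNA `"GGATGCTGTAACC"` with a CDS at positions 3 to 11, `extract_cds` yields `"ATGCTGTAA"` and `translate` yields `"ML*"`. Keeping both directions of the table in dictionaries gives constant-time lookup in either direction, which the later mapping steps rely on.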

Further, in step 4), the mapping process when the mutation name denotes a protein-level mutation is:

401) after the protein-level mutation result is read in, calculate the position of the mutation in the corresponding CDS coding region from the position of the amino acid mutation;

402) the previous step enumerates all possible CDS coding region mutations; remove those whose reference base does not match the reference sequence at that position, and obtain the CDS coding region mutation positions after filtering;

403) according to the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.
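Steps 401) and 402) can be sketched as below, under several assumptions: the standard genetic code, single-base substitutions only, and a filter that compares the whole reference codon rather than only the changed base (a slightly stricter check than the one described). The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def enumerate_candidates(ref_aa, aa_pos, alt_aa):
    """Step 401: list every single-base codon change that turns ref_aa into alt_aa.

    Returns tuples (cds_pos, ref_codon, alt_codon, ref_base, alt_base),
    where cds_pos is the 1-based position of the changed base in the CDS.
    """
    first = (aa_pos - 1) * 3 + 1  # 1-based CDS position of the codon's first base
    candidates = []
    for ref_codon, alt_codon in product(AA_TO_CODONS[ref_aa], AA_TO_CODONS[alt_aa]):
        diffs = [i for i in range(3) if ref_codon[i] != alt_codon[i]]
        if len(diffs) == 1:  # single-base substitutions only (assumption)
            i = diffs[0]
            candidates.append((first + i, ref_codon, alt_codon, ref_codon[i], alt_codon[i]))
    return candidates


def filter_by_reference(cds, aa_pos, candidates):
    """Step 402: keep candidates whose reference codon matches the reference CDS."""
    observed = cds[(aa_pos - 1) * 3:aa_pos * 3]
    return [(pos, ref, alt) for pos, ref_codon, _, ref, alt in candidates
            if ref_codon == observed]
```

For an L-to-Q change at residue 45, every enumerated candidate falls at CDS position 134, which agrees with the TP53:c.134T>A and TP53:p.L45Q names quoted in the background section. On the toy CDS `"ATGCTGTAA"`, an L-to-Q change at residue 2 has two candidates (CTA to CAA, CTG to CAG), and filtering against the reference leaves only `(5, "T", "A")`.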

Further, in step 4), the mapping process when the mutation name denotes a genomic DNA level mutation is:

411) for a genomic DNA level mutation result, calculate the position of the corresponding CDS coding region mutation according to the CDS region description of the gene in the gene structure annotation file;

412) extract the DNA sequence of this CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the corresponding protein-level change.
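Steps 411) and 412) can be sketched as follows. The sketch assumes a plus-strand transcript whose CDS is described by sorted, 1-based inclusive genomic intervals; a minus-strand gene would additionally need reverse complementation, which is not shown. The function names and the interval representation are assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def genome_to_cds(gpos, blocks):
    """Step 411: convert a genomic position into a 1-based CDS position.

    blocks is a sorted list of (start, end) genomic CDS intervals (plus strand).
    Returns None when the position lies outside every CDS interval.
    """
    offset = 0
    for start, end in blocks:
        if start <= gpos <= end:
            return offset + gpos - start + 1
        offset += end - start + 1
    return None


def genomic_to_protein(gpos, ref, alt, blocks, cds):
    """Step 412: return (cds_pos, ref_aa, aa_pos, alt_aa), or None if unmappable."""
    cpos = genome_to_cds(gpos, blocks)
    if cpos is None or cds[cpos - 1] != ref:
        return None  # non-coding position or reference base mismatch
    codon_start = (cpos - 1) // 3 * 3
    ref_codon = cds[codon_start:codon_start + 3]
    i = (cpos - 1) % 3  # index of the changed base within its codon
    alt_codon = ref_codon[:i] + alt + ref_codon[i + 1:]
    return cpos, CODON_TO_AA[ref_codon], codon_start // 3 + 1, CODON_TO_AA[alt_codon]
```

With the toy CDS `"ATGCTGTAA"` spread over the intervals `[(101, 104), (201, 205)]`, a T-to-A change at genomic position 201 maps to CDS position 5 and to an L-to-Q change at residue 2.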

Further, in step 4), the mapping process when the mutation name denotes a CDS coding region mutation is:

421) given the position of the CDS mutation and the base change, compute the DNA sequence corresponding to this CDS region from the index of the mRNA sequence file of the transcript, according to the CDS mutation position;

422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, compare the amino acid sequences before and after the mutation, and locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation result;

423) traverse the CDS regions in the gene structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA level mutation.
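Steps 422) and 423) can be sketched as below, again assuming the standard genetic code, single-base substitutions, and a plus-strand transcript whose CDS is given as sorted, 1-based inclusive genomic intervals. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a CDS codon by codon; a trailing partial codon is ignored."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def cds_to_protein(cds, cpos, ref, alt):
    """Step 422: compare translations before and after the substitution.

    Returns (ref_aa, aa_pos, alt_aa), or None for a synonymous change.
    """
    if cds[cpos - 1] != ref:
        raise ValueError("reference base does not match the CDS sequence")
    mutated = cds[:cpos - 1] + alt + cds[cpos:]
    for aa_pos, (before, after) in enumerate(zip(translate(cds), translate(mutated)), start=1):
        if before != after:
            return before, aa_pos, after
    return None


def cds_to_genome(cpos, blocks):
    """Step 423: walk the CDS intervals to find the genomic position of cpos."""
    remaining = cpos
    for start, end in blocks:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length
    raise ValueError("CDS position lies beyond the annotated CDS intervals")
```

On the toy CDS `"ATGCTGTAA"` with intervals `[(101, 104), (201, 205)]`, the change at CDS position 5 from T to A gives `("L", 2, "Q")` at the protein level and genomic position 201. Comparing full translations, rather than only the affected codon, follows the wording of step 422) and is a deliberate simplicity-over-speed choice in this sketch.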

Compared with the prior art, the DNA and protein level mutation analysis method of the present invention has the following advantages:

The present invention takes a gene mutation file as input, automatically identifies whether each mutation name is at the DNA, RNA or protein level, and then uses the REFSEQ gene transcript annotation file and sequence annotation file to determine the position at which the mutation occurs at each level and the corresponding base and amino acid changes. The present invention accepts phenotype-associated gene mutations and polymorphic sites obtained by literature mining and outputs the mapping relationships among the various mutation names, in order to annotate the correspondence between pathogenic variants mined from the literature and the gene mutations and polymorphic sites identified by sequencing.

Brief description of the drawings

The accompanying drawings, which form a part of the present invention, are provided for a further understanding of the invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a schematic flow diagram of the method of the present invention.

Fig. 2 is the risk mutation site file for hereditary diseases in an embodiment of the present invention.

Fig. 3 is the mapping result file of an embodiment of the present invention.

Detailed description of the invention

It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

Principle of the present invention:

Mapping between mutations at different levels is in essence a matter of locating positions at each level and calculating the resulting change; mutations at different levels require different mapping methods and steps. The present invention mainly addresses the situation in which disorderly mutation names at different levels cannot be applied together directly: it constructs the relationships among the mutation results at every level, which makes the mutation results easier to use further.

As shown in Fig. 1, the specific steps are as follows:

First, the gene transcript structures and sequences, and the relationship between amino acids and bases, are indexed. REFSEQ is a stable, long-lived gene annotation database. The gene structure annotation file and sequence file it provides are used to build hash tables, so that a gene can be mapped quickly to its transcripts and then to the transcript structure, such as intron regions and exon regions. The correspondence between amino acids and bases (codons) is likewise stored in a hash table, so that amino acid sequences and base sequences can be converted quickly.

Next, the data type of the file to be mapped is determined. Researchers often do not give standard gene names or transcript names, so at this point the submitted file needs to be standardized into the standard annotation format before the next mapping step is carried out.
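The gene-to-transcript-to-structure lookup described above can be sketched with plain dictionaries, which are hash tables in Python. The record layout used here (keys `gene`, `transcript`, `strand`, `exons`) is a hypothetical in-memory form; the actual layout of the REFSEQ annotation file is not specified in this document, so file parsing is left out.

```python
from collections import defaultdict


def build_structure_index(records):
    """Build gene -> transcripts and transcript -> structure hash tables.

    records: iterable of dicts with hypothetical keys 'gene', 'transcript',
    'strand' and 'exons' (a list of 1-based inclusive (start, end) pairs).
    """
    gene_to_transcripts = defaultdict(list)
    transcript_structure = {}
    for rec in records:
        exons = sorted(rec["exons"])
        # Introns are the gaps between consecutive exons.
        introns = [(prev_end + 1, next_start - 1)
                   for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]
        gene_to_transcripts[rec["gene"]].append(rec["transcript"])
        transcript_structure[rec["transcript"]] = {
            "strand": rec["strand"],
            "exons": exons,
            "introns": introns,
        }
    return dict(gene_to_transcripts), transcript_structure
```

A lookup then takes two constant-time steps: the gene name gives the list of transcript names, and each transcript name gives its exon and intron intervals.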

Finally, the mapping relationships are calculated:

For a protein-level mutation, after the protein-level mutation result is read in, the position of the mutation in the corresponding CDS coding region is calculated from the position of the amino acid mutation. Because of the degeneracy of the genetic code, this process enumerates all possible CDS coding region mutations; the candidates whose reference base does not match the reference sequence at that position are then removed, and the CDS mutation is obtained after filtering. Next, according to the position of the CDS mutation, the transcript structure annotation information is used to find the site of the mutation on the genome and the sequence change.

For a mutation at the CDS coding region level, the DNA sequence corresponding to the CDS region can be computed from the index of the mRNA sequence file of the transcript, according to the CDS mutation position. The DNA sequence is then converted into the corresponding amino acid sequence using the base-amino acid relation table, and the amino acid sequences before and after the mutation are compared to locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation result. Further, the CDS regions in the gene structure annotation information are traversed to calculate the genomic position and sequence change, thereby mapping out the genomic DNA level mutation.

For a genomic DNA level mutation result, the position of the corresponding CDS coding region mutation is calculated according to the CDS region description of the gene in the gene structure annotation file. The DNA sequence of this CDS is then extracted according to the region annotation and converted into the corresponding amino acid sequence, finally yielding the corresponding protein-level change.

The mapping result file contains the mutation correspondences among genomic DNA, CDS coding region (RNA) and protein level. Users can unify the mutation results at whichever level they need and apply them to the next step of their research.

A specific example implementing the above method is as follows:

Risk mutation sites for common hereditary diseases, manually mined from PubMed articles, are shown in Fig. 2. Using a patient's whole-exome sequencing results together with bioinformatics mutation (single-base mutation and small insertion/deletion) mining tools and pipelines, the corresponding mutation annotation results can be obtained; these are generally genomic DNA level mutations. Literature researchers, however, commonly describe mutations as CDS coding region mutations and protein-level mutations. Therefore, to apply the literature-mining results here, the collected mutations first need to be mapped to genomic DNA level mutations.

According to the above method, the resulting mapping file contains the mutation correspondences among genomic DNA, CDS coding region (RNA) and protein level, as shown in Fig. 3.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A DNA and protein level mutation analysis method, characterized by comprising the following steps:

1) read the gene mutation file and format its entries into standard names;

2) index the transcript sequences, gene information and gene transcript annotation information, and build an amino acid-codon correspondence table;

3) determine the level at which the mutation occurs and the mutation pattern, that is, determine whether the mutation name denotes a protein-level mutation, a genomic DNA level mutation or a CDS coding region mutation;

4) according to the result of step 3), enter the mutation mapping process for the corresponding level, and obtain the mapping relationships among the three kinds of mutation names.

2. The DNA and protein level mutation analysis method according to claim 1, characterized in that the formatting into standard names in step 1) is performed as follows:

102) if the file contains gene names, proceed to step 2);

3. The DNA and protein level mutation analysis method according to claim 1, characterized in that the step of building the amino acid-codon correspondence table in step 2) is:

4. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process when the mutation name denotes a protein-level mutation is:

401) after the protein-level mutation result is read in, calculate the position of the mutation in the corresponding CDS coding region from the position of the amino acid mutation;

402) the previous step enumerates all possible CDS coding region mutations; remove those whose reference base does not match the reference sequence at that position, and obtain the CDS coding region mutation positions after filtering;

403) according to the position of the CDS mutation, use the transcript structure annotation information to find the site of the mutation on the genome and the sequence change.

5. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process when the mutation name denotes a genomic DNA level mutation is:

411) for a genomic DNA level mutation result, calculate the position of the corresponding CDS coding region mutation according to the CDS region description of the gene in the gene structure annotation file;

412) extract the DNA sequence of this CDS according to the region annotation and convert it into the corresponding amino acid sequence, finally obtaining the corresponding protein-level change.

6. The DNA and protein level mutation analysis method according to claim 1, characterized in that, in step 4), the mapping process when the mutation name denotes a CDS coding region mutation is:

421) given the position of the CDS mutation and the base change, compute the DNA sequence corresponding to this CDS region from the index of the mRNA sequence file of the transcript, according to the CDS mutation position;

422) convert the DNA sequence into the corresponding amino acid sequence using the base-amino acid relation table, compare the amino acid sequences before and after the mutation, and locate the position of the amino acid change and the change itself, thereby mapping out the protein-level mutation result;

423) traverse the CDS regions in the gene structure annotation information and calculate the genomic position and base change, thereby mapping out the genomic DNA level mutation.