CN111798926B - Pathogenic gene locus database and establishment method thereof - Google Patents

Publication number
CN111798926B
Authority
CN
China
Prior art keywords
mutation
pathogenic
site
data
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010612454.XA
Other languages
Chinese (zh)
Other versions
CN111798926A (en
Inventor
刘晶星
于世辉
喻长顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd
Original Assignee
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G16B20/50: Mutagenesis

Abstract

The invention relates to a pathogenic gene locus database and a method for establishing it, and belongs to the technical field of disease gene detection. The method comprises the following steps: acquiring clinically verified pathogenic gene locus data as reference data; acquiring the loci in the reference data that are pathogenic because of an amino acid change, and expanding the codons of the amino acids at those loci; acquiring the loci in the reference data that are pathogenic because of a splice site change, and expanding the other mutation forms at those loci; and screening the expanded data, removing sites whose mutation frequency is higher than a preset threshold, and combining the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites with the reference data to form the pathogenic gene locus database. The database holds a large number of records of sites with a high risk of pathogenicity, which reduces the chance of missed findings and greatly improves the accuracy and efficiency of clinical interpretation.

Description

Pathogenic gene locus database and establishment method thereof
Technical Field
The invention relates to the technical field of disease gene detection, in particular to a pathogenic gene locus database and an establishment method thereof.
Background
Gene mutations are either polymorphic or pathogenic. Each person's genome carries about 4 million mutations, most of which are normal, non-pathogenic sites, that is, polymorphic sites. Confirming that a site is pathogenic requires a complex verification workflow, so the set of known pathogenic sites accumulates only slowly.
At present there are several databases that record pathogenic sites, such as HGMD and ClinVar. However, all of them record mutations that have actually been observed, that is, mutations supported by real sample cases and verified against clinical symptoms. The sites recorded in these databases are therefore the most common ones.
In practice, uncommon sites do not appear in these databases because it is difficult to collect enough samples to study their pathogenicity. Yet because the relationship between gene mutations and disease symptoms is both diverse (different mutations of the same gene may cause different symptoms) and heterogeneous (one symptom may be caused by mutations in several different genes), the proportion of pathogenic sites found so far is very low, and the significance of many mutations is unknown. Although each of these sites is individually rare, their total number is large.
These data of unverified significance are very valuable as hints when detecting pathogenic gene mutations. If gene detection relies only on the common sites recorded in a database, many meaningful sites are overlooked. The effect is especially large for genes that cause disease through complex heterozygosity, which greatly increases the difficulty of detection and reduces diagnostic efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a pathogenic gene locus database that mines unverified high-risk loci for later use. When detected mutation sites are analyzed, the database raises the risk weight of these loci so that an analyst can find them more easily, which reduces the difficulty of detection and improves diagnostic efficiency.
A method for establishing a pathogenic gene locus database comprises the following steps:
acquiring reference data: acquiring clinically verified pathogenic gene locus data as reference data;
expanding to obtain mutation site data: acquiring the loci in the reference data that are pathogenic because of an amino acid change, expanding the codons of the amino acids at those loci, analyzing them against preset mutation conditions to obtain high-risk pathogenic mutation site data, and collecting the data for later use;
expanding to obtain splice site data: acquiring the loci in the reference data that are pathogenic because of a splice site change, expanding the other mutation forms at those loci to obtain high-risk pathogenic splice site data, and collecting the data for later use;
screening the expanded sites: screening the high-risk pathogenic mutation site data and high-risk pathogenic splice site data obtained above, removing sites whose mutation frequency is higher than a preset threshold, and combining the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites with the reference data to form the pathogenic gene locus database.
The inventors found in practice that the sites recorded in the various pathogenic site databases are all sites that have occurred in real samples and been verified. A large number of sites related to these recorded sites also carry a high pathogenic risk. Although these high-risk sites have not been verified, the above method mines them for later use, which reduces the difficulty of detection and improves diagnostic efficiency.
It will be appreciated that the step of expanding to obtain mutation site data concerns sites that are pathogenic because of an amino acid change. Its core is the single-base substitution: a substitution changes the codon, which changes the amino acid, which in turn causes disease. The preset mutation conditions are therefore classified according to the codon of the amino acid and the possible outcomes of a single-base substitution. Because a codon consists of 3 bases and each base can be replaced by any of the 3 other bases, a codon has at most 9 single-base substitution forms. Each form is mapped to its amino acid (or stop codon), and the pathogenic risk of the site is evaluated from the result.
In one embodiment, the reference data come from the HGMD database and/or the ClinVar database. It will be appreciated that the source of the reference data is not limited; it need only be as authoritative a database as possible.
In one embodiment, in the step of expanding to obtain mutation site data, the preset mutation conditions comprise the following three classes:
class I mutation: the amino acid encoded by the mutated codon is the same as the mutated amino acid in the reference data;
class II mutation: the mutated codon is a stop codon;
class III mutation: the amino acid encoded by the mutated codon differs from that in the reference data, and the mutated codon is not a stop codon.
It will be appreciated that class III covers the missense mutations that fall outside class I and class II.
In one embodiment, a mutation that satisfies both class I and class II is assigned to class I. Class I and class II coincide when the mutation recorded in the original database is itself a termination mutation, so an expanded mutation that is also a termination mutation is preferentially assigned to class I. Class II therefore refers specifically to an expanded termination mutation at a site whose recorded mutation is not a termination mutation, and the pathogenic risk of class II is lower than that of class I.
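The classification rules above can be sketched as a small function. This is a minimal illustration, not the inventors' implementation: the function name `classify_expanded`, the one-letter amino acid codes, and the use of `*` for a stop codon are assumptions made here. The sketch also assumes that a substitution leaving the amino acid unchanged (a synonymous mutation) is assigned to no class, which is how the worked examples in Example 1 treat synonymous mutations.

```python
STOP = "*"  # symbol chosen here for a stop codon (an assumption of this sketch)


def classify_expanded(wild_aa, recorded_aa, new_aa):
    """Classify one expanded single-base substitution.

    wild_aa     -- amino acid encoded by the unmutated codon
    recorded_aa -- amino acid (or STOP) produced by the recorded pathogenic mutation
    new_aa      -- amino acid (or STOP) produced by the expanded substitution
    Returns "I", "II", "III", or None for a synonymous substitution.
    """
    if new_aa == wild_aa:
        return None   # synonymous: no amino acid change, so no class assigned
    if new_aa == recorded_aa:
        return "I"    # tested before the stop check, so I takes priority over II
    if new_aa == STOP:
        return "II"   # new stop codon at a site whose recorded mutation is not a stop
    return "III"      # any other missense change


# Leu -> Phe is recorded; a different substitution that also gives Phe is class I.
print(classify_expanded("L", "F", "F"))  # I
# Ser -> stop is recorded; another substitution giving a stop is class I, not class II.
print(classify_expanded("S", STOP, STOP))  # I
```

Testing the class I condition before the stop-codon condition is what implements the stated priority of class I over class II.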
In one embodiment, in the step of expanding to obtain splice site data, the expansion specifically mutates the mutation site in the reference data to a nucleotide different from the one recorded in the reference data, that is, a class IS mutation.
In one embodiment, in the step of screening the expanded sites, the preset threshold is 5%. The inventors screened and tuned this value on their organization's large sample data and found that a 5% threshold works well: it surfaces as many potentially high-risk sites as possible while avoiding the loss of risk-warning value that comes from including too many meaningless mutations.
In one embodiment, in the step of screening the expanded sites, sites with no definite population mutation frequency, and sites whose population mutation frequency is below the preset threshold, are further filtered as follows:
1) Search the local sample library for samples carrying the site. If the number of samples is smaller than a preset number of samples, retain the site as a high-risk pathogenic site. If the number of samples is greater than or equal to the preset number, treat them as samples to be confirmed and go to the next step.
2) Acquire the clinical information of the samples to be confirmed. If the clinical information of more than a preset proportion of the samples is related to the function of the gene in which the site lies, retain the site as a high-risk pathogenic site; otherwise, reject the site.
It will be appreciated that, because biological polymorphism exists, adding every mutation related to a verified mutation to the database as a high-risk pathogenic site would dilute the meaning of the risk warnings. The initially screened site data should therefore be filtered so that only high-risk sites remain, which increases the value of the resulting pathogenic gene locus database.
In one embodiment, the preset number of samples is 10 and the preset proportion is 1/3. The inventors screened and tuned these values on their organization's large sample data and found that a database built with these parameters performs well.
The invention also discloses a pathogenic gene locus database obtained by the establishment method of the pathogenic gene locus database.
The invention also discloses an automatic pathogenic gene analysis system, which comprises:
a data acquisition module for acquiring gene detection data of a sample to be tested;
a data analysis module for comparing the gene detection data, after bioinformatics analysis, against the pathogenic gene locus database to obtain information on class I, class II, class III and/or class IS mutations in the sample to be tested;
a judgment and output module for outputting the site mutation information by risk level, where the risk levels from high to low are: class I mutation, class IS mutation, class II mutation, class III mutation.
Compared with the prior art, the invention has the following beneficial effects:
The method for establishing a pathogenic gene locus database provided by the invention enriches the pathogenic gene locus data by expanding amino acid change mutations and splice site mutations, then removes and screens the expanded sites, and finally obtains a pathogenic gene locus database that is rich in high-risk pathogenic sites and has good practical value. This makes it easier for an analyst to find other sites of pathogenic risk that are related to verified pathogenic sites, which reduces the difficulty of detection and improves diagnostic efficiency.
The pathogenic gene locus database of the invention holds a large number of records of sites with a high pathogenic risk. Matching detected gene sites against these records quickly locates high-risk pathogenic sites, which reduces the chance of missed findings and greatly improves the accuracy and efficiency of clinical interpretation.
The pathogenic gene locus database can be used in an automatic pathogenic gene analysis system. Because the automated analysis workflow identifies all mutation sites that may carry a pathogenic risk, less depends on the experience of the analyst during bioinformatics analysis, which reduces the difficulty of detection and analysis and improves diagnostic efficiency.
Drawings
FIG. 1 shows the amino acid codon table.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The data used in the examples below were collected from samples routinely submitted to the company for testing.
Example 1
A database of pathogenic loci is created by the following method:
1. Acquire reference data.
Clinically verified pathogenic gene locus data are obtained as reference data.
For example, the clinically verified pathogenic gene locus data may be obtained from databases that record pathogenic loci, such as HGMD and ClinVar. This embodiment expands on the HGMD database.
2. Expand to obtain mutation site data.
Amino acid changes caused by base mutations are the most numerous mutation type recorded in pathogenic site databases, and the base mutations that cause them vary. Some sites cause disease only when mutated to a specific amino acid, some only when mutated to a stop codon, and some whenever the amino acid changes at all. However, the databases record only sites for which research results have been published. This embodiment therefore performs the following expansion:
The loci in the reference data that are pathogenic because of an amino acid change are acquired, the codons of the amino acids at those loci are expanded, and the preset mutation conditions are analyzed. With reference to the amino acid codon table shown in FIG. 1, all base mutations within one codon are divided into three classes according to the resulting amino acid change:
I: the amino acid change after mutation is the same as that in the database;
II: the codon after mutation is a stop codon;
III: the amino acid change after mutation differs from that in the database and the codon is not a stop codon (that is, the missense mutations other than the first two classes).
When both I and II are satisfied, the mutation is preferentially assigned to class I.
Through this process, high-risk pathogenic mutation site data are obtained by expansion and collected for later use.
The DMD gene in the HGMD database is used below as a specific example:
Example one:
Amino acid 268 of the DMD gene is Leu, with codon TTA, and the database holds only one record for it, [c.804A>C; Leu268Phe]. According to the codon table, 9 single-base mutations can occur at this codon:
1) 2 termination mutations: [c.803T>A; Leu268Term], [c.803T>G; Leu268Term];
2) 5 missense mutations: [c.804A>C; Leu268Phe], [c.804A>T; Leu268Phe], [c.803T>C; Leu268Ser], [c.802T>A; Leu268Ile], [c.802T>G; Leu268Val];
3) 2 synonymous mutations: [c.804A>G; Leu268Leu], [c.802T>C; Leu268Leu].
Comparison with the pathogenic mutation reference record in HGMD, [c.804A>C; Leu268Phe], gives the following: [c.804A>T; Leu268Phe] is expanded as class I; [c.803T>A; Leu268Term] and [c.803T>G; Leu268Term] are expanded as class II; [c.803T>C; Leu268Ser], [c.802T>A; Leu268Ile] and [c.802T>G; Leu268Val] are expanded as class III.
Example two:
Amino acid 333 of the DMD gene is Ser, with codon TCA, and the database holds only one record for it, [c.998C>A, p.Ser333Term]. According to the amino acid codon table, the 9 single-base mutations at this codon are:
1) 2 termination mutations: [c.998C>A, p.Ser333Term], [c.998C>G, p.Ser333Term];
2) 4 missense mutations: [c.998C>T, p.Ser333Leu], [c.997T>C, p.Ser333Pro], [c.997T>A, p.Ser333Thr], [c.997T>G, p.Ser333Ala];
3) 3 synonymous mutations: [c.999A>T, p.Ser333Ser], [c.999A>C, p.Ser333Ser], [c.999A>G, p.Ser333Ser].
Comparison with the pathogenic mutation reference record in HGMD, [c.998C>A, p.Ser333Term], shows that [c.998C>G, p.Ser333Term] could be expanded as either class I or class II. Here class I and class II coincide because the mutation recorded in the original database is a termination mutation, so an expanded mutation that is also a termination mutation is preferentially assigned to class I. Thus [c.998C>G, p.Ser333Term] is expanded as class I, and the 4 missense mutations are expanded as class III.
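The enumeration in the two examples above can be reproduced with a short script. The sketch below is illustrative only: `expand_codon`, `count_kinds` and their arguments are hypothetical names, the codon table is the standard genetic code written out compactly (the patent refers to FIG. 1 instead), one-letter amino acid codes are used, and `*` stands for a stop codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expand_codon(codon, first_pos):
    """Return all 9 single-base substitutions of `codon`.

    first_pos is the coding-sequence position of the first base of the codon.
    Each result is (label, new_amino_acid, kind), where kind is one of
    "synonymous", "stop" or "missense".
    """
    wild_aa = CODON_TABLE[codon]
    results = []
    for i, ref in enumerate(codon):
        for alt in BASES:
            if alt == ref:
                continue
            new_aa = CODON_TABLE[codon[:i] + alt + codon[i + 1:]]
            if new_aa == wild_aa:
                kind = "synonymous"
            elif new_aa == "*":
                kind = "stop"
            else:
                kind = "missense"
            results.append((f"c.{first_pos + i}{ref}>{alt}", new_aa, kind))
    return results


def count_kinds(variants):
    """Count how many substitutions fall into each kind."""
    return Counter(kind for _, _, kind in variants)


leu268 = expand_codon("TTA", 802)  # DMD Leu268, bases c.802 to c.804
ser333 = expand_codon("TCA", 997)  # DMD Ser333, bases c.997 to c.999
print(count_kinds(leu268))
print(count_kinds(ser333))
```

For codon TTA starting at c.802 this gives 2 stop, 5 missense and 2 synonymous substitutions, and for TCA starting at c.997 it gives 2, 4 and 3, in agreement with the counts listed in examples one and two.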
3. Expand to obtain splice site data.
Some base positions at the boundaries between introns and exons are critical for intron splicing, and some introns have other critical splice sites in addition to the 4 bases at the general intron positions ±1 and ±2. Likewise, because existing databases record only sites for which research results have been published, this embodiment uses the coordinates of these critical splice sites to expand the other mutation forms not recorded in the database as class IS, as follows.
The loci in the reference data that are pathogenic because of a splice site change are acquired, and the other mutation forms at those loci are expanded to obtain high-risk pathogenic splice site data, which are collected for later use.
The DMD gene in the HGMD database is used as an example: [c.265-463A>G] is a pathogenic splice site of the DMD gene recorded in the database, and the other mutation forms at the same coordinate, [c.265-463A>C], [c.265-463A>T] and [c.265-463delA], are expanded as class IS mutations.
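The expansion in this example can be sketched in a few lines. This is only an illustration under stated assumptions: `expand_splice_site` is a hypothetical helper, and it covers exactly the mutation forms named in the example above (the two other substitutions and a deletion of the reference base), not any further forms.

```python
def expand_splice_site(position, ref, recorded_alt):
    """List the unrecorded mutation forms at a recorded pathogenic splice site.

    position     -- coordinate string such as "c.265-463"
    ref          -- reference base at that coordinate
    recorded_alt -- the base already recorded as pathogenic in the database
    """
    # Substitutions to every base other than the reference and the recorded one.
    forms = [
        f"{position}{ref}>{alt}"
        for alt in "ACGT"
        if alt not in (ref, recorded_alt)
    ]
    # Deletion of the reference base at the same coordinate.
    forms.append(f"{position}del{ref}")
    return forms


print(expand_splice_site("c.265-463", "A", "G"))
# ['c.265-463A>C', 'c.265-463A>T', 'c.265-463delA']
```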
Through this process, high-risk pathogenic splice site data are obtained by expansion and collected for later use.
4. Screen the expanded sites.
The high-risk pathogenic mutation site data and high-risk pathogenic splice site data obtained above are screened, sites whose mutation frequency is higher than a preset threshold are removed, and the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites are combined with the reference data to form the pathogenic gene locus database.
Specifically, after the expanded sites are annotated with population frequency, polymorphic sites (mainly class IS and class III) are excluded with a threshold of 5%.
For example, the point mutation c.293-13C>G (population frequency 0.2%) in an intron of the CYP21A2 gene affects splicing and causes very severe 21-hydroxylase deficiency. c.293-13C>A, by contrast, is a common polymorphic site with a population frequency as high as 63%, and c.293-13C>T has no official population frequency statistics.
At this position, base C is the human genome reference base, G is the pathogenic base recorded in the HGMD database, and A and T are the expanded bases. Because A is a common polymorphism in the population, its pathogenicity can be ruled out and it is removed, leaving only T as an expanded pathogenic risk site.
For sites with no official population frequency statistics, and sites with a population frequency below 5%, screening uses the clinical information in the local sample library compiled by the inventors, as follows:
1) Search the local sample library for samples carrying the site. If there are fewer than 10 samples, retain the site; otherwise go to the next step.
2) Obtain the clinical information of the samples. If the clinical information of more than 1/3 of the samples is related to the gene in which the site lies, retain the site.
After this removal and screening, the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites are combined with the reference data obtained from the HGMD database in step 1 to form the pathogenic gene locus database of this embodiment.
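The screening rules of this step can be summarized as a single decision function. The sketch below is a hedged illustration rather than the inventors' code: the function and argument names are hypothetical, and the treatment of a frequency of exactly 5% (not removed, since the text removes only frequencies higher than the threshold) is an assumption. The sample counts in the usage lines are invented for illustration.

```python
POP_FREQ_THRESHOLD = 0.05  # 5% population frequency threshold from the text
MIN_LOCAL_SAMPLES = 10     # preset number of samples from the text


def keep_site(pop_freq, n_local_samples, n_related_samples):
    """Decide whether an expanded site stays in the database.

    pop_freq          -- population frequency, or None if no statistics exist
    n_local_samples   -- samples in the local library that carry the site
    n_related_samples -- of those, samples whose clinical information is
                         related to the gene in which the site lies
    """
    # Remove common polymorphisms first.
    if pop_freq is not None and pop_freq > POP_FREQ_THRESHOLD:
        return False
    # Rare in the local library: retain as a high-risk site.
    if n_local_samples < MIN_LOCAL_SAMPLES:
        return True
    # Otherwise retain only if more than 1/3 of the carriers are clinically
    # related; integer arithmetic avoids floating-point comparison issues.
    return 3 * n_related_samples > n_local_samples


print(keep_site(0.63, 0, 0))    # False: like c.293-13C>A, a common polymorphism
print(keep_site(None, 0, 0))    # True: no frequency data and no local carriers
print(keep_site(0.002, 12, 4))  # False: exactly 1/3 related is not "more than 1/3"
```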
5. Risk analysis of the expanded sites.
The pathogenicity of a termination mutation depends on the case: termination mutations in different genes and at different exon positions may carry different pathogenic risks (termination mutations in certain exons of some genes are benign), and this holds even more strongly for missense mutations.
The pathogenicity of a class I site is relatively definite because its amino acid change is the same as the one in the database. A class IS site is a base change at the same coordinate as a splice site recorded in the database, so its pathogenicity is also relatively definite. The pathogenic risks of class II and class III sites decrease in that order. In practical applications, sites can be judged by the risk levels of these classes, which from high to low are: class I mutation, class IS mutation, class II mutation, class III mutation.
6. Comparison of the number of recorded sites.
The number of mutation sites recorded in the expanded pathogenic gene locus database was compared with the number recorded in the HGMD database. The figures are shown in the table below.
TABLE 1 Comparison of site counts
Category                        Count
Mutations recorded in HGMD      257,152
Class I sites                   25,717
Class IS sites                  72,426
Class II sites                  35,262
Class III sites                 823,915
As the table shows, the expanded higher-risk class I, IS and II sites together amount to more than 50% of the total number of sites recorded in HGMD, and the lower-risk class III sites amount to more than 3 times that total, so the high-risk sites in the database are greatly enriched.
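The two proportions stated above follow directly from the figures in Table 1. The short check below only restates that arithmetic; the variable names are chosen here for illustration.

```python
hgmd_recorded = 257_152
expanded = {"I": 25_717, "IS": 72_426, "II": 35_262, "III": 823_915}

# Higher-risk expanded classes (I, IS, II) relative to the HGMD total.
higher_risk = expanded["I"] + expanded["IS"] + expanded["II"]
ratio_higher = higher_risk / hgmd_recorded

# Lower-risk class III relative to the HGMD total.
ratio_class_iii = expanded["III"] / hgmd_recorded

print(higher_risk)                 # 133405
print(round(ratio_higher, 3))      # 0.519, i.e. more than 50%
print(round(ratio_class_iii, 2))   # 3.2, i.e. more than 3 times
```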
Example 2
An automated pathogenic gene analysis system comprising:
a data acquisition module for acquiring gene detection data of a sample to be tested;
a data analysis module for comparing the gene detection data, after bioinformatics analysis, against the pathogenic gene locus database established in Example 1 to obtain information on class I, class II, class III and/or class IS mutations in the sample to be tested;
a judgment and output module for outputting the site mutation information by risk level, where the risk levels from high to low are: class I mutation, class IS mutation, class II mutation, class III mutation.
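The analysis and output modules can be pictured as a lookup followed by a sort on risk level. The following is a minimal sketch under stated assumptions, not the patented system: the function `report` and the dictionary layout are hypothetical, the database entries are taken from the DMD examples of Example 1, and `c.100A>G` is an invented variant used only to show an unmatched input.

```python
RISK_ORDER = ["I", "IS", "II", "III"]  # risk levels from high to low, per the text


def report(detected_variants, database):
    """Match detected variants against the database and sort hits by risk.

    database maps a variant label to its class ("I", "IS", "II" or "III").
    Variants absent from the database are left out of the report.
    """
    rank = {cls: i for i, cls in enumerate(RISK_ORDER)}
    hits = [(v, database[v]) for v in detected_variants if v in database]
    return sorted(hits, key=lambda hit: rank[hit[1]])


# Toy database built from the expanded DMD sites in Example 1.
toy_database = {
    "c.804A>T": "I",
    "c.265-463A>C": "IS",
    "c.803T>A": "II",
    "c.803T>C": "III",
}

detected = ["c.803T>C", "c.100A>G", "c.265-463A>C", "c.804A>T"]
for variant, cls in report(detected, toy_database):
    print(cls, variant)
# I c.804A>T
# IS c.265-463A>C
# III c.803T>C
```

A plain dictionary keeps the sketch short; a real system would also need gene and genome-coordinate keys, since the same coding position label can occur in different genes.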
Example 3
Sample data from testing of genes related to hereditary hearing loss were analyzed with the automatic pathogenic gene analysis system of Example 2.
The conventional workflow detected the mutation chr7:107350577A>G, which is found in the original HGMD database as a risk site of the SLC26A4 gene, [c.2168A>G, p.His723Arg]. The literature reports this as the most common mutation in East Asian patients with deafness and enlarged vestibular aqueduct. However, SLC26A4 is inherited recessively, so two pathogenic sites are required to cause disease.
A search of the expanded HGMD database from Example 1 found that the patient also carries the mutation chr7:107323982G>A, which is likewise a site of high pathogenic risk, as described below.
As shown in Table 2, in the original HGMD database both G>C and G>T at chr7:107323982 are marked as pathogenic sites (DM marker). For G>T, the pathogenic mechanism is disputed: one view holds that it causes disease through an amino acid change (marked with the suffix M), and the other that it causes disease through a splicing change (marked with the suffix S). G>T at chr7:107323981, which lies in the same codon as chr7:107323982, is also marked as a pathogenic site.
Although the original HGMD database does not contain the patient's chr7:107323982G>A mutation, analysis with the pathogenic gene locus database established in Example 1 shows that chr7:107323982G>A was expanded in four places, once as class IS and three times as class III.
TABLE 2 Analysis of pathogenic gene loci
Note: in the classification level, raw denotes a pathogenic site recorded in the original database, and raw_db denotes the information of the corresponding recorded site in the original database (HGMD). Sites 1, 8, 9 and 10 are sites recorded in the original HGMD database, and the other sites are expanded sites. Sites 11, 14, 15 and 16 are carried by the patient in this case and were expanded from sites 10, 1, 8 and 9, respectively. The codon at Gly334 of the gene is GGG, from which the amino acid mutation type of each expanded site can be derived.
This case shows that if only the original HGMD database is used, the chr7:107323982G>A mutation is easily missed in the patient's results, because G>A at this position is not a common mutation and is not recorded in the various databases (including 1000 Genomes, dbSNP, HGMD and ClinVar). With the pathogenic gene locus database established in Example 1, the site is readily found and its higher pathogenic risk is made clear, which reduces the difficulty of detection and improves diagnostic efficiency.
Example 4
Other cases were analyzed with the automatic pathogenic gene analysis system of Example 2, following the method of Example 3, to evaluate the pathogenic gene locus database obtained in Example 1. The results are as follows.
Further examples of cases that would otherwise have been missed:
1. Sample with experiment No. NP15D3999.
This sample was analyzed by the method of Example 3, and the results are shown in Tables 3 and 4 below.
TABLE 3 Pathogenic gene locus expansion
Note: the codon at Thr1513 of the gene is ACT, from which the amino acid mutation type of each expanded site can be derived.
TABLE 4 Analysis of pathogenic gene loci of sample NP15D3999
2. Sample with experiment No. NP19S2603.
This sample was analyzed by the method of Example 3, and the results are shown in Tables 5 and 6.
TABLE 5 Pathogenic gene locus expansion
Note: the mutation site lies in an intron and does not encode an amino acid, but it affects splicing and is expanded as a class IS mutation site.
TABLE 6 Analysis of pathogenic gene loci of sample NP19S2603
3. Sample with experiment No. TP18D664.
This sample was analyzed by the method of Example 3, and the results are shown in Tables 7 and 8 below.
TABLE 7 Pathogenic gene locus expansion
Note: the codon at Gly662 of the gene is GGG, from which the amino acid mutation type of each expanded site can be derived.
TABLE 8 Analysis of pathogenic gene loci of sample TP18D664
4. Sample with experiment No. NP20S768.
This sample was analyzed by the method of Example 3, and the results are shown in Tables 9 and 10.
TABLE 9 Pathogenic gene locus expansion
Note: the codon at Phe306 of the gene is TTC, from which the amino acid mutation type of each expanded site can be derived.
TABLE 10 Analysis of pathogenic gene loci of sample NP20S768
In summary, in conventional practice the interpretation of gene detection results depends heavily on the pathogenic sites recorded in existing databases, which represent decades of research on genetic data. In actual interpretation work, however, there are still many suspicious mutation sites that the databases do not record, and the time constraints of clinical testing make basic research on these sites impractical.
With the method of the invention, expanding the database yields a large number of records of sites with a high pathogenic risk. Matching detected gene sites against these records quickly locates high-risk pathogenic sites, which reduces the chance of missed findings and greatly improves the accuracy and efficiency of clinical interpretation.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.
The above examples describe only a few embodiments of the invention in specific detail and are not to be construed as limiting the scope of the invention. Those skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the invention. Accordingly, the scope of protection of the invention is determined by the appended claims.

Claims (9)

1. A method for establishing a pathogenic gene locus database, characterized by comprising the following steps:
acquiring reference data: acquiring clinically verified pathogenic gene locus data as reference data;
obtaining mutation site data by expansion: acquiring the pathogenic gene loci in the reference data that are caused by amino acid changes, expanding the codons of the amino acids at these loci, analyzing them according to preset mutation generation conditions to obtain high-risk pathogenic mutation site data, and compiling the data for later use;
obtaining splice site data by expansion: acquiring the pathogenic gene loci in the reference data that are caused by splice site changes, expanding these loci to their other mutation forms to obtain high-risk pathogenic splice site data, and compiling the data for later use;
expanded site screening: screening the obtained high-risk pathogenic mutation site data and high-risk pathogenic splice site data, removing sites whose mutation occurrence frequency is higher than a preset threshold, and combining the remaining high-risk pathogenic mutation sites and high-risk pathogenic splice sites with the reference data to form the pathogenic gene locus database.
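The final screening and merging step of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: sites are represented as hypothetical `(chrom, pos, ref, alt)` tuples, `population_freq` is a hypothetical lookup of population allele frequencies, the default threshold of 0.05 is borrowed from claim 6, and sites with no known frequency are simply kept.

```python
def build_database(reference_sites, expanded_sites, population_freq, threshold=0.05):
    """Merge reference sites with expanded sites that pass the frequency screen.

    Sites are hashable keys such as (chrom, pos, ref, alt).
    Assumption: sites with no known population frequency are kept.
    """
    database = {site: "reference" for site in reference_sites}
    for site in expanded_sites:
        freq = population_freq.get(site)
        if freq is not None and freq > threshold:
            continue  # too common in the population to be treated as high risk
        database.setdefault(site, "expanded")
    return database


# Hypothetical coordinates and frequencies, for illustration only.
reference = [("chr1", 100, "G", "A")]
expanded = [("chr1", 100, "G", "C"), ("chr1", 100, "G", "T"), ("chr1", 101, "G", "A")]
freqs = {("chr1", 100, "G", "C"): 0.12, ("chr1", 100, "G", "T"): 0.001}

db = build_database(reference, expanded, freqs)
for site, origin in db.items():
    print(site, origin)
```

Using `setdefault` keeps the "reference" label when an expanded site coincides with a site already present in the reference data.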
2. The method of claim 1, wherein the reference data is derived from the HGMD database and/or the ClinVar database.
3. The method of claim 1, wherein in the step of obtaining mutation site data by expansion, the preset mutation generation conditions include the following three classes:
class I mutation: the amino acid corresponding to the mutated codon is consistent with the reference data;
class II mutation: the mutated codon is a stop codon;
class III mutation: the amino acid corresponding to the mutated codon is inconsistent with the reference data, and the mutated codon is not a stop codon.
4. The method for establishing a pathogenic gene locus database according to claim 3, wherein a mutation that satisfies both the class I and the class II conditions is determined to be a class I mutation.
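The three classes of claim 3 and the precedence rule of claim 4 can be expressed as a small classifier. The sketch below is illustrative only: the function name `classify` and its parameters are hypothetical, and returning `None` for a codon that still encodes the wild-type amino acid is an assumption, since the claims do not say how such synonymous changes are handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(mutated_codon, wild_type_aa, recorded_aa):
    """Classify an expanded codon as class "I", "II" or "III".

    recorded_aa is the amino acid (or "*" for a stop) produced by the
    pathogenic mutation recorded in the reference data.
    """
    aa = CODON_TABLE[mutated_codon]
    if aa == recorded_aa:
        return "I"   # checked first, so class I takes precedence over class II
    if aa == "*":
        return "II"
    if aa == wild_type_aa:
        return None  # assumption: synonymous changes are left unclassified
    return "III"


# Hypothetical records, for illustration only.
print(classify("AGG", "G", "R"))  # I: same amino acid change as the record
print(classify("GAG", "G", "R"))  # III: a different amino acid, not a stop
print(classify("TGA", "W", "R"))  # II: stop codon, record is a missense change
print(classify("TAG", "W", "*"))  # I: stop codon matching a recorded nonsense change
```

The last call shows the situation addressed by claim 4: when the recorded pathogenic change is itself a stop codon, a different stop codon meets both conditions and is reported as class I.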
5. The method of claim 3, wherein in the step of obtaining splice site data by expansion, the expansion of the splice site specifically means mutating the mutation site in the reference data to a nucleotide different from that recorded in the reference data, i.e., a class Is mutation.
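A minimal sketch of the class Is expansion in claim 5 is given below. The function name `expand_splice_site` is hypothetical, and excluding the wild-type base from the result is an assumption of this sketch (a change back to the wild-type base would not be a mutation); the claim itself only requires a nucleotide different from the one recorded in the reference data.

```python
def expand_splice_site(wild_type_base, recorded_alt):
    """Return the alternative bases that form class Is mutations.

    Assumption: the wild-type base is excluded along with the recorded one.
    """
    return [b for b in "ACGT" if b not in (wild_type_base, recorded_alt)]


# Hypothetical record: a splice site G>A change recorded as pathogenic.
print(expand_splice_site("G", "A"))  # ['C', 'T']
```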
6. The method of claim 1, wherein the preset threshold in the expanded site screening step is 5%.
7. The method according to claim 1, wherein in the expanded site screening step, sites with no definite mutation occurrence frequency in the population, and sites retained by the screening because their mutation occurrence frequency in the population is lower than the preset threshold, are further filtered as follows:
1) Searching a local sample library for samples carrying the site; if the number of such samples is smaller than a preset number of samples, retaining the site as a high-risk pathogenic site; if the number of such samples is greater than or equal to the preset number of samples, designating them as samples to be confirmed and proceeding to the next step;
2) Acquiring the clinical information corresponding to the samples to be confirmed; if the clinical information of more than a preset proportion of the samples is related to the gene function of the site, retaining the site as a high-risk pathogenic site; if the clinical information of no more than the preset proportion of the samples is related to the gene function of the site, rejecting the site.
8. The method for establishing a pathogenic gene locus database according to claim 7, wherein the preset number of samples is 10 and the preset proportion is 1/3.
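The two-stage filter of claims 7 and 8 reduces to a short decision function. The sketch below is a hedged illustration: `keep_site` and its input, a list of booleans saying whether each local sample's clinical information relates to the gene function of the site, are hypothetical; only the defaults of 10 samples and 1/3 come from claim 8.

```python
from fractions import Fraction


def keep_site(related_flags, min_samples=10, min_ratio=Fraction(1, 3)):
    """Decide whether a site stays in the database as high risk.

    related_flags holds one boolean per local sample carrying the site:
    True if that sample's clinical information relates to the gene function.
    """
    n = len(related_flags)
    if n == 0 or n < min_samples:
        return True  # too few local samples to question the site
    # Exact rational comparison avoids floating-point issues at exactly 1/3.
    return Fraction(sum(related_flags), n) > min_ratio


print(keep_site([False] * 9))                # True: fewer than 10 samples
print(keep_site([True] * 4 + [False] * 8))   # False: 4/12 does not exceed 1/3
print(keep_site([True] * 5 + [False] * 7))   # True: 5/12 exceeds 1/3
```

The boundary case follows the claim wording: a proportion equal to the preset value leads to rejection, so the comparison is strict.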
9. An automated pathogenic gene analysis system, comprising:
a data acquisition module, used for acquiring gene detection data of a sample to be tested;
a data analysis module, used for comparing the gene detection data, after bioinformatics analysis, against the pathogenic gene locus database of claim 5 to obtain information on class I, class II, class III and/or class Is mutations in the sample to be tested;
a judging and output module, used for outputting the site mutation information according to risk level, wherein the risk levels from high to low are: class I mutations, class Is mutations, class II mutations, class III mutations.
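The comparison and risk-ordered output of claim 9 might look like the sketch below. It assumes, purely for illustration, that the database maps hypothetical `(chrom, pos, ref, alt)` site keys to a class label; the names `RISK_ORDER` and `report` are not from the patent.

```python
# Risk levels from high to low, as listed in claim 9.
RISK_ORDER = ["I", "Is", "II", "III"]


def report(sample_variants, database):
    """Match sample variants against the database and sort hits by risk."""
    hits = [(site, database[site]) for site in sample_variants if site in database]
    return sorted(hits, key=lambda hit: RISK_ORDER.index(hit[1]))


# Hypothetical database and sample, for illustration only.
database = {
    ("chr2", 500, "G", "T"): "III",
    ("chr2", 640, "C", "T"): "II",
    ("chr7", 120, "G", "C"): "Is",
    ("chr9", 300, "A", "G"): "I",
}
sample = [
    ("chr2", 500, "G", "T"),
    ("chr5", 10, "A", "C"),   # not in the database, so not reported
    ("chr7", 120, "G", "C"),
    ("chr9", 300, "A", "G"),
]

for site, mutation_class in report(sample, database):
    print(mutation_class, site)
```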
CN202010612454.XA 2020-06-30 2020-06-30 Pathogenic gene locus database and establishment method thereof Active CN111798926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010612454.XA CN111798926B (en) 2020-06-30 2020-06-30 Pathogenic gene locus database and establishment method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010612454.XA CN111798926B (en) 2020-06-30 2020-06-30 Pathogenic gene locus database and establishment method thereof

Publications (2)

Publication Number Publication Date
CN111798926A CN111798926A (en) 2020-10-20
CN111798926B CN111798926B (en) 2023-09-29

Family

ID=72811445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010612454.XA Active CN111798926B (en) 2020-06-30 2020-06-30 Pathogenic gene locus database and establishment method thereof

Country Status (1)

Country Link
CN (1) CN111798926B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599210B (en) * 2020-12-16 2022-04-12 首都医科大学附属北京同仁医院 Data management method and device, electronic equipment and storage medium

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102566931A (en) * 2011-12-31 2012-07-11 奇智软件(北京)有限公司 Method and device for displaying suspended window
CN104955961A (en) * 2012-12-11 2015-09-30 塞勒密斯株式会社 Method for synthesizing gene library using codon combination and mutagenesis
CN105045468A (en) * 2015-07-23 2015-11-11 深圳市万普拉斯科技有限公司 Processing method and apparatus for suspended notice in mobile terminal
CN105930043A (en) * 2016-04-15 2016-09-07 苏州佳世达电通有限公司 Message display method and electronic device
KR101693504B1 (en) * 2015-12-28 2017-01-17 (주)신테카바이오 Discovery system for disease cause by genetic variants using individual whole genome sequencing data
CN107229841A (en) * 2017-05-24 2017-10-03 重庆金域医学检验所有限公司 A kind of genetic mutation appraisal procedure and system
CN107247890A (en) * 2017-06-30 2017-10-13 张巍 A kind of gene data system for clinical diagnosis and prediction
CN107345248A (en) * 2017-06-26 2017-11-14 思畅信息科技(上海)有限公司 Gene and site methods of risk assessment and its system based on big data
CN108710782A (en) * 2018-05-16 2018-10-26 为朔医学数据科技(北京)有限公司 Genotype conversion method, device and electronic equipment
CN108920898A (en) * 2018-07-27 2018-11-30 中国科学院水生生物研究所 A kind of method of quick analysis eukaryotic protein genomics data
CN108920901A (en) * 2018-07-24 2018-11-30 中国医学科学院北京协和医院 A kind of sequencing data mutation analysis system
CN109243534A (en) * 2018-08-31 2019-01-18 郑州金域临床检验中心有限公司 Analytical equipment, equipment and the storage medium of mutated gene based on NGS
CN109686439A (en) * 2018-12-04 2019-04-26 东莞博奥木华基因科技有限公司 Data analysing method, system and the storage medium of hereditary disease genetic test
CN109725947A (en) * 2017-10-30 2019-05-07 华为技术有限公司 A kind of processing method and terminal of unread message
CN109766037A (en) * 2018-12-27 2019-05-17 维沃移动通信有限公司 Reminding method and terminal device
CN109920481A (en) * 2019-01-31 2019-06-21 北京诺禾致源科技股份有限公司 The genetic mutation unscrambling data library BRCA1/2 and its construction method
CN110379458A (en) * 2019-07-15 2019-10-25 中国人民解放军陆军军医大学第一附属医院 Pathogenicity variation site determination method, device, computer equipment and storage medium
CN110544537A (en) * 2019-07-29 2019-12-06 北京荣之联科技股份有限公司 Generation method of single-gene genetic disease gene analysis report and electronic equipment thereof
CN111139291A (en) * 2020-01-14 2020-05-12 首都医科大学附属北京安贞医院 High-throughput sequencing analysis method for monogenic hereditary diseases
WO2020097660A1 (en) * 2018-11-15 2020-05-22 The University Of Sydney Methods of identifying genetic variants

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102566931A (en) * 2011-12-31 2012-07-11 奇智软件(北京)有限公司 Method and device for displaying suspended window
CN104955961A (en) * 2012-12-11 2015-09-30 塞勒密斯株式会社 Method for synthesizing gene library using codon combination and mutagenesis
CN105045468A (en) * 2015-07-23 2015-11-11 深圳市万普拉斯科技有限公司 Processing method and apparatus for suspended notice in mobile terminal
KR101693504B1 (en) * 2015-12-28 2017-01-17 (주)신테카바이오 Discovery system for disease cause by genetic variants using individual whole genome sequencing data
CN105930043A (en) * 2016-04-15 2016-09-07 苏州佳世达电通有限公司 Message display method and electronic device
CN107229841A (en) * 2017-05-24 2017-10-03 重庆金域医学检验所有限公司 A kind of genetic mutation appraisal procedure and system
CN107345248A (en) * 2017-06-26 2017-11-14 思畅信息科技(上海)有限公司 Gene and site methods of risk assessment and its system based on big data
CN107247890A (en) * 2017-06-30 2017-10-13 张巍 A kind of gene data system for clinical diagnosis and prediction
CN109725947A (en) * 2017-10-30 2019-05-07 华为技术有限公司 A kind of processing method and terminal of unread message
CN108710782A (en) * 2018-05-16 2018-10-26 为朔医学数据科技(北京)有限公司 Genotype conversion method, device and electronic equipment
CN108920901A (en) * 2018-07-24 2018-11-30 中国医学科学院北京协和医院 A kind of sequencing data mutation analysis system
CN108920898A (en) * 2018-07-27 2018-11-30 中国科学院水生生物研究所 A kind of method of quick analysis eukaryotic protein genomics data
CN109243534A (en) * 2018-08-31 2019-01-18 郑州金域临床检验中心有限公司 Analytical equipment, equipment and the storage medium of mutated gene based on NGS
WO2020097660A1 (en) * 2018-11-15 2020-05-22 The University Of Sydney Methods of identifying genetic variants
CN109686439A (en) * 2018-12-04 2019-04-26 东莞博奥木华基因科技有限公司 Data analysing method, system and the storage medium of hereditary disease genetic test
CN109766037A (en) * 2018-12-27 2019-05-17 维沃移动通信有限公司 Reminding method and terminal device
CN109920481A (en) * 2019-01-31 2019-06-21 北京诺禾致源科技股份有限公司 The genetic mutation unscrambling data library BRCA1/2 and its construction method
CN110379458A (en) * 2019-07-15 2019-10-25 中国人民解放军陆军军医大学第一附属医院 Pathogenicity variation site determination method, device, computer equipment and storage medium
CN110544537A (en) * 2019-07-29 2019-12-06 北京荣之联科技股份有限公司 Generation method of single-gene genetic disease gene analysis report and electronic equipment thereof
CN111139291A (en) * 2020-01-14 2020-05-12 首都医科大学附属北京安贞医院 High-throughput sequencing analysis method for monogenic hereditary diseases

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
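Screening and verification of pathogenic mutation sites of the ABCA4 gene in three Stargardt families; Wang You (王悠) et al.; Journal of Zhengzhou University (Medical Sciences); Vol. 53, No. 2; 217-221 *
A phenotype-specific framework for identifying the eye abnormalities causative nonsynonymous-variants; Han-Kui Liu et al.; bioRxiv preprint; 1-20 *
MPD: a pathogen genome and metagenome database; Tingting Zhang et al.; Database; Vol. 100, No. 39; 1-16 *
Prediction of deleterious synonymous mutations in the human genome; Yao Yao (姚瑶); China Master's Theses Full-text Database, Basic Sciences; 2018, No. 10; A006-41 *
Research on identifying gene splice sites based on sequential pattern mining; Sun Yongshan (孙永山); China Master's Theses, Basic Sciences; 2016, No. 09; A006-113 *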

Also Published As

Publication number Publication date
CN111798926A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN107423578B (en) Device for detecting somatic cell mutation
CN103667438B (en) Method for screening HRDs disease-causing mutation and gene chip hybridization probe designing method involved in same
CN109033749A (en) A kind of Tumor mutations load testing method, device and storage medium
KR102648634B1 (en) Systems and methods for exploiting relatedness in genomic data analysis
CN104462869A (en) Method and device for detecting somatic cell SNP
CN107368708B (en) A kind of method and system of precisely analysis DMD gene structures variation breakpoint
CN111091868B (en) Method and system for analyzing chromosome aneuploidy
CN109346130A (en) A method of directly micro- haplotype and its parting are obtained from full-length genome weight sequencing data
CN112662767B (en) Kit and probe for measuring genomic instability and application of kit and probe
CN114049914B (en) Method and device for integrally detecting CNV, uniparental disomy, triploid and ROH
CN109994154A (en) A kind of screening plant of single-gene recessive genetic disorder candidate disease causing genes
CN113724791B (en) CYP21A2 gene NGS data analysis method, device and application
CN110021346B (en) Gene fusion and mutation detection method and system based on RNAseq data
CN111091869A (en) Genetic relationship identification method using SNP as genetic marker
CN111798926B (en) Pathogenic gene locus database and establishment method thereof
CN112669906A (en) Detection method, device, terminal device and computer-readable storage medium for measuring genome instability
CN111139291A (en) High-throughput sequencing analysis method for monogenic hereditary diseases
CN111223525A (en) Tumor exon sequencing data analysis method
CN107862177B (en) Construction method of single nucleotide polymorphism molecular marker set for distinguishing carp populations
CN112201306A (en) True and false gene mutation analysis method based on high-throughput sequencing and application
WO2023191262A1 (en) Method for predicting cancer recurrence using patient-specific panel
CN109712671B (en) Gene detection device based on ctDNA, storage medium and computer system
CN114530200B (en) Mixed sample identification method based on calculation of SNP entropy
CN116564406A (en) Automatic analysis method and equipment for genetic variation
Gu et al. A suite of automated sequence analyses reduces the number of candidate deleterious variants and reveals a difference between probands and unaffected siblings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 10, Helix 3 Road, International Biological Island, Huangpu District, Guangzhou City, Guangdong Province, 510320

Applicant after: GUANGZHOU KINGMED CENTER FOR CLINICAL LABORATORY

Applicant after: GUANGZHOU KINGMED DIAGNOSTICS GROUP Co.,Ltd.

Address before: 510335 3rd floor, 2429 Xingang East Road, Haizhu District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU KINGMED CENTER FOR CLINICAL LABORATORY

Applicant before: GUANGZHOU KINGMED DIAGNOSTICS GROUP Co.,Ltd.

GR01 Patent grant