CN108165564B

CN108165564B - Mycobacterium tuberculosis H37Rv encoding gene and application thereof

Info

Publication number: CN108165564B
Application number: CN201711251274.8A
Authority: CN
Inventors: 徐平; 张瑶; 王富强; 孙金帅; 武舒佳; 常蕾
Original assignee: BEIJING PROTEOME RESEARCH CENTER
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2021-06-08
Anticipated expiration: 2037-12-01
Also published as: CN108165564A

Abstract

The invention relates to a Mycobacterium tuberculosis H37Rv coding gene that can serve as a standard gene for molecular identification of the Mycobacterium tuberculosis complex and can be used for molecular identification and clinical detection of the complex.

Description

Mycobacterium tuberculosis H37Rv encoding gene and application thereof

Technical Field

The invention relates to the field of gene detection, in particular to identification of pathogenic bacteria species.

Background

Mycobacterium tuberculosis (MTB) is the pathogenic bacterium that causes tuberculosis in humans. It can invade any organ of the body, but pulmonary tuberculosis is the most common form. Tuberculosis remains an extremely important infectious disease and seriously threatens human life and health. According to the WHO, about 8 million new cases occur each year and at least 3 million people die from the disease. Clinical MTB strains are difficult to culture, slow-growing and prone to cross-infection with other mycobacteria, and the symptoms of tuberculosis are hard to distinguish from those of other respiratory tract infections, all of which greatly hinder rapid clinical diagnosis and treatment. Establishing a rapid, accurate, specific, sensitive and inexpensive tuberculosis detection method is therefore a prerequisite for effectively treating tuberculosis and controlling its spread, and is a new challenge and task for mycobacterial detection in clinical laboratories.

The Mycobacterium tuberculosis complex (MTBC) comprises M. tuberculosis, M. africanum, M. orygis, M. bovis, M. microti, M. canettii, M. caprae, M. pinnipedii, M. suricattae and M. mungi, all of which cause tuberculosis in humans and other organisms. Current MTBC identification methods, in China and abroad, fall into three main categories: traditional isolation and culture; molecular-level detection (IS6110, restriction fragment length polymorphism analysis, multilocus variable-number tandem-repeat analysis, etc.); and chromatographic analysis of bacterial cell components (fatty acids, mycolic acids). Each has its advantages but also its drawbacks: isolation and culture take a long time and the culturable rate is low; current molecular-level detection is lacking in specificity, sensitivity and simplicity; and analysis of cell component characteristics is costly and complex to operate.

The whole genome of MTB H37Rv was sequenced in 1998, making it the first MTB strain to be fully sequenced. Since then, researchers in many countries have been refining and supplementing the H37Rv gene annotation databases through algorithm optimization, updated annotation software, transcriptomics and proteomics. However, because MTB is a prokaryote, annotation errors (over-annotation, gene boundary errors, errors in ORF start and stop sites, alternative splicing, ribosome translocation, missed annotation) may still remain in the genome annotation owing to the inherent shortcomings of prokaryotic genome annotation techniques, which hampers deep and accurate analysis of biological mechanisms. Proteomics has been used to correct the annotated genes of H37Rv, but a high proportion of false positives, the difficulty of predicting annotated genes, and the verification, functional analysis and application of new genes remain open problems in the field.

In general, traditional Mycobacterium tuberculosis complex (MTBC) identification strategies suffer from long turnaround times, tedious steps, and low specificity and sensitivity. To further improve the re-annotation of the H37Rv whole genome, to find the missed-annotation genes in H37Rv, and to effectively protect these genes and their application in MTBC molecular identification, it is imperative to develop a method that uses new H37Rv genes to identify the MTBC rapidly and accurately.

Disclosure of Invention

An object of the present invention is to provide a new coding gene of Mycobacterium tuberculosis H37Rv, namely the H37Rv missed-annotation coding gene Rv3108c (- |3476972-3477175|), which can be used as a barcode molecular marker for detecting the Mycobacterium tuberculosis complex and whose sequence is shown in SEQ ID NO. 1.

Other objects of the present invention include providing specific PCR primers for amplifying the above coding gene and providing a method of detecting or identifying the presence of the Mycobacterium tuberculosis complex in a sample; the invention also provides a detection kit related to the coding gene and applications of the gene.

According to one aspect of the invention, comparative proteomics techniques were used to discover a protein-coding sequence of H37Rv that is difficult to find with gene prediction software and that effectively distinguishes MTBC from other species of the same genus. The gene is a missed-annotation gene of Mycobacterium tuberculosis H37Rv, namely Rv3108c (- |3476972-3477175|). Comparative genomics studies show that the gene sequence can distinguish Mycobacterium tuberculosis complex (MTBC) strains from other species of Mycobacterium.

Specifically, primers capable of specifically amplifying the Rv3108c (- |3476972-3477175|) gene of MTBC were designed; these are the primers provided by the invention, with the following sequences:

F: 5'-GACCAGTGCCCTCGCAGT-3';

R: 5'-AGGACGATCATGGCTCCG-3'.

MTBC can be identified rapidly and accurately from the presence or absence of the PCR product of this gene in the sample to be tested, or from differences in its DNA sequence.

According to another aspect of the present invention, based on the above new standard coding gene of Mycobacterium tuberculosis H37Rv, the invention establishes a method for detecting or identifying the Mycobacterium tuberculosis complex, comprising the following steps:

(1) isolating and extracting genomic DNA from the sample to be tested;

(2) performing PCR amplification using the DNA obtained in step (1) as the template and the following primers:

F: 5'-GACCAGTGCCCTCGCAGT-3' (SEQ ID NO.4);

R: 5'-AGGACGATCATGGCTCCG-3' (SEQ ID NO.5);

(3) performing gel electrophoresis analysis or sequencing on the DNA product amplified in step (2);

(4) comparing the result of step (3) with the barcode gene Rv3108c (- |3476972-3477175|); if the homology is more than 99%, the sample to be tested is judged to contain the Mycobacterium tuberculosis complex.

Further, following the DNA barcode principle, the PCR product is first analyzed by electrophoresis: if the strain to be tested shows no target band, it is not MTBC. If the band is present, sequencing can be carried out for further verification: the sequence obtained is aligned against the standard sequence of Rv3108c (- |3476972-3477175|) of H37Rv to obtain the similarity between the two, and if the sequence homology is more than 99%, the strain can be judged to be MTBC. The MTBC family is distinguished from nontuberculous mycobacteria, common respiratory pathogenic bacteria and common respiratory viruses according to how the DNA barcode sequence of the strain to be identified clusters with the standard sequence.
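The "more than 99% homology" decision can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequenced amplicon has already been aligned to the reference at equal length without gaps; the text does not name an alignment tool or define how homology is computed, so the ungapped percent identity used here and the names `count_matches` and `is_mtbc` are assumptions for illustration only. The toy sequences are hypothetical and are not real reads.

```python
def count_matches(query: str, reference: str) -> int:
    """Count identical positions between two pre-aligned, equal-length sequences."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(q == r for q, r in zip(query.upper(), reference.upper()))


def is_mtbc(query: str, reference: str, threshold_percent: int = 99) -> bool:
    """Return True when identity is strictly greater than the threshold.

    Integer arithmetic avoids floating-point edge cases exactly at 99%.
    """
    matches = count_matches(query, reference)
    return matches * 100 > threshold_percent * len(reference)


# Toy 200-nt reference (hypothetical, not SEQ ID NO.1).
reference = "ACGT" * 50
one_mismatch = "T" + reference[1:]      # 199 of 200 positions identical
two_mismatches = "TT" + reference[2:]   # 198 of 200 positions identical

print(is_mtbc(reference, reference))       # identical read
print(is_mtbc(one_mismatch, reference))    # 99.5%, above the threshold
print(is_mtbc(two_mismatches, reference))  # exactly 99%, not "more than 99%"
```

In practice the comparison would be made on a proper pairwise alignment (with gaps) of the sequencing read against SEQ ID NO.1; the strict inequality mirrors the wording "more than 99%".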

The detection method can be used for strain identification research on the Mycobacterium tuberculosis complex and for rapid clinical testing. The sample to be tested can be the H37Rv strain, other MTBC strains, nontuberculous mycobacteria, common respiratory pathogenic bacteria or common respiratory virus strains; sputum, saliva or blood from patients with tuberculosis or other respiratory diseases can also be used directly.

Based on the above method, the present invention also provides a detection kit. The kit contains, in a container, reagents for detecting the new standard coding gene of Mycobacterium tuberculosis H37Rv, together with information on the manufacture, use and sale of the drug or biological product as approved by a government drug administration. For example, the reagents for directly detecting the Rv3108c (- |3476972-3477175|) gene in a sample after PCR amplification may comprise one or more of: amplification primers, dNTPs, a DNA polymerase for the PCR reaction and its buffer, and the reagents required for restriction digestion and/or sequencing reactions. Those skilled in the art will understand that these components are merely illustrative; for example, the primers may be the specific PCR primers described above, and the DNA polymerase may be any enzyme usable for PCR amplification. Detection of the coding gene of the present invention can also be provided in an integrated form, e.g., a gene chip.

Advantageous effects: the invention provides a standard gene and a molecular identification method for the Mycobacterium tuberculosis complex (MTBC). The gene effectively distinguishes MTBC from other species of the same genus. The identification method using the gene overcomes shortcomings of existing MTBC identification procedures, such as the multiplicity of primer designs and poor reproducibility of results; it is universal, easy to amplify and easy to compare; it can accurately distinguish this group from closely related mycobacteria and from other respiratory pathogens; and it provides a powerful technical means and research tool for epidemiological investigation and for the rapid diagnosis and identification of clinical tuberculosis patients.

Drawings

FIG. 1: evidence of peptide profile matching supporting the discovery of new coding genes;

FIG. 2: comparing the mass spectrogram of the synthesized peptide fragment with the mass spectrogram of the original identified peptide fragment;

FIG. 3: a corresponding diagram of a protein sequence coded by ORF of the peptide fragment locus region; the underlined part is the peptide identified in proteomics and verified by the synthetic peptide;

FIG. 4: comparing the homology of the Rv3108c (- |3476972-3477175|) standard gene sequence;

FIG. 5: the result of BLASTP of a protein sequence corresponding to the Rv3108c (- |3476972-3477175|) gene of the H37Rv strain;

FIG. 6: the result of agarose gel electrophoresis of the PCR amplification product of the Rv3108c (- |3476972-3477175|) specific primer;

wherein, the specific information of each lane sample is shown in Table 1;

FIG. 7: the PCR amplification sequencing result of the Rv3108c (- |3476972-3477175|) gene is compared with a standard sequence.

Detailed Description

The invention is further described with reference to specific embodiments, but the scope of the claims is not limited thereto. The reagents used in the present invention are all commercially available.

Example 1: Search for missed-annotation coding genes in the genome of strain H37Rv

1.1 High-coverage proteomic validation of the genome of the H37Rv strain

A deep-coverage proteome study of the H37Rv strain was carried out using high-coverage proteomics. Annotated coding genes in its genome were validated with the pFind 3 search engine against the Tuberculosis (20160307) database. To find new protein-coding regions, we used the proteomics-based pAnno software to build a six-reading-frame translation database from the H37Rv whole-genome file (NC_000962.3) published at NCBI, and searched the mass spectrometry data against this database to identify new peptides and new proteins. To reduce the false positive rate, during data filtering we used three filtering methods, S-FDR, T-FDR I and T-FDR II, to estimate class-specific FDRs separately for annotated and new peptides.

Data analysis identified a total of 3238 annotated H37Rv genes, a coverage of more than 80% for the strain, which is the largest mass spectrometry dataset of H37Rv proteins reported so far. In addition, new peptides were obtained after filtering with all three FDRs ≤ 1%. To further ensure their quality, the spectra of the new peptides remaining after filtering were screened for spectral quality, and only peptides with good spectra were retained. To rule out the possibility that these peptides arose from single amino acid mutations in annotated peptides, we performed an amino acid mutation check to ensure that they were newly identified peptides of H37Rv.
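The reason for estimating FDR separately for annotated and new peptides can be illustrated with a generic target-decoy estimate. This is only a sketch: the definitions of S-FDR, T-FDR I and T-FDR II are not given in this text and are not reproduced here; the `PSM` record, the score scale and the counts below are hypothetical, and the estimator (decoy hits divided by target hits above a score cut-off) is the common textbook form, not necessarily what pFind or pAnno compute.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    score: float     # match score, higher is better (assumed convention)
    is_decoy: bool   # True if matched to a decoy sequence
    is_novel: bool   # True if the peptide maps only to the six-frame database


def fdr_at(psms, threshold):
    """Target-decoy FDR estimate: decoys / targets among PSMs at or above threshold."""
    kept = [p for p in psms if p.score >= threshold]
    targets = sum(1 for p in kept if not p.is_decoy)
    decoys = sum(1 for p in kept if p.is_decoy)
    return decoys / targets if targets else 0.0


def class_specific_fdr(psms, threshold):
    """Estimate FDR separately for annotated and novel peptide matches."""
    annotated = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return {"annotated": fdr_at(annotated, threshold),
            "novel": fdr_at(novel, threshold)}


# Hypothetical counts: many annotated matches, few novel ones.
psms = ([PSM(10.0, False, False)] * 100 + [PSM(10.0, True, False)]
        + [PSM(10.0, False, True)] * 4 + [PSM(10.0, True, True)])

print("pooled:", fdr_at(psms, 5.0))
print("by class:", class_specific_fdr(psms, 5.0))
```

With these toy numbers the pooled estimate looks low while the novel class alone has a much higher error rate, which is why a pooled cut-off can let false new peptides through.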

1.2 Verification of the encoded protein of the Rv3108c (- |3476972-3477175|) gene and database analysis

After high-coverage proteome validation, we found a number of suspected new missed-annotation peptides and verified the high-confidence ones by peptide synthesis. A similarity score of ≥ 0.8 between the original spectrum and the spectrum of the synthetic peptide was used as the threshold. After scoring and screening, several peptides passed verification; each corresponds to a new open reading frame (ORF), i.e., a potential missed-annotation gene of the current H37Rv strain.

Among them, BLASTP comparison showed that the product of the new missed-annotation gene Rv3108c (- |3476972-3477175|) has 99% similarity to M. tuberculosis 1825K, A70645 and M. canettii CIPT140070010 and less than 76% similarity to other strains, and that it is a protein of unknown function. We detected the peptide ATSALAVIR (SEQ ID NO.6), which maps to the new gene Rv3108c (- |3476972-3477175|). As shown in FIG. 1, the spectrum quality was good, the b/y ions were matched continuously and the noise peak signal was low, so the result is highly reliable.

To further confirm this identification, we chemically synthesized the peptide according to the amino acid sequence of our newly identified peptide and generated a secondary spectrum of the synthesized peptide using the mass spectrometry conditions described above.

We verified the synthetic peptide by higher-energy collisional dissociation MS2: both the precursor (parent) ion and the fragment (daughter) ions agreed with the theoretical values, confirming that the sequence of the synthetic peptide was correct. On this basis, we manually compared the MS2 spectrum of the synthetic peptide with the spectrum of the new peptide identified from the large-scale proteomic data. The two were almost identical, with a cosine similarity of 0.98 between the fragment ions, which confirms that the new peptide we identified from H37Rv is correct (FIG. 2).
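A cosine similarity between two fragment-ion spectra, of the kind compared against the 0.8 threshold, can be computed as sketched below. The text does not describe how peaks were matched or normalized, so representing each spectrum as a mapping from fragment-ion label to intensity is an assumption, and the intensity values are invented purely for illustration (they are not the measured spectra of ATSALAVIR).

```python
import math


def cosine_similarity(spec_a: dict, spec_b: dict) -> float:
    """Cosine of the angle between two spectra given as {ion label: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[ion] * spec_b[ion] for ion in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical relative intensities for a handful of b/y ions.
original = {"b2": 20.0, "y3": 55.0, "y4": 100.0, "y5": 80.0, "y6": 40.0, "y7": 30.0}
synthetic = {"b2": 25.0, "y3": 50.0, "y4": 100.0, "y5": 85.0, "y6": 35.0, "y7": 28.0}

score = cosine_similarity(original, synthetic)
print(f"cosine = {score:.3f}; passes 0.8 threshold: {score >= 0.8}")
```

Ions present in only one spectrum contribute to that spectrum's norm but not to the dot product, so unmatched peaks lower the score.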

After the sequence of the missed-annotation peptide was confirmed, the open reading frame (ORF) DNA sequence containing the new peptide was obtained from its genomic position, bounded by the preceding stop codon and the following stop codon, as shown in SEQ ID NO. 2.

TAGTCAGCTGGCATCCTGAAGGGCATGCCAGGCAAGGAAATCGATCGAGTCCGGGCGACCAGTGCCCTCGCAGTGATTAGGCAGCACCCGGTAATGGTGTTCTTCGCGCTGTCGCCGGTACTCGCCGCATTGGGTGTCATGTGGTGGCTAGCCGGTGCTGGATGGGCTATCGTCGCGGCCCTGGTGCTGGTGGTCGTCGGCGGAGCCATGATCGTCCTCAAACGCTGA(SEQ ID NO.2)

The correspondence between the open reading frame code and the amino acid sequence is shown in FIG. 3.

Further translation analysis showed that the authentic gene sequence (SEQ ID NO.1) starts at the ATG within the above open reading frame DNA (SEQ ID NO.2); it is 204 bp long and encodes 67 amino acids with a theoretical molecular weight of 7.10 kDa. This is the gene Rv3108c (- |3476972-3477175|).

ATGCCAGGCAAGGAAATCGATCGAGTCCGGGCGACCAGTGCCCTCGCAGTGATTAGGCAGCACCCGGTAATGGTGTTCTTCGCGCTGTCGCCGGTACTCGCCGCATTGGGTGTCATGTGGTGGCTAGCCGGTGCTGGATGGGCTATCGTCGCGGCCCTGGTGCTGGTGGTCGTCGGCGGAGCCATGATCGTCCTCAAACGCTGA(SEQ ID NO.1)

The theoretical coding product amino acid sequence of the gene is shown as SEQ ID NO. 3:

MPGKEIDRVRATSALAVIRQHPVMVFFALSPVLAALGVMWWLAGAGWAIVAALVLVVVGGAMIVLKR(SEQ ID NO.3)
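The relationship between SEQ ID NO.2, SEQ ID NO.1 and SEQ ID NO.3 can be checked with a short script: locate the first in-frame ATG after the upstream stop codon, translate from there, and sum average residue masses. This is a sketch for cross-checking only; the function names are illustrative, the codon table is the standard one (the bacterial translation table uses the same amino acid assignments), and the residue masses are commonly tabulated average values rather than figures from this document.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

ORF_DNA = (  # SEQ ID NO.2, copied from the sequence listing
    "tagtcagctg gcatcctgaa gggcatgcca ggcaaggaaa tcgatcgagt ccgggcgacc "
    "agtgccctcg cagtgattag gcagcacccg gtaatggtgt tcttcgcgct gtcgccggta "
    "ctcgccgcat tgggtgtcat gtggtggcta gccggtgctg gatgggctat cgtcgcggcc "
    "ctggtgctgg tggtcgtcgg cggagccatg atcgtcctca aacgctga"
).replace(" ", "").upper()

# Average residue masses in Da (commonly tabulated values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def first_in_frame_atg(orf: str) -> int:
    """Index of the first ATG in the frame set by the leading stop codon."""
    for i in range(3, len(orf) - 2, 3):
        if orf[i:i + 3] == "ATG":
            return i
    raise ValueError("no in-frame ATG found")


def translate(cds: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


start = first_in_frame_atg(ORF_DNA)
cds = ORF_DNA[start:]            # corresponds to SEQ ID NO.1
protein = translate(cds)         # corresponds to SEQ ID NO.3
mw_kda = (sum(RESIDUE_MASS[r] for r in protein) + WATER) / 1000

print(len(cds), len(protein), round(mw_kda, 2))
print("ATSALAVIR" in protein)
```

The identified peptide ATSALAVIR (SEQ ID NO.6) lies inside the translated product, immediately after an arginine, which is consistent with a tryptic peptide.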

NCBI BLASTP analysis of the theoretical gene product shown in SEQ ID NO.3 showed 99% similarity to M. tuberculosis 1825K, A70645 and M. canettii CIPT140070010 and less than 76% similarity to other strains; it is a protein of unknown function (see FIG. 5). This shows that the product of the Rv3108c (- |3476972-3477175|) gene we detected is a missed annotation in the H37Rv strain database.

We performed a comparative-genomics local BLAST analysis of the DNA sequence of the Rv3108c (- |3476972-3477175|) gene. As shown in FIG. 5, the sequence is specific to the MTBC family and has no highly homologous sequence in other species. The Rv3108c (- |3476972-3477175|) gene sequence found in the H37Rv strain therefore has good sequence specificity and can distinguish MTBC from other mycobacteria of the same genus and from other respiratory pathogens.

Example 2: Establishment of a method for identifying the MTBC complex

(1) Primer design:

Based on the CDS sequence of the Rv3108c (- |3476972-3477175|) gene shown in SEQ ID NO.1, PCR primers with the following sequences were designed using Oligo 7.0:

F: 5'-GACCAGTGCCCTCGCAGT-3' (SEQ ID NO.4);

R: 5'-AGGACGATCATGGCTCCG-3' (SEQ ID NO.5).

The positions of the above primers on the Rv3108c (- |3476972-3477175|) gene are shown below, with the positions corresponding to the primers underlined.

(2) Total DNA was extracted from the strains to be tested, including M. tuberculosis H37Rv. The 40 standard Mycobacterium strains were preserved by the National Center for Medical Culture Collections (CMCC); the other 16 nontuberculous mycobacteria were clinical isolates from the 309th Hospital of the Chinese People's Liberation Army, for which 16S rRNA gene sequencing, sequence comparison and NCBI sequence submission had been completed. The strains to be tested are listed in Table 1:

TABLE 1 related strains selected

(3) The DNA fragment was amplified by polymerase chain reaction (PCR) using the above F/R primers.

PCR system (25 µL): ddH₂O 9.5 µL, 2× Taq PCR MasterMix (TIANGEN) 12.5 µL, primer F (10 µM) 1 µL, primer R (10 µM) 1 µL, DNA template 1 µL.

Amplification program: pre-denaturation at 94 °C for 3 min; 35 cycles of denaturation at 94 °C for 30 s, annealing at 58 °C for 30 s and extension at 72 °C for 1 min; final extension at 72 °C for 5 min.

(4) The amplified products were detected by electrophoresis on an agarose gel in 1× TBE buffer. As shown in FIG. 6, an amplification band appeared at 162 bp in the MTBC and positive control groups, consistent with expectation, and the specificity was 98.3%.

(5) To further verify the amplified DNA, we sequenced the amplification product and compared it with the original sequence. As shown in FIG. 7, it matches the expected sequence perfectly, with no errors, further confirming the existence of the new missed-annotation gene.

This shows that the method for identifying the MTBC complex based on the Rv3108c (- |3476972-3477175|) gene is reliable.
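The expected 162 bp band can be cross-checked by locating the two primers on SEQ ID NO.1 in silico. The sketch below assumes exact primer matches and a linear template; the helper names `reverse_complement` and `amplicon` are illustrative and not part of any tool named in this document.

```python
GENE = (  # SEQ ID NO.1, copied from the sequence listing
    "atgccaggca aggaaatcga tcgagtccgg gcgaccagtg ccctcgcagt gattaggcag "
    "cacccggtaa tggtgttctt cgcgctgtcg ccggtactcg ccgcattggg tgtcatgtgg "
    "tggctagccg gtgctggatg ggctatcgtc gcggccctgg tgctggtggt cgtcggcgga "
    "gccatgatcg tcctcaaacg ctga"
).replace(" ", "").upper()

FORWARD = "GACCAGTGCCCTCGCAGT"   # SEQ ID NO.4
REVERSE = "AGGACGATCATGGCTCCG"   # SEQ ID NO.5


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon(template: str, forward: str, reverse: str):
    """Return the region spanned by the primers, or None if either fails to match."""
    start = template.find(forward)
    site = template.find(reverse_complement(reverse))
    if start == -1 or site == -1 or site < start:
        return None
    return template[start:site + len(reverse)]


product = amplicon(GENE, FORWARD, REVERSE)
print(len(GENE), len(product))
```

The forward primer binds at positions 33 to 50 of the 204 bp gene and the reverse primer at positions 177 to 194 (1-based), which gives a 162 bp product, in agreement with the band size reported in step (4).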

SEQUENCE LISTING

<110> Beijing Proteome Research Center

<120> Mycobacterium tuberculosis H37Rv encoding gene and application thereof

<130> BJ1936-17P121794

<160> 6

<170> PatentIn version 3.3

<210> 1

<211> 204

<212> DNA

<213> Artificial

<220>

<223> Mycobacterium tuberculosis H37Rv encoding gene Rv3108c (- |3476972-3477175|)

<400> 1

atgccaggca aggaaatcga tcgagtccgg gcgaccagtg ccctcgcagt gattaggcag 60

cacccggtaa tggtgttctt cgcgctgtcg ccggtactcg ccgcattggg tgtcatgtgg 120

tggctagccg gtgctggatg ggctatcgtc gcggccctgg tgctggtggt cgtcggcgga 180

gccatgatcg tcctcaaacg ctga 204

<210> 2

<211> 228

<212> DNA

<213> Artificial

<220>

<223> open reading frame DNA sequence comprising peptide fragment with missing annotation

<400> 2

tagtcagctg gcatcctgaa gggcatgcca ggcaaggaaa tcgatcgagt ccgggcgacc 60

agtgccctcg cagtgattag gcagcacccg gtaatggtgt tcttcgcgct gtcgccggta 120

ctcgccgcat tgggtgtcat gtggtggcta gccggtgctg gatgggctat cgtcgcggcc 180

ctggtgctgg tggtcgtcgg cggagccatg atcgtcctca aacgctga 228

<210> 3

<211> 67

<212> PRT

<213> Artificial

<220>

<223> theoretical coding product amino acid sequence of Rv3108c (- |3476972-3477175|) gene

<400> 3

Met Pro Gly Lys Glu Ile Asp Arg Val Arg Ala Thr Ser Ala Leu Ala

1 5 10 15

Val Ile Arg Gln His Pro Val Met Val Phe Phe Ala Leu Ser Pro Val

20 25 30

Leu Ala Ala Leu Gly Val Met Trp Trp Leu Ala Gly Ala Gly Trp Ala

35 40 45

Ile Val Ala Ala Leu Val Leu Val Val Val Gly Gly Ala Met Ile Val

50 55 60

Leu Lys Arg

65

<210> 4

<211> 18

<212> DNA

<213> Artificial

<220>

<223> F primer sequences

<400> 4

gaccagtgcc ctcgcagt 18

<210> 5

<211> 18

<212> DNA

<213> Artificial

<220>

<223> R primer sequences

<400> 5

aggacgatca tggctccg 18

<210> 6

<211> 9

<212> PRT

<213> Artificial

<220>

<223> missed-annotation peptide fragment

<400> 6

Ala Thr Ser Ala Leu Ala Val Ile Arg

1 5

Claims

1. An identification method for distinguishing the Mycobacterium tuberculosis complex strain from other strains of the Mycobacterium genus, which is not used for the diagnosis and treatment of diseases, characterized in that whether the Mycobacterium tuberculosis complex exists in a sample to be detected is determined by detecting whether the gene Rv3108c (- |3476972-3477175|) encoded by the Mycobacterium tuberculosis H37Rv exists in the sample to be detected, and the nucleotide sequence of the gene Rv3108c (- |3476972-3477175|) encoded by the H37Rv is shown as SEQ ID NO. 1.

2. The method as claimed in claim 1, wherein the gene Rv3108c (- |3476972-3477175|) encoded by H37Rv encodes the amino acid sequence shown in SEQ ID NO. 3.

3. The method of claim 1, comprising the steps of:

(1) separating and extracting genome DNA from a sample to be detected;

(2) adding an amplification primer by taking the DNA obtained in the step (1) as a template to perform polymerase chain reaction;

(3) carrying out gel electrophoresis analysis and sequencing on the DNA product obtained by amplification in the step (2);

(4) comparing the result of step (3) with the gene Rv3108c (- |3476972-3477175|) encoded by the H37Rv of claim 1, and determining from the homology whether the Mycobacterium tuberculosis complex is present in the sample to be detected.

4. The method of claim 3, wherein the amplification primer sequence of step (2) is:

F: 5'-GACCAGTGCCCTCGCAGT-3';

R: 5'-AGGACGATCATGGCTCCG-3'.

5. The method according to claim 3, wherein in step (4), if the homology is more than 99%, it is judged that the Mycobacterium tuberculosis complex is present in the sample to be tested.