CN108165562B - Mycobacterium tuberculosis H37Rv encoding gene and application thereof

Info

Publication number: CN108165562B
Application number: CN201711250534.XA
Authority: CN
Inventors: 徐平; 张瑶; 王富强; 孙金帅; 武舒佳; 常蕾
Original assignee: BEIJING PROTEOME RESEARCH CENTER
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2021-06-08
Anticipated expiration: 2037-12-01
Also published as: CN108165562A

Abstract

The invention relates to a Mycobacterium tuberculosis H37Rv coding gene that can serve as a standard gene for molecular identification and clinical detection of the Mycobacterium tuberculosis complex.

Description

Mycobacterium tuberculosis H37Rv encoding gene and application thereof

Technical Field

The invention relates to the field of gene detection, in particular to identification of pathogenic bacteria species.

Background

Mycobacterium tuberculosis (MTB) is the pathogenic bacterium that causes tuberculosis in humans. It can invade any organ of the body, although pulmonary tuberculosis is the most common form. Tuberculosis remains an extremely important infectious disease and seriously threatens human life and health. According to the WHO, about 8 million new cases occur each year and at least 3 million people die from the disease. Clinical MTB strains are difficult to culture and grow slowly, co-infection with other mycobacteria occurs, and the symptoms of tuberculosis are hard to distinguish from those of other respiratory tract infections, all of which greatly complicate rapid clinical diagnosis and treatment. Establishing a rapid, accurate, specific, sensitive and inexpensive tuberculosis detection method is therefore a prerequisite for effectively treating tuberculosis and controlling its spread, and is a new challenge and task for mycobacterial testing in clinical laboratories.

The Mycobacterium tuberculosis complex (MTBC) includes M. tuberculosis, M. africanum, M. orygis, M. bovis, M. microti, M. canettii, M. caprae, M. pinnipedii, M. suricattae and M. mungi, all of which cause tuberculosis in humans or other animals. At present, MTBC identification methods used in China and abroad fall into three main categories: traditional isolation and culture; molecular-level detection (IS6110, restriction fragment length polymorphism analysis, multilocus variable-number tandem-repeat analysis, etc.); and chromatographic analysis of bacterial cell components (fatty acids, mycolic acids). Each of the three has its advantages, but also its disadvantages: isolation and culture takes a long time and has a low culturable rate; current molecular-level detection is poor in specificity, sensitivity and simplicity; and analysis of cell component characteristics is costly and complex to operate.

Whole genome sequencing of MTB H37Rv was completed in 1998, making it the first MTB strain to be fully sequenced. Since then, researchers in many countries have been refining and supplementing the H37Rv gene annotation databases using strategies such as algorithm optimization, annotation software updates, transcriptomics and proteomics. However, because MTB is a prokaryote, and because of the inherent shortcomings of prokaryotic genome annotation technology, annotation errors (over-annotation, gene boundary errors, errors in ORF start and stop sites, alternative splicing, ribosome translocation, missed annotation) may still exist in the genome annotation, which hampers deep and accurate analysis of biological mechanisms. To address this problem, proteomics has been used to correct the annotated genes of H37Rv; however, a high proportion of false positives, the difficulty of predicting annotated genes, the verification of new genes, and the functional analysis and application of new genes remain open problems in the field.

In general, traditional strategies for identifying the Mycobacterium tuberculosis complex (MTBC) suffer from long turnaround times, tedious steps, and low specificity and sensitivity. There is therefore an urgent need to further improve the re-annotation of the H37Rv whole genome, to find the missed-annotation genes in H37Rv, to effectively protect these genes and their application in MTBC molecular identification, and to develop a method for quickly and accurately identifying the MTBC using the new H37Rv genes.

Disclosure of Invention

An object of the present invention is to provide a new coding gene of Mycobacterium tuberculosis H37Rv, namely the H37Rv missed-annotation coding gene Rv2815A (+ |3123619-3123756|), which can be used as a barcode molecular marker for detecting the Mycobacterium tuberculosis complex, and whose sequence is shown in SEQ ID NO.1.

Other objects of the present invention include providing specific PCR primers for amplifying the above coding gene and providing a method of detecting or identifying the presence of the Mycobacterium tuberculosis complex in a sample; the invention also provides a detection kit related to the coding gene, and applications of the gene.

According to one aspect of the invention, comparative proteomics was used to discover a protein-coding sequence of H37Rv that is difficult to find with gene prediction software and that effectively distinguishes MTBC from other species of the same genus. The gene is a missed-annotation gene of Mycobacterium tuberculosis H37Rv, namely Rv2815A (+ |3123619-3123756|). Comparative genomics studies show that the gene sequence can distinguish Mycobacterium tuberculosis complex (MTBC) strains from other species of Mycobacterium.

Specifically, a primer pair capable of specifically amplifying the Rv2815A (+ |3123619-3123756|) gene of MTBC was designed, namely the primers provided by the invention, with the following sequences:

F: 5'-CAGCGTGTGGTAACAATGCC-3';

R: 5'-AGCGATGCTGACGAAGGG-3'.

MTBC can be identified quickly and accurately according to whether a PCR product of this gene's DNA sequence is present in the sample to be tested, or according to differences in that DNA sequence.
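As a rough sanity check on the primer pair above, the following Python sketch computes the length, GC content and a Wallace-rule melting temperature estimate for each primer. The Wallace rule (2 °C per A/T, 4 °C per G/C) is an assumption introduced here for illustration only; the patent does not state how melting temperatures were estimated, and the rule is only a coarse approximation for oligonucleotides of this length. The function name `primer_stats` is hypothetical.

```python
# Primer sequences as given in the text (5'->3').
PRIMER_F = "CAGCGTGTGGTAACAATGCC"
PRIMER_R = "AGCGATGCTGACGAAGGG"


def primer_stats(seq: str) -> dict:
    """Return length, GC percentage and a Wallace-rule Tm estimate.

    The Wallace rule, Tm = 2*(A+T) + 4*(G+C), is an illustrative
    assumption and not a method named in the source text.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return {
        "length": len(seq),
        "gc_percent": round(100.0 * gc / len(seq), 1),
        "tm_wallace_c": 2 * at + 4 * gc,
    }


if __name__ == "__main__":
    for name, seq in (("F", PRIMER_F), ("R", PRIMER_R)):
        print(name, primer_stats(seq))
```

Under these assumptions both primers have 11 G/C bases, and the rough estimates lie in the neighbourhood of the 58 °C annealing temperature used in Example 2.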

According to another aspect of the present invention, based on the above-mentioned new standard encoding gene of Mycobacterium tuberculosis H37Rv, the present invention specifically establishes a method for detecting or identifying Mycobacterium tuberculosis complex, comprising the following steps:

(1) separating and extracting genome DNA from a sample to be detected;

(2) performing PCR amplification using the DNA obtained in step (1) as a template and the following primers:

F: 5'-CAGCGTGTGGTAACAATGCC-3' (SEQ ID NO.4);

R: 5'-AGCGATGCTGACGAAGGG-3' (SEQ ID NO.5);

(3) performing gel electrophoresis analysis or sequencing on the DNA product obtained by amplification in step (2);

(4) comparing the result of step (3) with the barcode gene Rv2815A (+ |3123619-3123756|), and judging from the homology whether the Mycobacterium tuberculosis complex is present in the sample to be detected.

Further, in the detection method, the PCR product is first analyzed by electrophoresis according to the DNA barcode principle. If the strain to be tested shows no target band, it is not MTBC. If the band is present, sequencing verification can be carried out: the sequence obtained is aligned with the standard Rv2815A (+ |3123619-3123756|) sequence of H37Rv to determine the similarity between the two, and if the sequence homology is greater than 99%, the strain can be judged to be MTBC. The MTBC family is then distinguished from nontuberculous mycobacteria, common respiratory pathogenic bacteria and common respiratory viruses according to how the DNA barcode sequence of the strain to be identified clusters with the standard sequence.
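The two-stage decision rule described above (band present, then homology greater than 99%) can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the patent: it assumes an ungapped, position-by-position comparison against SEQ ID NO.1, whereas in practice an alignment tool would be used. The function names and the wording of the returned labels are hypothetical; the text does not say how a band with homology of 99% or less should be reported.

```python
from typing import Optional

# SEQ ID NO.1 (138 bp), copied from the sequence listing; spaces are removed below.
REFERENCE = (
    "atgatgtcaa aagaacacaa actcctctgc gctgacaagc cgtccccttc cgtagaacgt "
    "aactgccgca acacctctta tcttatagat ccggatgttg tcgcagtcga tggcgaagcg "
    "gtcgatacgt gcaactag"
).replace(" ", "").upper()


def percent_identity(query: str, reference: str = REFERENCE) -> float:
    """Ungapped identity over the reference length (a simplifying assumption)."""
    query = query.upper()
    matches = sum(1 for q, r in zip(query, reference) if q == r)
    return 100.0 * matches / len(reference)


def classify(band_present: bool, sequenced: Optional[str] = None,
             threshold: float = 99.0) -> str:
    """Apply the rule: no band means not MTBC; a band needs identity above threshold."""
    if not band_present:
        return "not MTBC"
    if sequenced is None:
        return "band present; sequencing required"
    if percent_identity(sequenced) > threshold:
        return "MTBC"
    return "not confirmed as MTBC"
```

With a 138 bp reference, a single mismatch still exceeds the 99% threshold (137/138), while two mismatches fall below it (136/138).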

The detection method can be used for strain identification research on the Mycobacterium tuberculosis complex and for rapid clinical testing. The sample to be tested may be the H37Rv strain, other MTBC strains, nontuberculous mycobacteria, common respiratory pathogenic bacteria or common respiratory viruses; alternatively, sputum, saliva or blood from patients with tuberculosis or other respiratory diseases may be used directly.

Based on the above method, the present invention also provides a detection kit. The kit contains, in a container, reagents for detecting the novel standard coding gene of Mycobacterium tuberculosis H37Rv, together with information on the manufacture, use and sale of the pharmaceutical or biological product as approved by a government drug regulatory agency. For example, the reagents for directly detecting the Rv2815A (+ |3123619-3123756|) gene in a sample after PCR amplification may comprise one or more of: amplification primers, dNTPs, the DNA polymerase used for the PCR reaction and its buffer, and the reagents required for an enzyme digestion reaction and/or a sequencing reaction. Those skilled in the art will understand that these components are merely illustrative; for example, the primers may be the specific PCR primers described above, and the DNA polymerase may be any enzyme suitable for PCR amplification. Detection of the coding gene of the present invention can also be provided in an integrated form, for example as a gene chip.

Advantageous effects: the invention provides a standard gene and a molecular identification method for the Mycobacterium tuberculosis complex (MTBC). The gene effectively distinguishes MTBC from other species of the same genus. The identification method based on this gene overcomes shortcomings of existing MTBC identification procedures, such as the multiplicity of primer designs and the poor reproducibility of results; it is universal, easy to amplify and easy to compare; it can accurately distinguish this group from closely related mycobacteria and from other respiratory pathogens; and it provides a powerful technical means and research tool for the epidemiological investigation of tuberculosis and for the rapid diagnosis and identification of clinical tuberculosis patients.

Drawings

FIG. 1: peptide-spectrum match evidence supporting the discovery of the new coding gene;

FIG. 2: comparison of the mass spectrum of the synthesized peptide with the mass spectrum of the originally identified peptide;

FIG. 3: correspondence between the ORF of the peptide's genomic region and the protein sequence it encodes; the underlined part is the peptide identified by proteomics and verified with the synthetic peptide;

FIG. 4: Rv2815A (+ |3123619-3123756|) standard gene sequence homology comparison;

FIG. 5: the result of BLASTP of a protein sequence corresponding to the Rv2815A (+ |3123619-3123756|) gene of the H37Rv strain;

FIG. 6: the result of agarose gel electrophoresis of the PCR amplification product of the Rv2815A (+ |3123619-3123756|) specific primer; wherein, the specific information of each lane sample is shown in Table 1;

FIG. 7: the PCR amplification sequencing result of the Rv2815A (+ |3123619-3123756|) gene is compared with a standard sequence.

Detailed Description

The invention is further described with reference to specific embodiments, but the scope of the claims is not limited thereto. The reagents used in the present invention are all commercially available.

Example 1: Search for missed-annotation coding genes in the genome of strain H37Rv

1.1 High-coverage proteomic validation of the genome of the H37Rv strain

A deep-coverage proteome study of the H37Rv strain was performed using high-coverage proteomics. Annotated coding genes in its genome were validated with the pFind 3 engine against the Tuberculosis (20160307) database. To find new protein-coding regions, we used the pAnno software to build a six-reading-frame translation database from the H37Rv whole-genome file (NC_000962.3) published at NCBI, and searched the mass spectrometry data against this database to identify new peptides and new proteins. To reduce the false positive rate, during data filtering we used three filtering methods, S-FDR, T-FDR I and T-FDR II, to estimate class-specific FDR separately for annotated peptides and new peptides.
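To make the six-reading-frame translation step concrete, the following Python sketch translates a DNA sequence in all three frames of both strands. It is a generic illustration and makes no claim about how pAnno is implemented. The standard codon assignments are assumed, and the short 213 bp region from SEQ ID NO.7 of the sequence listing is used as input instead of the whole genome. All function names are hypothetical.

```python
# Standard codon table built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# SEQ ID NO.7 (213 bp), copied from the sequence listing; spaces are removed below.
SEQ7 = (
    "ggacaattcg tccagcgtgt ggtaacaatg cctgctgatg atgtcaaaag aacacaaact "
    "cctctgcgct gacaagccgt ccccttccgt agaacgtaac tgccgcaaca cctcttatct "
    "tatagatccg gatgttgtcg cagtcgatgg cgaagcggtc gatacgtgca actagtttcg "
    "cgagctggcc cttcgtcagc atcgcttcga atg"
).replace(" ", "").upper()


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame(seq: str) -> dict:
    """Translate all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame(SEQ7).items():
        print(name, protein)
```

In a database search, each stop-delimited stretch of every frame would become a candidate protein entry; that segmentation step is omitted here for brevity.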

Through data analysis, a total of 3238 annotated H37Rv genes were identified, a coverage of more than 80% for the strain, which is the largest H37Rv protein mass spectrometry dataset reported to date. In addition, we obtained new peptides after filtering with all three FDRs at ≤ 1%. To further ensure the quality of the new peptides, the spectra corresponding to the peptides remaining after filtering were screened for quality, and only peptides with good spectra were retained. To rule out the possibility that these high-quality spectra arose from single amino acid mutations in annotated peptides, we performed amino acid mutation checks to ensure that the new peptides are newly identified peptides of H37Rv.

1.2 Verification of the protein encoded by the Rv2815A (+ |3123619-3123756|) gene and database analysis

After high-coverage proteome verification, we found several suspected new peptides from missed-annotation regions and verified the most reliable of them by peptide synthesis. A similarity score of at least 0.8 between the original spectrum and the spectrum of the synthesized peptide was used as the threshold. After scoring and screening, several peptides passed verification; each corresponds to a new open reading frame (ORF), that is, a potential missed-annotation gene of the current H37Rv strain.

Among them, we found a new missed-annotation gene, Rv2815A (+ |3123619-3123756|). The peptide LLCADKPSPSVER (SEQ ID NO.6) was detected in proteome data from three different laboratories and corresponds to the new gene Rv2815A (+ |3123619-3123756|). As shown in FIG. 1, the spectrum quality was good, 4 b/y ions were matched consecutively, the noise peak signal was low, and the result was very reliable.

To further confirm this identification, we chemically synthesized the peptide according to the amino acid sequence of our newly identified peptide and generated a secondary spectrum of the synthesized peptide using the mass spectrometry conditions described above.

We verified the synthetic peptide by high-energy collision MS2. The primary parent ions and the secondary daughter ions both agree with the theoretical values, so the sequence of the synthesized peptide is correct. On this basis, we manually compared the MS2 spectrum of the synthetic peptide with the spectrum of the new peptide identified from the large-scale proteomic data. The two are almost completely identical, and the cosine similarity calculated from the daughter ions is 0.98, which confirms that the new peptide we identified from H37Rv is correct (FIG. 2).
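The cosine similarity used to compare the two MS2 spectra can be illustrated with a short Python sketch. The patent does not specify how fragment ions were matched or weighted, so the representation below (a mapping from ion label to intensity, compared over the union of labels) is an assumption, and the intensities in the demonstration are hypothetical values, not data from the study. The 0.8 threshold is the one stated in section 1.2.

```python
import math


def spectrum_cosine(spec_a: dict, spec_b: dict) -> float:
    """Cosine similarity of two spectra given as {ion_label: intensity}.

    Ions missing from one spectrum are treated as intensity 0.
    """
    ions = sorted(set(spec_a) | set(spec_b))
    a = [float(spec_a.get(ion, 0.0)) for ion in ions]
    b = [float(spec_b.get(ion, 0.0)) for ion in ions]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def passes_validation(score: float, threshold: float = 0.8) -> bool:
    """Apply the similarity threshold (score >= 0.8) used for synthetic-peptide checks."""
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical fragment-ion intensities, for illustration only.
    identified = {"b2": 120.0, "y4": 800.0, "y5": 650.0, "y7": 300.0}
    synthetic = {"b2": 100.0, "y4": 820.0, "y5": 600.0, "y7": 340.0, "y9": 40.0}
    score = spectrum_cosine(identified, synthetic)
    print(round(score, 3), passes_validation(score))
```

Because the measure is normalized, it is insensitive to the overall intensity scale of either spectrum, which is why it suits comparing a synthetic peptide run against the original identification.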

After the sequence of the missed-annotation peptide was confirmed, the open reading frame (ORF) DNA sequence containing it was obtained from its genomic position, bounded by the preceding stop codon and the following stop codon, as shown in SEQ ID NO.2.

TAACAATGCCTGCTGATGATGTCAAAAGAACACAAACTCCTCTGCGCTGACAAGCCGTCCCCTTCCGTAGAACGTAACTGCCGCAACACCTCTTATCTTATAGATCCGGATGTTGTCGCAGTCGATGGCGAAGCGGTCGATACGTGCAACTAG(SEQ ID NO.2)

The correspondence between the open reading frame code and the amino acid sequence is shown in FIG. 3.

Further translation verification showed that the true gene sequence (SEQ ID NO.1) starts at the first in-frame ATG of the above open reading frame DNA (SEQ ID NO.2). It is 138 bp long and encodes 45 amino acids with a theoretical molecular weight of 4.95 kDa; this is the Rv2815A (+ |3123619-3123756|) gene.

ATGATGTCAAAAGAACACAAACTCCTCTGCGCTGACAAGCCGTCCCCTTCCGTAGAACGTAACTGCCGCAACACCTCTTATCTTATAGATCCGGATGTTGTCGCAGTCGATGGCGAAGCGGTCGATACGTGCAACTAG(SEQ ID NO.1)

The amino acid sequence of the theoretical encoded product of the gene is shown in SEQ ID NO.3:

MMSKEHKLLCADKPSPSVERNCRNTSYLIDPDVVAVDGEAVDTCN(SEQ ID NO.3)
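The relationship between SEQ ID NO.2, SEQ ID NO.1 and SEQ ID NO.3 can be reproduced with a short Python sketch: scan the stop-to-stop ORF codon by codon for the first in-frame ATG, take the sequence from there as the gene, and translate it. The standard codon assignments are assumed, the function names are hypothetical, and the sketch only restates the sequences given in the text; it does not replace the experimental verification.

```python
# Standard codon table built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# SEQ ID NO.2 (153 bp), copied from the sequence listing; spaces are removed below.
SEQ2 = (
    "taacaatgcc tgctgatgat gtcaaaagaa cacaaactcc tctgcgctga caagccgtcc "
    "ccttccgtag aacgtaactg ccgcaacacc tcttatctta tagatccgga tgttgtcgca "
    "gtcgatggcg aagcggtcga tacgtgcaac tag"
).replace(" ", "").upper()


def first_in_frame_start(orf: str) -> int:
    """Return the 0-based offset of the first in-frame ATG, or -1 if none."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] == "ATG":
            return i
    return -1


def translate_to_stop(seq: str) -> str:
    """Translate codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    start = first_in_frame_start(SEQ2)
    gene = SEQ2[start:]
    protein = translate_to_stop(gene)
    print(start, len(gene), len(protein))
    print(protein)
```

Run on SEQ ID NO.2, this yields a 138 bp gene starting 15 nucleotides into the ORF and a 45-residue product that contains the verified peptide LLCADKPSPSVER, in agreement with the lengths stated above.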

NCBI-BLASTP analysis of the theoretical encoded amino acid sequence shown in SEQ ID NO.3 found no matching sequence, so it is a protein that has not previously been discovered (see FIG. 4). This shows that the product of the Rv2815A (+ |3123619-3123756|) gene we detected is missing from the annotation of the H37Rv strain database.

We performed a local BLAST comparative genomics analysis on the DNA sequence of the Rv2815A (+ |3123619-3123756|) gene. As shown in FIG. 5, the gene sequence is specific to the MTBC family and has no homologous sequence in other species. The Rv2815A (+ |3123619-3123756|) gene sequence found in the H37Rv strain therefore has good sequence specificity and can distinguish MTBC from other mycobacteria of the same genus and from other bacteria that infect the respiratory tract.

Example 2: Establishment of a method for identifying the MTBC

(1) Designing a primer:

Based on the Rv2815A (+ |3123619-3123756|) gene sequence shown in SEQ ID NO.1, PCR primers were designed using Oligo 7.0. The primer sequences are as follows:

F: 5'-CAGCGTGTGGTAACAATGCC-3' (SEQ ID NO.4);

R: 5'-AGCGATGCTGACGAAGGG-3' (SEQ ID NO.5).

The positional relationship between the above primers and the Rv2815A (+ |3123619-3123756|) gene is shown below (SEQ ID NO.7). The primers lie upstream and downstream of the structural gene. In the original document the primer-binding sites are single-underlined and the start and stop codons are double-underlined; in SEQ ID NO.7 the forward primer corresponds to positions 13-32, the start codon ATG to positions 38-40, the stop codon TAG to positions 173-175, and the reverse primer is complementary to positions 189-206.

GGACAATTCGTCCAGCGTGTGGTAACAATGCCTGCTG

ATGATGTCAAAAGAACACAAACTCCTCTGCGCTGACAAGCCGTCCCCTTCCGTAGAACGTAACTGCCGCAACACCTCTTATCTTATAGATCCGGATGTTGTCGCAGTCGATGGCGAAGCGGTCGATACGTGCAACTAG

TTTCGCGAGCTGGCCCTTCGTCAGCATCGCTTCGAATG(SEQ ID NO.7)

(2) Total DNA was extracted from the strains to be tested, including M. tuberculosis H37Rv. Of these, 40 Mycobacterium reference strains are preserved by the National Center for Medical Culture Collections (CMCC), and the other 16 nontuberculous mycobacteria are clinical isolates from the 309th Hospital of the Chinese People's Liberation Army, for which 16S rRNA gene sequencing, sequence comparison and NCBI sequence submission have been completed. The strains to be tested are shown in Table 1:

TABLE 1 related strains selected

(3) The DNA fragment was amplified by polymerase chain reaction (PCR) using the above F/R primers.

PCR system (25 μL): ddH₂O (9.5 μL), 2× Taq PCR MasterMix (TIANGEN, 12.5 μL), primer F (10 μM, 1 μL), primer R (10 μM, 1 μL), DNA template (1 μL).

Amplification program: pre-denaturation at 94 °C for 3 min; 35 cycles of denaturation at 94 °C for 30 s, annealing at 58 °C for 30 s and extension at 72 °C for 1 min; final extension at 72 °C for 5 min.

(4) The amplification products were detected by electrophoresis on an agarose gel in 1× TBE running buffer. As shown in FIG. 6, the MTBC strains and the positive control gave an amplification band at 194 bp; the actual amplification result is consistent with the expected result, with a specificity of 98.3%.

(5) To further verify the sequence of the amplified DNA, we sequenced the amplification product and compared it with the original sequence. As shown in FIG. 7, it matches the expected sequence perfectly, without errors, which further confirms the existence of the new missed-annotation gene.

This indicates that the method for identifying MTBC complex based on Rv2815A (+ |3123619-3123756|) gene is truly reliable.
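The expected amplicon size can be checked with a simple in-silico PCR against SEQ ID NO.7. The Python sketch below assumes exact primer matching (real primers tolerate some mismatches) and a single binding site for each primer. The function names are hypothetical, and the sketch is a consistency check on the sequences in the listing, not a substitute for the gel result in FIG. 6.

```python
from typing import Optional

# SEQ ID NO.7 (213 bp), copied from the sequence listing; spaces are removed below.
SEQ7 = (
    "ggacaattcg tccagcgtgt ggtaacaatg cctgctgatg atgtcaaaag aacacaaact "
    "cctctgcgct gacaagccgt ccccttccgt agaacgtaac tgccgcaaca cctcttatct "
    "tatagatccg gatgttgtcg cagtcgatgg cgaagcggtc gatacgtgca actagtttcg "
    "cgagctggcc cttcgtcagc atcgcttcga atg"
).replace(" ", "").upper()

PRIMER_F = "CAGCGTGTGGTAACAATGCC"  # SEQ ID NO.4
PRIMER_R = "AGCGATGCTGACGAAGGG"    # SEQ ID NO.5


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_pcr(template: str, fwd: str, rev: str) -> Optional[str]:
    """Return the predicted amplicon, or None if either primer has no exact site."""
    template = template.upper()
    start = template.find(fwd.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(rev.upper())
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


if __name__ == "__main__":
    amplicon = in_silico_pcr(SEQ7, PRIMER_F, PRIMER_R)
    print(len(amplicon) if amplicon else "no product")
```

On SEQ ID NO.7 the predicted product spans positions 13 to 206, that is, 194 bp, which agrees with the band size reported in step (4).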

SEQUENCE LISTING

<110> Beijing Proteome Research Center

<120> Mycobacterium tuberculosis H37Rv encoding gene and application thereof

<130> BJ1936-17P121793

<160> 7

<170> PatentIn version 3.3

<210> 1

<211> 138

<212> DNA

<213> Artificial

<220>

<223> Mycobacterium tuberculosis H37Rv encoding gene Rv2815A (+ |3123619-3123756|)

<400> 1

atgatgtcaa aagaacacaa actcctctgc gctgacaagc cgtccccttc cgtagaacgt 60

aactgccgca acacctctta tcttatagat ccggatgttg tcgcagtcga tggcgaagcg 120

gtcgatacgt gcaactag 138

<210> 2

<211> 153

<212> DNA

<213> Artificial

<220>

<223> open reading frame DNA sequence comprising the missed-annotation peptide fragment

<400> 2

taacaatgcc tgctgatgat gtcaaaagaa cacaaactcc tctgcgctga caagccgtcc 60

ccttccgtag aacgtaactg ccgcaacacc tcttatctta tagatccgga tgttgtcgca 120

gtcgatggcg aagcggtcga tacgtgcaac tag 153

<210> 3

<211> 45

<212> PRT

<213> Artificial

<220>

<223> amino acid sequence of the theoretical encoded product of the Rv2815A (+ |3123619-3123756|) gene

<400> 3

Met Met Ser Lys Glu His Lys Leu Leu Cys Ala Asp Lys Pro Ser Pro

1 5 10 15

Ser Val Glu Arg Asn Cys Arg Asn Thr Ser Tyr Leu Ile Asp Pro Asp

20 25 30

Val Val Ala Val Asp Gly Glu Ala Val Asp Thr Cys Asn

35 40 45

<210> 4

<211> 20

<212> DNA

<213> Artificial

<220>

<223> F primer sequences

<400> 4

cagcgtgtgg taacaatgcc 20

<210> 5

<211> 18

<212> DNA

<213> Artificial

<220>

<223> R primer sequences

<400> 5

agcgatgctg acgaaggg 18

<210> 6

<211> 13

<212> PRT

<213> Artificial

<220>

<223> missed-annotation peptide fragment

<400> 6

Leu Leu Cys Ala Asp Lys Pro Ser Pro Ser Val Glu Arg

1 5 10

<210> 7

<211> 213

<212> DNA

<213> Artificial

<220>

<223> combined sequence showing the upstream and downstream primers together with the Rv2815A (+ |3123619-3123756|) gene

<400> 7

ggacaattcg tccagcgtgt ggtaacaatg cctgctgatg atgtcaaaag aacacaaact 60

cctctgcgct gacaagccgt ccccttccgt agaacgtaac tgccgcaaca cctcttatct 120

tatagatccg gatgttgtcg cagtcgatgg cgaagcggtc gatacgtgca actagtttcg 180

cgagctggcc cttcgtcagc atcgcttcga atg 213

Claims

1. A Mycobacterium tuberculosis H37Rv coding gene, which is the H37Rv missed-annotation coding gene Rv2815A (+ |3123619-3123756|), the nucleotide sequence of which is shown in SEQ ID NO.1.

2. The Mycobacterium tuberculosis H37Rv coding gene of claim 1, wherein the gene encodes the amino acid sequence shown in SEQ ID NO.3.

3. A barcode molecular marker for detecting and/or identifying the Mycobacterium tuberculosis complex, comprising the Mycobacterium tuberculosis H37Rv coding gene of claim 1 as a standard detection gene.

4. Specific PCR primers for amplifying the Mycobacterium tuberculosis H37Rv coding gene of claim 1, wherein the primer sequences are as follows:

F: 5'-CAGCGTGTGGTAACAATGCC-3';

R: 5'-AGCGATGCTGACGAAGGG-3'.

5. A method, not for the diagnosis or treatment of disease, for distinguishing strains of the Mycobacterium tuberculosis complex from other species of the genus Mycobacterium, comprising the following steps:

(1) separating and extracting genome DNA from a sample to be detected;

(2) adding an amplification primer by taking the DNA obtained in the step (1) as a template to perform polymerase chain reaction;

(3) carrying out gel electrophoresis analysis and sequencing on the DNA product obtained by amplification in the step (2);

(4) comparing the result of the step (3) with the coding gene of claim 1, and judging whether the mycobacterium tuberculosis complex exists in the sample to be detected according to the homology;

wherein the sequence of the amplification primer in the step (2) is as follows:

F: 5'-CAGCGTGTGGTAACAATGCC-3';

R: 5'-AGCGATGCTGACGAAGGG-3'.