CN115346606B

CN115346606B - Method and system for designing targeting probe based on species sequence

Info

Publication number: CN115346606B
Application number: CN202211264175.4A
Authority: CN
Inventors: 易康; 樊晓梅; 李靖
Original assignee: Nanjing Nuoyin Biotechnology Co ltd
Current assignee: Nanjing Nuoyin Biotechnology Co ltd
Priority date: 2022-10-17
Filing date: 2022-10-17
Publication date: 2023-03-24
Anticipated expiration: 2042-10-17
Also published as: CN115346606A

Abstract

The invention belongs to the technical field of biological probe design, and particularly relates to a method and a system for designing a targeting probe based on a species sequence. The method comprises the following steps: downloading the genome sequence or CDS region sequence of the species targeted by the probe from a database; dividing the downloaded genome sequence or CDS region sequence into a plurality of probe sequences 120 bp in length using a sliding window method; aligning the probe sequences with blast to obtain an alignment result file in json format; and extracting information from the alignment result file, screening the probe sequences to obtain probes specific to the corresponding species, and outputting the final probe sequences to a directory. The invention also discloses a system for designing the targeting probe based on the species sequence. The invention can complete the design from genome or CDS sequence to specific targeting probe in one step.

Description

Method and system for designing targeting probe based on species sequence

Technical Field

The invention belongs to the technical field of design of biological probes, and particularly relates to a method and a system for designing a targeting probe based on a species sequence.

Background

The development of next-generation sequencing (NGS) methods has revolutionized human clinical research: NGS rapidly generates large amounts of sequencing data per run while reducing sequencing costs, which facilitates clinical diagnosis, enables more accurate clinical treatment, and saves patient lives. To date, polymerase chain reaction (PCR) has been the gold standard for clinical diagnosis of infection, based on amplification of generally short and conserved genomic regions. Because of its high specificity, however, PCR may fail to detect microorganisms whose sequences differ too much from those targeted by the designed primers, so part of the information is lost. It is also difficult to capture low levels of pathogen nucleic acid or other genetic material in complex biological samples when the sample size is minimal.

Because of these limitations, hybrid capture methods are now being developed. These methods enrich targets with designed targeting probes, which allows the captured genomic fragments to be sequenced with high coverage and facilitates downstream research such as phylogeny, evolution, epidemiology and drug resistance. However, current methods for designing targeting probes have certain drawbacks: 1. the sequence of the species to be captured has to be obtained manually; 2. the probes are not highly specific, and the capture region is not unique; 3. there is no general method for probe design, and many methods serve a single type of species, for example virus capture probes; 4. the methods do not meet the requirement of batch design; 5. the information associated with each probe is not clearly recorded; 6. the probe design steps are complicated and inconvenient.

Disclosure of Invention

The invention aims to provide a method and a system for designing a targeting probe based on a species sequence. The method automatically acquires the species sequence and at the same time records the information associated with each probe, including its position, its GC content and the scientific name of the species it represents. The invention thereby enables the design of specific targeting probes from species sequences to be completed in one step.

The purpose of the invention and the technical problem to be solved are realized by adopting the following technical scheme.

One aspect of the present invention provides a method for designing a targeting probe based on a species sequence, comprising the steps of:

(1) Finding the required genome or CDS sequence in the database, naming it by the taxid and Latin scientific name of the corresponding species, and downloading the fasta sequence file of the species;

(2) Dividing the fasta sequence of the species obtained in the step (1) into fasta sequences with consistent length, and uniquely naming each sequence;

(3) Building a database using the nt sequences on the NCBI website as source data, and aligning the fasta sequences of consistent length obtained in step (2) against it with the blastn software to obtain an alignment result file in json format;

(4) Extracting information of the result file in the step (3), and screening a fasta sequence which accords with the species, namely a probe sequence with high specificity;

(5) Sorting the results of step (4), and outputting the specific probe sequences and the statistical count of sequences.
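Step (3) above can be sketched as a small helper that assembles the blastn command line. This is a minimal illustration, not the patented program: the file names, the thread count and the helper name `build_blastn_command` are hypothetical, and the choice of `-outfmt 15` (single-file BLAST JSON in BLAST+) is an assumption, since the text only says that the result is in json format.

```python
def build_blastn_command(query_fasta, db_name, out_json, threads=4):
    """Return the argument list for a blastn run that writes JSON output.

    Only the command is assembled here; nothing is executed.
    """
    return [
        "blastn",
        "-query", str(query_fasta),    # fasta file of fixed-length fragments
        "-db", str(db_name),           # local database built from NCBI nt
        "-outfmt", "15",               # assumption: single-file BLAST JSON
        "-out", str(out_json),
        "-num_threads", str(threads),  # assumption: not specified in the text
    ]


# Hypothetical file names, for illustration only.
cmd = build_blastn_command("probes.fasta", "nt", "probes.json")
print(" ".join(cmd))
```

To actually run the alignment, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a local nt database are installed.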

Further, each fasta sequence in step (2) is named as: taxid + start and end positions of the sequence in the genome or CDS region + species accession number + species scientific name + GC content.

Further, the GC content is the proportion of bases G and C among all bases in the fasta sequence, and fasta sequences with a GC content between 30% and 70% are selected.
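The GC content definition can be illustrated with a short sketch. It assumes that sequences whose GC content falls within 30% to 70% are the ones kept, and that both bounds are inclusive; the text does not state how the bounds are treated, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(seq, low=0.30, high=0.70):
    """True when the GC fraction lies within [low, high] (inclusive bounds assumed)."""
    return low <= gc_content(seq) <= high


print(gc_content("ATGC"))              # half of the bases are G or C
print(passes_gc_filter("AAAAAAAAAT"))  # no G or C at all, so it is rejected
```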

Further, the fragment size of the fasta sequence in step (2) is 120bp.
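The splitting and naming of fragments might be sketched as follows. The fragment length of 120 bp comes from the text; the step of 50 bp between window starts is an assumption based on the 50 bp figure given in Example 1, and the underscore separators, the 1-based positions, the `GC` prefix and the accession `ACC0001` are hypothetical choices made only for this illustration.

```python
def tile_sequence(seq, taxid, accession, species, length=120, step=50):
    """Yield (name, fragment) pairs for every full-length window of seq.

    Names combine taxid, 1-based start-end position, accession,
    scientific name and GC content, which keeps them unique per window.
    """
    seq = seq.upper()
    label = species.replace(" ", "_")
    for start in range(0, len(seq) - length + 1, step):
        fragment = seq[start:start + length]
        gc = (fragment.count("G") + fragment.count("C")) / length
        name = f"{taxid}_{start + 1}-{start + length}_{accession}_{label}_GC{gc:.2f}"
        yield name, fragment


# Toy 240 bp sequence; the accession is a placeholder, not a real record.
toy_genome = "ACGT" * 60
fragments = list(tile_sequence(toy_genome, 134821, "ACC0001", "Ureaplasma parvum"))
for name, fragment in fragments:
    print(f">{name}\n{fragment}")
```

A trailing stretch shorter than 120 bp is dropped in this sketch so that every fragment has the same length; whether the original program does the same is not stated.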

Further, the screening condition in step (4) is that the sequence aligns uniquely to the species with more than 50 consecutive aligned bp, or, where the sequence does not align uniquely to the species but also aligns to other species, that the number of consecutive bp aligned to the other species is less than 40.
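One possible reading of this screening condition is sketched below. The interpretation is an assumption: a probe passes if all of its hits belong to the target species and the longest consecutive aligned run exceeds 50 bp, or if it also hits other species but the longest consecutive run on any other species is below 40 bp. The function name, the `(taxid, run_length)` input shape and the requirement that a target hit exists in the second case are all hypothetical.

```python
def is_specific(hits, target_taxid, min_target_run=50, max_offtarget_run=40):
    """Decide whether a probe is specific to target_taxid.

    hits is a list of (taxid, consecutive_aligned_bp) pairs for one probe.
    """
    target_runs = [run for taxid, run in hits if taxid == target_taxid]
    other_runs = [run for taxid, run in hits if taxid != target_taxid]
    if not target_runs:
        return False  # assumption: a probe must align to its own species
    if not other_runs:
        # Uniquely aligned to the target species.
        return max(target_runs) > min_target_run
    # Also aligned elsewhere: off-target runs must stay short.
    return max(other_runs) < max_offtarget_run


# Toy hit lists; taxid 2130 stands in for an arbitrary other species.
print(is_specific([(134821, 120)], 134821))
print(is_specific([(134821, 120), (2130, 80)], 134821))
```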

In another aspect of the present invention, there is provided a system for designing a targeting probe based on a sequence of a species, comprising a storage device, a processor connected to the storage device, and a computer program running on the processor, wherein the computer program is executed by the processor to perform the method for designing a targeting probe as described above.

Further, the computer program is written based on the python language.

By means of the technical scheme, the invention at least has the following advantages:

1. The invention can conveniently acquire the species sequence (genome or CDS region sequence) for which probes are to be designed, according to the user's needs.

2. For the designed probe sequences, information is provided about each sequence, including the name of the targeted species, the relative position of the sequence in the genome, the taxonomic id of the targeted species, and the GC content.

3. Because blast alignment is used, the alignment results of each designed probe point essentially to a single species, so the probe sequences have good species specificity.

4. The design is not limited to a single species and supports batch design; the species that can be designed for include viruses, bacteria and fungi.

5. After a species is input, all available probe sequences for that species are obtained, and five preferred probes are recommended for each species, which reduces the user's effort in selecting probes.

6. Probe acquisition is convenient and fast: the user only needs to input the species to be designed for, and a series of intermediate files and final result files are output for easy inspection.

7. The probe design program has a short running time and uses little memory.

The foregoing is a summary of the present invention, and in order to provide a clear understanding of the technical means of the present invention and to be implemented in accordance with the present specification, the following is a detailed description of the preferred embodiments of the present invention.

Drawings

FIG. 1 shows a downloaded genome or CDS region sequence;

FIG. 2 shows fasta sequence fragments of uniform length;

FIG. 3 shows the result information file of sequences aligned using blastn;

FIG. 4 shows the available probe sequences with high specificity after screening;

FIG. 5 shows the statistical count of probe sequences after sorting;

FIG. 6 is a flow chart of the present invention.

Detailed Description

In order to make the technical means, the creation features, the achievement purposes and the effects of the invention easy to understand, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Taking Ureaplasma parvum as an example, the method for designing a targeting probe based on a species sequence comprises the following steps:

(1) The genome or CDS sequence of Ureaplasma parvum was found on the NCBI official website (or in another species database) and named by its taxid and Latin scientific name, and a fasta sequence file was downloaded, named 134821 u ureaplasma parvum.

(2) The genome sequence of Ureaplasma parvum was divided into fragments 120 bp in length (as shown in FIG. 2) using a sliding window method (window length 50 bp). Every fragment has a unique name, and the information of each fragment sequence was recorded, including the scientific name of the species, the relative position of the sequence, the taxonomic id of the targeted species, and the GC content.

(3) The probe sequences obtained in step (2) were aligned with blast to obtain a json file of the alignment results, and the required information was extracted to obtain an alignment information file (as shown in FIG. 3). The information comprises: the sequence name, the accession id of the aligned hit, the description of the aligned hit, the number of aligned bp, the number of bp missing from the alignment, the sequence, and the alignment hit status.

(4) Explicit screening conditions were applied to obtain the available probe sequences with high specificity for Ureaplasma parvum (as shown in FIG. 4). The screening condition is that the sequence aligns uniquely to Ureaplasma parvum with more than 50 consecutive aligned bp, or, where the sequence does not align uniquely to Ureaplasma parvum but also aligns to other species, that the number of consecutive bp aligned to the other species is less than 40.

(5) The specific probes obtained were counted and recorded (as shown in FIG. 5). The displayed information includes the taxid, the Chinese name of the species, the scientific name of the species, the total count, and the like.
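The information extraction described in step (3) of this example might look like the sketch below. It assumes the single-file JSON layout produced by BLAST+ (`BlastOutput2`, `report`, `results`, `search`, `hits`, `description`, `hsps`) and should be checked against real output before use. Taking the longest run of `|` characters in the `midline` string as the "number of consecutive aligned bp" is also an assumption, and the toy record, its accession and its title are invented for illustration.

```python
def longest_match_run(midline):
    """Length of the longest unbroken run of matches in a BLAST midline."""
    return max(len(run) for run in midline.split(" "))


def extract_hits(blast_json):
    """Flatten a parsed BLAST JSON document into one row per HSP."""
    rows = []
    for entry in blast_json["BlastOutput2"]:
        search = entry["report"]["results"]["search"]
        for hit in search.get("hits", []):
            desc = hit["description"][0]
            for hsp in hit["hsps"]:
                rows.append({
                    "query": search["query_title"],
                    "accession": desc["accession"],
                    "title": desc["title"],
                    "taxid": desc.get("taxid"),
                    "align_len": hsp["align_len"],
                    "gaps": hsp["gaps"],
                    "longest_run": longest_match_run(hsp["midline"]),
                })
    return rows


# Invented toy record; a real file would be loaded with json.load().
toy_result = {"BlastOutput2": [{"report": {"results": {"search": {
    "query_title": "probe_1",
    "hits": [{
        "description": [{"accession": "ACC0001", "title": "toy hit", "taxid": 134821}],
        "hsps": [{"align_len": 9, "gaps": 0, "midline": "||||| |||"}],
    }],
}}}}]}

rows = extract_hits(toy_result)
print(rows[0]["accession"], rows[0]["longest_run"])
```

The resulting rows carry the taxid and run length needed to apply the screening condition of step (4) and to tally specific probes per species in step (5).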

Example 2

This embodiment provides a system for designing a targeting probe based on a species sequence, which comprises a storage device, a processor connected to the storage device, and a computer program that can run on the processor, wherein the computer program, when executed by the processor, performs the method for designing the targeting probe described above. The computer program is written in the python language.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for designing a targeting probe based on a species sequence, comprising the steps of:

(1) Finding the required genome or CDS sequence in the database, naming it by the taxid and Latin scientific name of the corresponding species, and downloading the fasta sequence file of the species;

(2) Segmenting the species fasta sequence obtained in the step (1) into fasta sequences with consistent length, and uniquely naming each sequence;

(3) Building a database using the nt sequences on the NCBI website as source data, and aligning the fasta sequences of consistent length obtained in step (2) against it with the blastn software to obtain an alignment result file in json format;

(4) Extracting information of the result file in the step (3), and screening a fasta sequence which accords with the species according to screening conditions, wherein the fasta sequence is a probe sequence with high specificity;

(5) Sorting the results in the step (4), and outputting a specific probe sequence and a statistical sequence number;

wherein the fragment size of the fasta sequence in the step (2) is 120bp;

the screening condition in the step (4) is that the sequence is uniquely aligned to the species and the number of continuous bp is more than 50, or the number of continuous bp which is not uniquely aligned to the species but is aligned to other species is less than 40bp.

2. The method of claim 1, wherein each fasta sequence in step (2) is named as: taxid + start and end positions of the sequence in the genome or CDS region + species accession number + species scientific name + GC content.

3. The method for designing a targeting probe based on a species sequence as claimed in claim 2, wherein the GC content is the ratio of the bases G and C in the fasta sequence to all the bases in the sequence, and fasta sequences with a GC content between 30% and 70% are selected.

4. A system for designing a targeting probe based on a sequence of a species, comprising a storage device, a processor coupled thereto, and a computer program in the storage device and capable of running on the processor, wherein the computer program when executed by the processor performs the method of designing a targeting probe according to any one of claims 1-3.

5. The system for designing targeting probes based on species sequences according to claim 4, wherein said computer program is written based on python language.