CN115346606B - Method and system for designing targeting probe based on species sequence - Google Patents


Info

Publication number
CN115346606B
CN115346606B CN202211264175.4A CN202211264175A CN115346606B CN 115346606 B CN115346606 B CN 115346606B CN 202211264175 A CN202211264175 A CN 202211264175A CN 115346606 B CN115346606 B CN 115346606B
Authority
CN
China
Prior art keywords
sequence
species
probe
designing
fasta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211264175.4A
Other languages
Chinese (zh)
Other versions
CN115346606A (en)
Inventor
易康
樊晓梅
李靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nuoyin Biotechnology Co ltd
Original Assignee
Nanjing Nuoyin Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Nuoyin Biotechnology Co ltd
Priority to CN202211264175.4A
Publication of CN115346606A
Application granted
Publication of CN115346606B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of biological probe design, and particularly relates to a method and a system for designing a targeting probe based on a species sequence. The disclosed method comprises the following steps: downloading the genome sequence or CDS region sequence of the probe-targeted species from a database; dividing the downloaded genome sequence or CDS region sequence into a plurality of probe sequences of 120 bp in length using a sliding window method; aligning the probe sequences using blast to obtain an alignment result file in json format; and extracting information from the alignment result file, screening the probe sequences to obtain specific probes that match the corresponding species, and outputting the final probe sequences to a directory. The invention also discloses a system for designing the targeting probe based on the species sequence. The invention completes the design from a genome or CDS sequence to specific targeting probes in one step.

Description

Method and system for designing targeting probe based on species sequence
Technical Field
The invention belongs to the technical field of design of biological probes, and particularly relates to a method and a system for designing a targeting probe based on a species sequence.
Background
The development of next-generation sequencing (NGS) methods has revolutionized human clinical research: it enables the rapid generation of large amounts of sequencing data per run while reducing sequencing costs, which facilitates clinical diagnosis, supports more accurate clinical treatment, and saves patient lives. To date, polymerase chain reaction (PCR), which amplifies generally short and conserved genomic regions, has been the gold standard method for the clinical diagnosis of infections. Because of its high specificity, however, PCR may fail to detect microorganisms whose sequences differ too much from those targeted by the designed primers, so part of the information is missed. It is also difficult to capture low levels of pathogen nucleic acid or other genetic material in complex biological samples when the sample size is minimal.
Because of these limitations, hybrid capture methods are now being developed. They enrich targets with designed targeting probes, which allows the targeted genomic fragments to be sequenced with high coverage and facilitates downstream research such as phylogeny, evolution, epidemiology and drug resistance. However, current methods for designing targeting probes have certain drawbacks: 1. the species sequence to be captured needs to be designed manually; 2. the probes lack high specificity, and the capture region is not unique; 3. there is no general method for probe design, and many methods cover only a single kind of species, for example virus capture probes; 4. the design does not meet the requirement of batch design; 5. the information associated with each probe is not clearly defined; 6. the probe design steps are complicated and inconvenient.
Disclosure of Invention
The invention aims to provide a method and a system for designing a targeting probe based on a species sequence. The method automatically acquires the species sequence and, at the same time, the information related to each probe, including its position, GC content, the scientific name of the species, and the like. The invention thus enables the design of specific targeting probes from species sequences to be completed in one step.
The purpose of the invention and the technical problem to be solved are realized by adopting the following technical scheme.
One aspect of the present invention provides a method for designing a targeting probe based on a species sequence, comprising the steps of:
(1) Finding the required genome or CDS sequence in the database, naming it by the taxid and Latin scientific name of the corresponding species, and downloading the fasta sequence file of the species;
(2) Dividing the fasta sequence of the species obtained in step (1) into fasta sequences of consistent length, and uniquely naming each sequence;
(3) Building a database using the nt sequences on the NCBI website as source data, and aligning the fasta sequences of consistent length obtained in step (2) using blastn software to obtain an alignment result file in json format;
(4) Extracting information from the result file of step (3), and screening for the fasta sequences that match the species, namely the probe sequences with high specificity;
(5) Sorting the results of step (4), and outputting the specific probe sequences and the statistical sequence count.
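The alignment in step (3) could be driven from Python by assembling a blastn command line. The sketch below only builds the argument list and does not run anything. The function name, the file names and the database name are illustrative placeholders, and the use of `-outfmt 15` (the single-file JSON output of BLAST+) is an assumption, since the text does not say which option produces the json file.

```python
from typing import List


def build_blastn_command(query_fasta: str, database: str, output_json: str) -> List[str]:
    """Return the argument list for a blastn run that writes JSON output.

    Assumption: BLAST+ is installed and `database` names a local nucleotide
    database built from the NCBI nt sequences.
    """
    return [
        "blastn",
        "-query", query_fasta,   # fasta file holding the fixed-length fragments
        "-db", database,         # local database built from nt
        "-outfmt", "15",         # single-file JSON output (assumed option)
        "-out", output_json,     # alignment result file
    ]


# Hypothetical file names, for illustration only.
command = build_blastn_command("probes.fasta", "nt", "alignment.json")
print(" ".join(command))
```

On a machine with BLAST+ and a local database, the list could be passed to `subprocess.run(command, check=True)`.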
Further, the fasta sequences in step (2) are named as follows: taxid + start and end positions of the sequence in the genome or CDS region + species accession number + species scientific name + GC content.
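A minimal sketch of how such a name might be assembled is given below. The field order follows the text, but the `|` separator, the `start-end` form, the underscore in the species name and the `GC` prefix are assumptions made for illustration. The accession `ACC0001` is a made-up placeholder, not a real record.

```python
def probe_name(taxid: int, start: int, end: int, accession: str,
               species: str, gc_percent: float) -> str:
    """Join the naming fields into one identifier (hypothetical layout)."""
    fields = [
        str(taxid),                   # taxonomic id of the target species
        f"{start}-{end}",             # position in the genome or CDS region
        accession,                    # accession number of the source sequence
        species.replace(" ", "_"),    # scientific name without spaces
        f"GC{gc_percent:.1f}",        # GC content as a percentage
    ]
    return "|".join(fields)


name = probe_name(134821, 1, 120, "ACC0001", "Ureaplasma parvum", 25.0)
print(name)  # 134821|1-120|ACC0001|Ureaplasma_parvum|GC25.0
```

Because the position is part of the name, two fragments from the same source sequence never share an identifier.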
Further, the GC content is the proportion of bases G and C among all bases in the fasta sequence, and fasta sequences with a GC content between 30% and 70% are selected.
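The GC calculation and the 30% to 70% range can be sketched as follows. This assumes that sequences inside the range are the ones kept and that both bounds are inclusive; the text does not state how the boundaries are treated, and the function names are illustrative.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C (0.0 for an empty string)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(sequence: str, low: float = 0.30, high: float = 0.70) -> bool:
    """Keep a fragment only if its GC fraction lies within [low, high]."""
    return low <= gc_content(sequence) <= high


print(gc_content("GGCCAATT"))        # 0.5
print(passes_gc_filter("GGCCAATT"))  # True
print(passes_gc_filter("AAAAAAAATT"))  # False
```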
Further, the fragment size of the fasta sequence in step (2) is 120bp.
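The division into 120 bp fragments is described elsewhere in this document as a sliding window method. A minimal sketch follows, assuming that the 50 bp value given in Example 1 is the distance between consecutive window starts; that reading is an interpretation, since the text calls it a window length. Coordinates are reported as 1-based inclusive positions, and a trailing stretch shorter than 120 bp is dropped; both choices are assumptions.

```python
from typing import List, Tuple


def sliding_windows(sequence: str, window: int = 120,
                    step: int = 50) -> List[Tuple[int, int, str]]:
    """Cut a sequence into fixed-length fragments.

    Returns (start, end, fragment) tuples with 1-based inclusive coordinates.
    """
    fragments = []
    for start in range(0, len(sequence) - window + 1, step):
        fragment = sequence[start:start + window]
        fragments.append((start + 1, start + window, fragment))
    return fragments


# Toy 300 bp sequence, for illustration only.
genome = "ACGT" * 75
fragments = sliding_windows(genome)
for start, end, fragment in fragments:
    print(start, end, len(fragment))
```

With overlapping windows, a region rejected in one fragment may still be covered by a neighbouring fragment that passes the later screening.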
Further, the screening conditions in step (4) are that the sequence aligns uniquely to the species with more than 50 consecutive aligned bp, or that the sequence does not align uniquely to the species but its consecutive aligned bp to other species number fewer than 40.
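One possible reading of these screening conditions is sketched below. Each hit is modelled as a (species, consecutive aligned bp) pair, which is a simplification introduced here. Rejecting a fragment that has no hit to the target species is also an assumption, because the text does not cover that case.

```python
from typing import List, Tuple


def is_specific(hits: List[Tuple[str, int]], target: str,
                min_target_bp: int = 50, max_off_target_bp: int = 40) -> bool:
    """Apply the two screening conditions to the hits of one fragment."""
    target_runs = [bp for species, bp in hits if species == target]
    other_runs = [bp for species, bp in hits if species != target]
    if not target_runs:
        # Assumption: a fragment that never hits the target is discarded.
        return False
    if not other_runs:
        # Unique alignment to the target: require more than 50 consecutive bp.
        return max(target_runs) > min_target_bp
    # Not unique: every off-target run must be shorter than 40 bp.
    return max(other_runs) < max_off_target_bp


target = "Ureaplasma parvum"
print(is_specific([(target, 120)], target))                         # True
print(is_specific([(target, 120), ("Other species", 35)], target))  # True
print(is_specific([(target, 120), ("Other species", 60)], target))  # False
```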
In another aspect of the present invention, there is provided a system for designing a targeting probe based on a sequence of a species, comprising a storage device, a processor connected to the storage device, and a computer program running on the processor, wherein the computer program is executed by the processor to perform the method for designing a targeting probe as described above.
Further, the computer program is written based on the python language.
By means of the technical scheme, the invention at least has the following advantages:
1. The invention can conveniently acquire the species sequence (genome or CDS region sequence) for which probes are to be designed, according to the user's needs.
2. For the designed probe sequences, information is provided about each sequence, including the name of the targeted species, the relative position of the sequence in the genome, the taxonomic id of the targeted species, and the GC content.
3. Because blast alignment is used, the alignment results of the designed probe sequences essentially point to the same species, so the probe sequences have good species specificity.
4. The design is not limited to a single species and supports batch design; species for which probes can be designed include viruses, bacteria and fungi.
5. After a species is input, all available probe sequences for that species are obtained, and 5 recommended preferred probes are provided for each species, which reduces the user's effort in selecting probes.
6. Probe acquisition is convenient and fast: only the species to be designed needs to be input, and a series of intermediate files and final result files that are easy for the user to check are output.
7. The operation time of the probe design program is short, and the operation occupies less memory.
The foregoing is a summary of the present invention, and in order to provide a clear understanding of the technical means of the present invention and to be implemented in accordance with the present specification, the following is a detailed description of the preferred embodiments of the present invention.
Drawings
FIG. 1 is a downloaded genomic or CDS region sequence;
FIG. 2 shows a fragment of a fasta sequence of uniform length;
FIG. 3 is the result information file of the sequences aligned using blastn;
FIG. 4 shows the available probe sequences with high specificity after sorting;
FIG. 5 is a statistical count of probe sequences after sorting;
FIG. 6 is a flow chart of the present invention.
Detailed Description
In order to make the technical means, the creation features, the achievement purposes and the effects of the invention easy to understand, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Taking Ureaplasma parvum as an example, the method for designing a targeting probe based on a species sequence comprises the following steps:
(1) The genome or CDS sequence of Ureaplasma parvum was found on the NCBI official website (or in another species database), and the fasta sequence file was downloaded and named after its taxid (134821) and Latin scientific name (Ureaplasma parvum).
(2) The genome sequence of Ureaplasma parvum was divided into fragments of 120 bp in length (as shown in FIG. 2) using a sliding window method (window length 50 bp). Each fragment has a unique name, and the information of each fragment sequence was recorded, including the species scientific name, the relative position of the sequence, the taxonomic id of the targeted species, the GC content, and the like.
(3) The probe sequences obtained in step (2) were aligned using blast to obtain a json file of the alignment results, and the required information was extracted to obtain an alignment information file (as shown in FIG. 3). The information includes: sequence name, accession number of the aligned hit, description of the aligned hit, number of aligned bp, number of bp missing from the alignment, the sequence, and the alignment hit status.
(4) Several explicit screening conditions were applied to obtain the available probe sequences with high specificity for Ureaplasma parvum (as shown in FIG. 4). The screening conditions are that the sequence aligns uniquely to Ureaplasma parvum with more than 50 consecutive aligned bp, or that the sequence does not align uniquely to Ureaplasma parvum but its consecutive aligned bp to other species number fewer than 40.
(5) The specific probes obtained were counted and recorded (as shown in FIG. 5). The displayed information includes: taxid, species Chinese name, species scientific name, total count, and the like.
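The information extraction described in step (3) of this example can be sketched as a walk over the parsed json result. The nesting used below (`BlastOutput2`, `report`, `results`, `search`, `hits`, `description`, `hsps`) is assumed to match the single-file JSON output of BLAST+; the text does not show the file layout. The sample record is made up for illustration and is not real BLAST output, and the mapping of `gaps` to the missing bp count is an assumption.

```python
import json
from typing import Dict, List


def extract_alignment_rows(blast_output: Dict) -> List[Dict]:
    """Flatten a parsed BLAST JSON result into one row per aligned segment."""
    rows = []
    for entry in blast_output.get("BlastOutput2", []):
        search = entry["report"]["results"]["search"]
        for hit in search.get("hits", []):
            description = hit["description"][0]  # first description of the hit
            for hsp in hit["hsps"]:
                rows.append({
                    "query": search.get("query_title", ""),
                    "accession": description.get("accession", ""),
                    "title": description.get("title", ""),
                    "aligned_bp": hsp["align_len"],
                    "gap_bp": hsp.get("gaps", 0),
                })
    return rows


# Made-up sample in the assumed layout; ACC0001 is a placeholder accession.
sample_text = json.dumps({"BlastOutput2": [{"report": {"results": {"search": {
    "query_title": "probe_1",
    "hits": [{
        "description": [{"accession": "ACC0001", "title": "Ureaplasma parvum example record"}],
        "hsps": [{"align_len": 120, "identity": 120, "gaps": 0}],
    }],
}}}}]})

rows = extract_alignment_rows(json.loads(sample_text))
for row in rows:
    print(row)
```

Rows in this flat form can be written to a tab-separated file and then passed to the screening of step (4).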
Example 2
This embodiment provides a system for designing a targeting probe based on a species sequence, which comprises a storage device, a processor connected to the storage device, and a computer program that can run on the processor; when executed by the processor, the computer program performs the method for designing the targeting probe described above. The computer program is written in the python language.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for designing a targeting probe based on a species sequence, comprising the steps of:
(1) Finding the required genome or CDS sequence from the database, naming it by the taxid and Latin scientific name of the corresponding species, and downloading the fasta sequence file of the species;
(2) Segmenting the species fasta sequence obtained in the step (1) into fasta sequences with consistent length, and uniquely naming each sequence;
(3) Building a database using the nt sequences on the NCBI website as source data, and aligning the fasta sequences with the consistent length obtained in the step (2) based on blastn software to obtain an alignment result file in json format;
(4) Extracting information of the result file in the step (3), and screening a fasta sequence which accords with the species according to screening conditions, wherein the fasta sequence is a probe sequence with high specificity;
(5) Sorting the results in the step (4), and outputting a specific probe sequence and a statistical sequence number;
wherein the fragment size of the fasta sequence in the step (2) is 120bp;
the screening condition in the step (4) is that the sequence is uniquely aligned to the species and the number of consecutive aligned bp is more than 50, or that the sequence is not uniquely aligned to the species but the number of consecutive bp aligned to other species is less than 40.
2. The method for designing a targeting probe based on a species sequence according to claim 1, wherein the fasta sequence in step (2) is named as taxid + start and stop positions of the sequence in the genome or CDS region + species accession number + species scientific name + GC content.
3. The method for designing a targeting probe based on a species sequence as claimed in claim 2, wherein the GC content is the ratio of the bases G and C in the fasta sequence to all the bases in the sequence, and fasta sequences with a GC content ranging from 30% to 70% are selected.
4. A system for designing a targeting probe based on a sequence of a species, comprising a storage device, a processor coupled thereto, and a computer program in the storage device and capable of running on the processor, wherein the computer program when executed by the processor performs the method of designing a targeting probe according to any one of claims 1-3.
5. The system for designing targeting probes based on species sequences according to claim 4, wherein said computer program is written based on python language.
CN202211264175.4A 2022-10-17 2022-10-17 Method and system for designing targeting probe based on species sequence Active CN115346606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211264175.4A CN115346606B (en) 2022-10-17 2022-10-17 Method and system for designing targeting probe based on species sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211264175.4A CN115346606B (en) 2022-10-17 2022-10-17 Method and system for designing targeting probe based on species sequence

Publications (2)

Publication Number Publication Date
CN115346606A CN115346606A (en) 2022-11-15
CN115346606B CN115346606B (en) 2023-03-24

Family

ID=83957162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211264175.4A Active CN115346606B (en) 2022-10-17 2022-10-17 Method and system for designing targeting probe based on species sequence

Country Status (1)

Country Link
CN (1) CN115346606B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009018174A1 (en) * 2007-07-27 2009-02-05 Biological Targets, Inc. Methods to design probes and primers
CN111455102A (en) * 2020-04-09 2020-07-28 上海符贝基因科技有限公司 Preparation method of capture probe for target sequencing of new coronavirus SARS-CoV-2 genome
CN113046478A (en) * 2021-03-18 2021-06-29 深圳人体密码基因科技有限公司 Targeted capture sequencing detection method for pathogenic microorganisms in respiratory system
CN113403368A (en) * 2021-07-01 2021-09-17 南京诺因生物科技有限公司 Method and system for designing primer based on characteristic sequence

Also Published As

Publication number Publication date
CN115346606A (en) 2022-11-15

Similar Documents

Publication Publication Date Title
US20230366046A1 (en) Systems and methods for analyzing viral nucleic acids
Al-Ghalith et al. NINJA-OPS: fast accurate marker gene alignment using concatenated ribosomes
Williams et al. RNA‐seq data: challenges in and recommendations for experimental design and analysis
De Oliveira et al. An automated genotyping system for analysis of HIV-1 and other microbial sequences
US8032310B2 (en) Computer-implemented method, computer readable storage medium, and apparatus for identification of a biological sequence
Powell et al. Empirical evaluation of partitioning schemes for phylogenetic analyses of mitogenomic data: an avian case study
KR20190117529A (en) Method and system for generation and error correction of unique molecular index sets with heterogeneous molecular length
CN104169927A (en) Compact next generation sequencing database and efficient sequence processing using same
CN112687337B (en) Super multiplex primer design method
CN116064755B (en) Device for detecting MRD marker based on linkage gene mutation
Volinia et al. GOAL: automated Gene Ontology analysis of expression profiles
Bayly-Jones et al. Mining folded proteomes in the era of accurate structure prediction
CN115346606B (en) Method and system for designing targeting probe based on species sequence
US20140058682A1 (en) Nucleic Acid Information Processing Device and Processing Method Thereof
EP1608786B1 (en) Genomic profiling of regulatory factor binding sites
CN110136780B (en) Method for constructing probe specificity database based on comparison algorithm
US20170076047A1 (en) Systems and methods for genetic testing
WO2016123472A2 (en) Analyzing characteristics of genomic regions of a genome
CN108624667A (en) Method and device for analyzing T cell receptor library based on next-generation sequencing
Wang et al. Quantitative translation of dog-to-human aging by conserved remodeling of epigenetic networks
CN115719616A (en) Method and system for screening specific sequences of pathogenic species
CN115679002A (en) Detection method and application of yak Wnt7A gene CNV marker
Xu et al. DriverGenePathway: Identifying driver genes and driver pathways in cancer based on MutSigCV and statistical methods
Guydosh ribofootPrinter: A precision python toolbox for analysis of ribosome profiling data
EP1134687A2 (en) Method for displaying results of hybridization experiments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant