CN111816254A - Method for quickly removing carrier sequences in batches based on perl language
- Publication number: CN111816254A (application CN202010485310.2A)
- Authority: CN (China)
- Prior art keywords: sequences, sequence, matching, vector, insertion site
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a Perl-based method for quickly removing vector sequences in batches, comprising the following steps. Step one: use a sequencing result file in text format, or use Chromas software to convert the format of the sequencing peak map files. Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site; the sequences at the two ends of the insertion site, as set in the script, are compared against the sequences in the sequencing result. Step three: filter out unqualified sequencing results; any sequence in which the sequences at the two ends of the insertion site match neither completely nor over more than 7 bases is treated by default as unqualified and is not output. Step four: output the sequences with the vector removed, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. The method can process multiple sequences at once and does not limit the amount of data handled in a single run.
Description
Technical Field
The invention relates to the field of gene detection, and in particular to a method for quickly removing vector sequences in batches based on the Perl language.
Background
Sanger sequencing is the representative first-generation sequencing technique. Synthesis starts at a fixed point and terminates randomly at a specific base, and each terminating base is fluorescently labeled, which produces four sets of fragments of different lengths ending in A, T, C, or G. The fragments are then separated by electrophoresis on a urea-denaturing PAGE gel and detected, yielding a readable DNA base sequence. The main features of Sanger sequencing are a read length of up to 1000 bp and an accuracy as high as 99.999%, and it is one of the mainstream sequencing methods in China. For some complex DNA templates or synthesized gene sequences, however, it is difficult to obtain the target sequence by direct Sanger sequencing. In such cases molecular cloning is used: DNA derived from a virus, a plasmid, or the cells of a higher organism typically serves as the cloning vector, and a foreign DNA fragment of suitable size is inserted into it, taking care not to disrupt the vector's ability to replicate itself. The recombinant vector is introduced into a host cell, propagated there in large quantities, and then analyzed for biological significance. The vector sequence itself is not part of that analysis and has to be removed. Existing methods remove it manually, which is time-consuming, laborious, error-prone, and very inefficient, and no software exists for removing vector sequences from multiple sequencing results at once.
Disclosure of Invention
To overcome the above shortcomings of the prior art, the present invention aims to provide a method for quickly removing vector sequences in batches based on the Perl language.
To achieve this purpose, the following technical scheme is adopted:
A method for quickly removing vector sequences in batches based on the Perl language comprises the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of the sequencing peak map files;
Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site;
the sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; when they match completely or over more than 7 bases, the sequence between them is taken by default as the target sequence and extracted;
Step three: filter out unqualified sequencing results;
the sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; any sequence in which they match neither completely nor over more than 7 bases is treated by default as unqualified and is not output;
Step four: output the sequences with the vector removed, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site.
In a preferred embodiment of the present invention, the format conversion includes:
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted file in .fa format.
The invention has the following beneficial effects:
The Perl-based method for quickly removing vector sequences in batches can remove the vector from multiple sequences at once and does not limit the amount of data processed in a single run.
Drawings
FIG. 1 is a schematic view of the results before removal of the vector.
FIG. 2 is a schematic diagram of the results before removal of the vector.
FIG. 3 is a graph showing the results after removal of the vector by the method of the present invention.
FIG. 4 is a block flow diagram of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which should not be construed as limiting the invention thereto.
The existing analysis is performed in one of two ways:
In the first, the sequences at the two ends of the vector insertion site are located manually in the sequencing text result by sequence search, and the vector sequence is deleted by hand;
In the second, the sequences at the two ends of the vector insertion site are entered in Chromas under Options → Vector Sequences, each sequencing result is opened individually, and the vector is removed on Export; this approach cannot be used when there is a mismatch in the middle of the sequence.
Both methods require each sequencing result to be processed manually, one at a time, which is extremely cumbersome.
Compared with the prior art, the invention does not limit the amount of data processed in a single run.
Example 1
Example 1 specifically includes the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of the sequencing peak map files;
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted file in .fa format.
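As an illustration of how the .fa files produced by step one might be gathered for batch processing, the following Python sketch reads every FASTA file in a directory into a single table of reads. It is not the Perl script of the invention, which is not reproduced in this document. The function name `load_reads`, the `*.fa` pattern, and the choice to key each read by the first word of its header line are assumptions made for the example.

```python
from pathlib import Path
from typing import Dict


def load_reads(directory: str, pattern: str = "*.fa") -> Dict[str, str]:
    """Read all FASTA files matching `pattern` in `directory`.

    Returns a mapping from read name to its upper-cased sequence.
    A file may hold one record or many; there is no limit on the count.
    """
    reads: Dict[str, str] = {}
    for path in sorted(Path(directory).glob(pattern)):
        name = None
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                # Header line: keep the first word as the read name,
                # falling back to the file name if the header is empty.
                fields = line[1:].split()
                name = fields[0] if fields else path.stem
                reads[name] = ""
            elif name is not None:
                # Sequence line: FASTA allows a sequence to wrap over lines.
                reads[name] += line.upper()
    return reads
```

If two records share a name, the later one replaces the earlier one in this sketch; a production script would likely need to detect that case.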
Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site;
The sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; when they match completely or over more than 7 bases, the sequence between them is taken by default as the target sequence and extracted.
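The matching rule of step two can be sketched as follows. The text does not state how a match of "more than 7 bases" is located, so this Python example assumes that the bases closest to the insertion site are the ones that must match: it tries progressively shorter anchors, down to 8 bases, taken from the insertion-site end of each flank. The function names and the `MIN_MATCH` constant are hypothetical, and the code is an illustration rather than the Perl script of the invention.

```python
from typing import Optional

MIN_MATCH = 8  # "more than 7 bases", read here as at least 8 (assumption)


def find_left_end(read: str, left_flank: str, min_match: int = MIN_MATCH) -> Optional[int]:
    """Return the index just after the upstream flank, or None if not found."""
    for k in range(len(left_flank), min_match - 1, -1):
        anchor = left_flank[-k:]  # the k bases adjacent to the insertion site
        pos = read.find(anchor)
        if pos != -1:
            return pos + k
    return None


def find_right_start(read: str, right_flank: str, start: int,
                     min_match: int = MIN_MATCH) -> Optional[int]:
    """Return the index where the downstream flank begins, or None."""
    for k in range(len(right_flank), min_match - 1, -1):
        anchor = right_flank[:k]  # the k bases adjacent to the insertion site
        pos = read.find(anchor, start)
        if pos != -1:
            return pos
    return None


def extract_insert(read: str, left_flank: str, right_flank: str) -> Optional[str]:
    """Return the sequence between the two flanks, or None if either is missing."""
    start = find_left_end(read, left_flank)
    if start is None:
        return None
    end = find_right_start(read, right_flank, start)
    if end is None:
        return None
    return read[start:end]
```

Trying the longest anchor first means a complete match is preferred whenever it exists, and the shorter anchors only come into play for reads in which part of a flank is missing or miscalled.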
Step three: filter out unqualified sequencing results;
The sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; any sequence in which they match neither completely nor over more than 7 bases is treated by default as unqualified and is not output.
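The filtering of step three amounts to splitting the reads into those for which a target sequence could be extracted and those for which it could not. The Python sketch below shows one way to express that split. The name `filter_reads` is hypothetical, and `exact_trim` is only a stand-in that requires a complete match of two made-up 8-base flanks; any trimming function that returns `None` for an unqualified read could be passed in its place.

```python
from typing import Callable, Dict, List, Optional, Tuple


def filter_reads(
    reads: Dict[str, str],
    trim: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Apply `trim` to every read; keep the successes, list the failures."""
    kept: Dict[str, str] = {}
    rejected: List[str] = []
    for name, seq in reads.items():
        target = trim(seq)
        if target is None:
            rejected.append(name)  # unqualified: not written to the output
        else:
            kept[name] = target
    return kept, rejected


def exact_trim(seq: str, left: str = "TCGGTACC", right: str = "GGATCCTC") -> Optional[str]:
    """Stand-in trimmer: both flanks must match completely (illustrative flanks)."""
    _, found_left, rest = seq.partition(left)
    if not found_left:
        return None
    target, found_right, _ = rest.partition(right)
    if not found_right:
        return None
    return target
```

Returning the names of the rejected reads, rather than silently dropping them, makes it easy to re-check those traces by hand; the method as described simply does not output them.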
Step four: output the sequences with the vector removed, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. As can be seen from FIGS. 1-3, the Perl-based method of the present invention removes vector sequences effectively in batches, and the result can be used directly for downstream bioinformatics analysis.
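The text does not spell out how the output is adjusted for orientation. One plausible reading, sketched below in Python, is that a read sequenced from the opposite strand carries the two flanks as reverse complements, so the read is reverse-complemented before extraction and every target sequence is reported in the same orientation. This reading, the function names, and the use of complete flank matches are all assumptions made for the example.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_insert(read: str, left_flank: str, right_flank: str) -> Optional[Tuple[str, str]]:
    """Try the read as given, then its reverse complement.

    Returns (target_sequence, strand), where strand is "+" if the flanks
    were found in the read as given and "-" if they were found only after
    reverse-complementing it. Returns None if neither orientation matches.
    """
    for strand, candidate in (("+", read), ("-", reverse_complement(read))):
        start = candidate.find(left_flank)
        if start == -1:
            continue
        start += len(left_flank)
        end = candidate.find(right_flank, start)
        if end != -1:
            return candidate[start:end], strand
    return None
```

With this approach a forward read and a reverse read of the same clone yield the same target sequence, which simplifies any downstream comparison.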
In practical applications, the toolkit used by the method of the present invention contains one Perl script, named as follows: delete-vectoreq.
The code of the invention is written in Perl and can be used on various Unix-like systems such as Linux, macOS, and Ubuntu.
Claims (2)
1. A method for quickly removing vector sequences in batches based on the Perl language, characterized by comprising the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of the sequencing peak map files;
Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site;
the sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; when they match completely or over more than 7 bases, the sequence between them is taken by default as the target sequence and extracted;
Step three: filter out unqualified sequencing results;
the sequences at the two ends of the vector insertion site, as set in the script, are compared against the sequences in the sequencing result; any sequence in which they match neither completely nor over more than 7 bases is treated by default as unqualified and is not output;
Step four: output the sequences with the vector removed, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site.
2. The method for quickly removing vector sequences in batches based on the Perl language as claimed in claim 1, wherein the format conversion comprises:
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted file in .fa format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010485310.2A | 2020-06-01 | 2020-06-01 | Method for quickly removing carrier sequences in batches based on perl language |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010485310.2A | 2020-06-01 | 2020-06-01 | Method for quickly removing carrier sequences in batches based on perl language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111816254A (en) | 2020-10-23 |
Family ID: 72848204
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202010485310.2A | Pending | 2020-06-01 | 2020-06-01 | Method for quickly removing carrier sequences in batches based on perl language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111816254A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113380323A (en) * | 2021-07-19 | 2021-09-10 | 浙江迪谱诊断技术有限公司 | Sanger sequencing peak image interception identification method and system, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101149743A (en) * | 2007-11-09 | 2008-03-26 | 中国水产科学研究院黑龙江水产研究所 | DNA sequencing polluted sequence batch treating tool |
CN110534157A (en) * | 2019-07-26 | 2019-12-03 | 江苏省农业科学院 | A kind of batch extracting genomic gene information simultaneously translates the method for comparing analytical sequence |
Non-Patent Citations (3)
Title |
---|
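Amélie Bardil et al.: "Genomic expression dominance in the natural allopolyploid Coffea arabica is massively affected by growth temperature", New Phytologist, vol. 192, page 760 * |
Liu Huabo: "Development of SSR primers for Siberian apricot and genetic diversity of the Yanshan populations", China Master's Theses Full-text Database, no. 9, pages 048-10 * |
Zhou Meng et al.: "Construction and application of a Linux-based EST sequence analysis system for forest trees", 生物信息学 (Bioinformatics), no. 2, pages 74-77 * |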
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107180166B (en) | Third-generation sequencing-based whole genome structural variation analysis method and system | |
CN108595915B (en) | Third-generation data correction method based on DNA variation detection | |
CN113963749B (en) | High-throughput sequencing data automatic assembly method, system, equipment and storage medium | |
CN113488106B (en) | Method for rapidly acquiring target genome region comparison result data | |
CN111816254A (en) | Method for quickly removing carrier sequences in batches based on perl language | |
CN113571131A (en) | Pangenome construction method and corresponding structural variation mining method | |
CN109658981B (en) | Data classification method for single cell sequencing | |
CN108388772B (en) | Method for analyzing high-throughput sequencing gene expression level by text comparison | |
CN102841988B (en) | A kind of system and method that nucleic acid sequence information is mated | |
CN111292806B (en) | Transcriptome analysis method by using nanopore sequencing | |
CN110570901B (en) | Method and system for SSR typing based on sequencing data | |
CN117238376A (en) | Virus vector sequence analysis system and method based on second-generation sequencing technology | |
CN113416770B (en) | Chromosome structure variation breakpoint positioning method and device | |
Kasukurthi et al. | SURFr: Algorithm for identification and analysis of ncRNA-derived RNAs | |
Pavlovich et al. | Sequences to Differences in Gene Expression: Analysis of RNA-Seq Data | |
CN116343921A (en) | Automatic DNA sequence verification method and system | |
Gudodagi et al. | Investigations and Compression of Genomic Data | |
CN107403076B (en) | Method and apparatus for treating DNA sequence | |
Quan et al. | SALT: a fast, memory-efficient and snp-aware short read alignment tool | |
CN112309500B (en) | Unique fragment sequence capturing method based on single cell sequencing data | |
CN115331736B (en) | Splicing method for extending high-throughput sequencing genes based on text matching | |
CN111883212B (en) | Construction method and construction device of DNA fingerprint spectrum and terminal equipment | |
CN112802554B (en) | Animal mitochondrial genome assembly method based on second-generation data | |
CN114171121B (en) | Quick detection method for mRNA 5'3' terminal difference | |
Ochieng et al. | Tandem repeats analysis in DNA sequences based on improved Burrows-Wheeler transform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||