CN111816254A - Method for quickly removing carrier sequences in batches based on perl language - Google Patents

Publication number
CN111816254A
CN111816254A CN202010485310.2A CN202010485310A CN111816254A CN 111816254 A CN111816254 A CN 111816254A CN 202010485310 A CN202010485310 A CN 202010485310A CN 111816254 A CN111816254 A CN 111816254A
Authority
CN
China
Prior art keywords
sequences
sequence
matching
vector
insertion site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010485310.2A
Other languages
Chinese (zh)
Inventor
辛文斌
朱月艳
孙子奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Personal Biotechnology Co ltd
Original Assignee
Shanghai Personal Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-06-01
Filing date
2020-06-01
Publication date
2020-10-23
Application filed by Shanghai Personal Biotechnology Co ltd
Priority to CN202010485310.2A
Publication of CN111816254A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a Perl-based method for quickly removing vector sequences in batches, comprising the following steps. Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map file. Step two: remove the vector sequence specifically by locating the sequences at the two ends of the vector insertion site; the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results. Step three: filter out unqualified sequencing results; when the sequences at the two ends of the vector insertion site set in the script neither match a sequencing result completely nor match it over more than 7 bases, that result is treated by default as an unqualified sequence and is not output. Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. The method can process many sequences at once and places no limit on the amount of data handled in a single run.

Description

Method for quickly removing carrier sequences in batches based on perl language
Technical Field
The invention relates to the field of gene detection, and in particular to a Perl-based method for quickly removing vector sequences in batches.
Background
The Sanger method, the representative first-generation sequencing technique, obtains a readable DNA base sequence by starting synthesis at a fixed point and terminating it randomly at specific bases, with a fluorescent label at each terminating base. This produces four sets of nucleotide fragments of different lengths ending in A, T, C, and G, which are then detected by electrophoresis on a urea-denaturing PAGE gel. The main characteristics of Sanger sequencing are a read length of up to 1000 bp and an accuracy as high as 99.999%, and it is one of the mainstream sequencing methods in China. For some complex DNA templates or synthesized gene sequences, however, it is difficult to obtain the target sequence by direct Sanger sequencing. In such cases the cloning techniques of molecular biology are used: DNA obtained from a virus, a plasmid, or the cells of a higher organism usually serves as the cloning vector, and a foreign DNA fragment of suitable size is inserted into it, taking care not to damage the vector's ability to replicate.
The recombinant vector is introduced into a host cell, propagated there in large quantity, and then analyzed for biological significance. The vector sequence is not part of that analysis, so it must first be removed. Existing methods remove the vector sequence manually, which is time-consuming, laborious, error-prone, and very inefficient. At present, no software exists for removing vector sequences from multiple sequencing results at one time.
Disclosure of Invention
In order to overcome the above defects of the prior art, the present invention aims to provide a Perl-based method for quickly removing vector sequences in batches.
To achieve this purpose, the invention adopts the following technical scheme:
A Perl-based method for quickly removing vector sequences in batches comprises the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map file;
Step two: remove the vector sequence specifically by locating the sequences at the two ends of the vector insertion site;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; when they match completely, or match over more than 7 bases, the sequence between them is taken by default as the target sequence and is extracted;
Step three: filter out unqualified sequencing results;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; a result with neither a complete match nor a match of more than 7 bases is treated by default as an unqualified sequence and is not output;
Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site.
In a preferred embodiment of the present invention, the format conversion includes:
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted .fa file.
The invention has the beneficial effects that:
The Perl-based method for quickly removing vector sequences in batches can process many sequences at once and places no limit on the amount of data handled in a single run.
Drawings
FIG. 1 is a schematic view of the results before removal of the vector.
FIG. 2 is a schematic diagram of the results before removal of the vector.
FIG. 3 is a graph showing the results of removing the vector by the method of the present invention.
FIG. 4 is a block flow diagram of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which should not be construed as limiting the invention thereto.
Existing analysis is performed in one of two ways:
one method is to locate the sequences at the two ends of the vector insertion site in the sequencing text result manually, by sequence search, and then delete the vector sequence by hand;
the other method uses Chromas software: the sequences at the two ends of the vector insertion site are entered under Options → Vector Sequences, each sequencing result is opened individually, and the vector is removed through Export; this method cannot be used when there is a mismatch in the middle of the sequence.
Both methods require each sequencing result to be processed manually, one at a time, which is extremely tedious.
Compared with the prior art, the invention places no limit on the amount of data processed in a single run.
Example 1
Example 1 specifically includes the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map file;
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted .fa file.
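The script itself is not reproduced in this document. As a rough illustration of the batch input stage, the following sketch loads every record from every .fa file in one folder into memory. It is written in Python purely for illustration, whereas the invention's script is written in Perl. The function name load_all_reads, the file-name prefix added to each record name, and the demo files are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict


def load_all_reads(folder: Path) -> Dict[str, str]:
    """Merge the records of every *.fa file in `folder` into one dict.

    Keys are "<file stem>|<FASTA header>" so that identical headers in
    different files do not overwrite each other (a hypothetical convention).
    """
    reads: Dict[str, str] = {}
    for path in sorted(folder.glob("*.fa")):
        name = None
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith(">"):
                # A header line starts a new record.
                name = f"{path.stem}|{line[1:].strip()}"
                reads[name] = ""
            elif line and name is not None:
                # Sequence lines may be wrapped; join them and normalise case.
                reads[name] += line.upper()
    return reads


# Demo with two small, made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    (demo_folder / "a.fa").write_text(">r1\nacgt\nTT\n")
    (demo_folder / "b.fa").write_text(">r1\nGGCC\n")
    all_reads = load_all_reads(demo_folder)

print(all_reads)
```

Because the loop simply walks over whatever files are present, the number of sequencing results handled in one run is bounded only by available memory in this sketch.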
Step two: remove the vector sequence specifically by locating the sequences at the two ends of the vector insertion site;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; when they match completely, or match over more than 7 bases, the sequence between them is taken by default as the target sequence and is extracted.
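The matching rule of step two can be sketched as follows, under stated assumptions. The sketch is in Python for illustration only and is not the invention's Perl script. It reads "more than 7 bases" as at least 8 contiguous bases, and it assumes that a partial match must use the bases of each end sequence nearest the insertion site; the text does not specify either point. The names LEFT, RIGHT, MIN_MATCH, and extract_insert, and the example sequences, are hypothetical.

```python
from typing import Optional

MIN_MATCH = 8  # assumption: "more than 7 bases" means at least 8 contiguous bases

# Hypothetical sequences at the two ends of a vector insertion site.
LEFT = "GGATCCGAATTC"
RIGHT = "AAGCTTCTCGAG"


def find_left_end(read: str, left_flank: str) -> Optional[int]:
    """Index just past the left end sequence in `read`, or None if not found."""
    # Try the complete flank first, then ever shorter suffixes (the bases
    # closest to the insertion site), down to MIN_MATCH bases.
    for k in range(len(left_flank), MIN_MATCH - 1, -1):
        pos = read.find(left_flank[-k:])
        if pos != -1:
            return pos + k
    return None


def find_right_start(read: str, right_flank: str, start: int) -> Optional[int]:
    """Index where the right end sequence begins at or after `start`, or None."""
    # Same idea, using prefixes of the right flank.
    for k in range(len(right_flank), MIN_MATCH - 1, -1):
        pos = read.find(right_flank[:k], start)
        if pos != -1:
            return pos
    return None


def extract_insert(read: str, left_flank: str, right_flank: str) -> Optional[str]:
    """Return the sequence between the two end sequences, or None."""
    left_end = find_left_end(read, left_flank)
    if left_end is None:
        return None
    right_start = find_right_start(read, right_flank, left_end)
    if right_start is None:
        return None
    return read[left_end:right_start]


# Complete match of both end sequences.
full_read = "TTTT" + LEFT + "ATGGCCTGA" + RIGHT + "CCCC"
# Partial match: only 8 bases of the left end and 10 of the right end survive.
short_read = "CCGAATTC" + "ATGGCCTGA" + "AAGCTTCTCG"

print(extract_insert(full_read, LEFT, RIGHT))
print(extract_insert(short_read, LEFT, RIGHT))
```

Exact substring search is used here for simplicity; a production script might tolerate mismatches within the end sequences, which this sketch does not attempt.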
Step three: filter out unqualified sequencing results;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; a result with neither a complete match nor a match of more than 7 bases is treated by default as an unqualified sequence and is not output.
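A minimal sketch of the filtering rule of step three follows, again in Python for illustration and not taken from the invention's Perl script. It assumes that "more than 7 bases" means at least 8 contiguous bases nearest the insertion site, and that the left end sequence must occur before the right one. Since a complete match always contains that 8-base core, testing the core alone is enough to decide whether a read qualifies. The names is_qualified and split_qualified and all sequences are hypothetical.

```python
from typing import Dict, List, Tuple

MIN_MATCH = 8  # assumption: "more than 7 bases" means at least 8 contiguous bases


def is_qualified(read: str, left_flank: str, right_flank: str) -> bool:
    """True if both end sequences are matched (fully or by their 8-base core)."""
    left_core = left_flank[-MIN_MATCH:]   # bases nearest the insertion site
    right_core = right_flank[:MIN_MATCH]
    left_pos = read.find(left_core)
    if left_pos == -1:
        return False
    # The right core must appear after the end of the left core.
    return read.find(right_core, left_pos + len(left_core)) != -1


def split_qualified(
    reads: Dict[str, str], left_flank: str, right_flank: str
) -> Tuple[Dict[str, str], List[str]]:
    """Separate reads into those kept for output and the names of those dropped."""
    qualified: Dict[str, str] = {}
    rejected: List[str] = []
    for name, seq in reads.items():
        if is_qualified(seq, left_flank, right_flank):
            qualified[name] = seq
        else:
            rejected.append(name)
    return qualified, rejected


# Hypothetical end sequences and reads.
left, right = "GGATCCGAATTC", "AAGCTTCTCGAG"
demo_reads = {
    "ok": "TTTT" + left + "ATGGCCTGA" + right + "CCCC",
    "no_right": "TTTT" + left + "ATGGCCTGA",
    "junk": "ACGTACGTACGT",
}
kept, dropped = split_qualified(demo_reads, left, right)
print(list(kept), dropped)
```

Returning the names of rejected reads, rather than discarding them silently, is a design choice of this sketch; the text only states that unqualified sequences are not output.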
Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. As can be seen from FIGS. 1-3, the Perl-based method for quickly removing vector sequences in batches removes the vector effectively, and the results can be used directly for downstream bioinformatics analysis.
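The text does not spell out how the output is adjusted for orientation. One plausible reading, offered here only as an assumption, is that a read sequenced from the opposite strand carries the end sequences as reverse complements, so it is reverse-complemented before extraction. The Python sketch below illustrates that reading; it is not the invention's Perl script, and the names reverse_complement and orient_read and all sequences are hypothetical.

```python
from typing import Optional

# Translation table for complementing bases; N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_read(
    read: str, left_flank: str, right_flank: str, min_match: int = 8
) -> Optional[str]:
    """Return the read in the orientation where both end sequences are found.

    The read is returned unchanged if the cores of both end sequences occur
    in it, reverse-complemented if they occur only on the opposite strand,
    and None if neither orientation matches.
    """
    left_core = left_flank[-min_match:]
    right_core = right_flank[:min_match]
    if left_core in read and right_core in read:
        return read
    flipped = reverse_complement(read)
    if left_core in flipped and right_core in flipped:
        return flipped
    return None


# Hypothetical end sequences and reads.
left, right = "GGATCCGAATTC", "AAGCTTCTCGAG"
forward = "TT" + left + "ATGGCCTGA" + right + "CC"
reverse = reverse_complement(forward)

print(orient_read(forward, left, right) == forward)
print(orient_read(reverse, left, right) == forward)
```

After this normalisation every qualifying read is in the same orientation, so the extracted target sequences are directly comparable in downstream analysis.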
In practical applications, the toolkit used by the method of the present invention contains one Perl script, named: delete-vectoreq.
The code of the invention is written in Perl and can be used on various Unix-like systems such as Linux, macOS, and Ubuntu.

Claims (2)

1. A Perl-based method for quickly removing vector sequences in batches, characterized by comprising the following steps:
Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map file;
Step two: remove the vector sequence specifically by locating the sequences at the two ends of the vector insertion site;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; when they match completely, or match over more than 7 bases, the sequence between them is taken by default as the target sequence and is extracted;
Step three: filter out unqualified sequencing results;
the sequences at the two ends of the vector insertion site, as set in the script, are compared with the sequences in the sequencing results; a result with neither a complete match nor a match of more than 7 bases is treated by default as an unqualified sequence and is not output;
Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site.
2. The Perl-based method for quickly removing vector sequences in batches as claimed in claim 1, wherein the format conversion comprises:
File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted .fa file.
CN202010485310.2A 2020-06-01 2020-06-01 Method for quickly removing carrier sequences in batches based on perl language Pending CN111816254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010485310.2A CN111816254A (en) 2020-06-01 2020-06-01 Method for quickly removing carrier sequences in batches based on perl language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010485310.2A CN111816254A (en) 2020-06-01 2020-06-01 Method for quickly removing carrier sequences in batches based on perl language

Publications (1)

Publication Number Publication Date
CN111816254A (en) 2020-10-23

Family

ID=72848204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010485310.2A Pending CN111816254A (en) 2020-06-01 2020-06-01 Method for quickly removing carrier sequences in batches based on perl language

Country Status (1)

Country Link
CN (1) CN111816254A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380323A (en) * 2021-07-19 2021-09-10 浙江迪谱诊断技术有限公司 Sanger sequencing peak image interception identification method and system, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101149743A (en) * 2007-11-09 2008-03-26 中国水产科学研究院黑龙江水产研究所 DNA sequencing polluted sequence batch treating tool
CN110534157A (en) * 2019-07-26 2019-12-03 江苏省农业科学院 A kind of batch extracting genomic gene information simultaneously translates the method for comparing analytical sequence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AMÉLIE BARDIL et al.: "Genomic expression dominance in the natural allopolyploid Coffea arabica is massively affected by growth temperature", NEW PHYTOLOGIST, vol. 192, pages 760 *
刘华波: "西伯利亚杏SSR引物开发及燕山群体遗传多样性研究", 中国优秀硕士学位论文全文数据库, no. 9, pages 048 - 10 *
周猛等: "基于Linux平台的林木EST序列分析系统的构建及应用", 生物信息学, no. 2, pages 74 - 77 *

Similar Documents

Publication Publication Date Title
CN107180166B (en) Third-generation sequencing-based whole genome structural variation analysis method and system
CN108595915B (en) Third-generation data correction method based on DNA variation detection
CN113963749B (en) High-throughput sequencing data automatic assembly method, system, equipment and storage medium
CN113488106B (en) Method for rapidly acquiring target genome region comparison result data
CN111816254A (en) Method for quickly removing carrier sequences in batches based on perl language
CN113571131A (en) Pangenome construction method and corresponding structural variation mining method
CN109658981B (en) Data classification method for single cell sequencing
CN108388772B (en) Method for analyzing high-throughput sequencing gene expression level by text comparison
CN102841988B (en) A kind of system and method that nucleic acid sequence information is mated
CN111292806B (en) Transcriptome analysis method by using nanopore sequencing
CN110570901B (en) Method and system for SSR typing based on sequencing data
CN117238376A (en) Virus vector sequence analysis system and method based on second-generation sequencing technology
CN113416770B (en) Chromosome structure variation breakpoint positioning method and device
Kasukurthi et al. SURFr: Algorithm for identification and analysis of ncRNA-derived RNAs
Pavlovich et al. Sequences to Differences in Gene Expression: Analysis of RNA-Seq Data
CN116343921A (en) Automatic DNA sequence verification method and system
Gudodagi et al. Investigations and Compression of Genomic Data
CN107403076B (en) Method and apparatus for treating DNA sequence
Quan et al. SALT: a fast, memory-efficient and snp-aware short read alignment tool
CN112309500B (en) Unique fragment sequence capturing method based on single cell sequencing data
CN115331736B (en) Splicing method for extending high-throughput sequencing genes based on text matching
CN111883212B (en) Construction method and construction device of DNA fingerprint spectrum and terminal equipment
CN112802554B (en) Animal mitochondrial genome assembly method based on second-generation data
CN114171121B (en) Quick detection method for mRNA 5'3' terminal difference
Ochieng et al. Tandem repeats analysis in DNA sequences based on improved Burrows-Wheeler transform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination