CN111816254A

CN111816254A - Method for rapid batch removal of vector sequences based on the Perl language

Info

Publication number: CN111816254A
Application number: CN202010485310.2A
Authority: CN
Inventors: 辛文斌; 朱月艳; 孙子奎
Original assignee: Shanghai Personal Biotechnology Co ltd
Current assignee: Shanghai Personal Biotechnology Co ltd
Priority date: 2020-06-01
Filing date: 2020-06-01
Publication date: 2020-10-23

Abstract

The invention discloses a method for rapid batch removal of vector sequences based on the Perl language, comprising the following steps. Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map (chromatogram) file. Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site; the sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result. Step three: filter out unqualified sequencing results; the sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result, and a result that neither matches completely nor matches over more than 7 bases is treated by default as an unqualified sequence and is not output. Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. The method can remove the vector from many sequences in one run, with no limit on the amount of data processed at one time.

Description

Method for rapid batch removal of vector sequences based on the Perl language

Technical Field

The invention relates to the field of gene detection, and in particular to a method for rapid batch removal of vector sequences based on the Perl language.

Background

The Sanger method, the representative first-generation sequencing technique, obtains a readable DNA base sequence by starting synthesis at a fixed point, terminating randomly at a specific base, and fluorescently labeling each terminating base. This produces four sets of nucleotide fragments of different lengths ending in A, T, C, and G, which are then detected by electrophoresis on a urea-denaturing PAGE gel. The main features of Sanger sequencing are a read length of up to 1000 bp and an accuracy as high as 99.999 percent, and it is one of the mainstream sequencing methods in China. However, for some complex DNA templates or synthesized gene sequences, it is difficult to obtain the target sequence by direct Sanger sequencing. In such cases, cloning techniques from molecular biology are used: DNA obtained from a virus, a plasmid, or the cells of a higher organism usually serves as the cloning vector, and a foreign DNA fragment of suitable size is inserted into it, taking care not to damage the vector's ability to replicate. The recombinant vector is introduced into a host cell, propagated there in large quantity, and then sequenced and analyzed for its biological significance. The vector sequence itself is not part of that analysis and must be removed first. Existing methods remove the vector sequence manually, which is time-consuming, laborious, error-prone, and extremely inefficient. At present, no software exists for removing vector sequences from multiple sequencing results at one time.

Disclosure of Invention

To overcome the above shortcomings of the prior art, the invention aims to provide a method for rapid batch removal of vector sequences based on the Perl language.

To achieve this purpose, the following technical scheme is adopted:

A method for rapid batch removal of vector sequences based on the Perl language comprises the following steps:

Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map (chromatogram) file;

Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site;

the sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result; when they match completely or match over more than 7 bases, the sequence between them is taken by default as the target sequence and is extracted;

Step three: filter out unqualified sequencing results;

the sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result; a result that neither matches completely nor matches over more than 7 bases is treated by default as an unqualified sequence and is not output;

Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site.
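Steps two and three can be illustrated with a minimal sketch. The patent does not disclose its Perl script, so the example below is written in Python purely for illustration, and everything in it is an assumption: the function names, the sample flanking sequences, and above all the reading of "matching more than 7 bases" as "the part of the flank nearest the insertion site, at least 8 bases long, occurs exactly in the read".

```python
MIN_MATCH = 8  # assumption: "more than 7 bases" is read as "at least 8"


def find_left_end(read, left_flank, min_match=MIN_MATCH):
    """Return the index just past the left flank, or None if it is not found.

    The full flank is tried first, then ever shorter suffixes of it
    (the bases closest to the insertion site), down to min_match bases.
    """
    for start in range(0, len(left_flank) - min_match + 1):
        piece = left_flank[start:]
        pos = read.find(piece)
        if pos != -1:
            return pos + len(piece)
    return None


def find_right_start(read, right_flank, begin=0, min_match=MIN_MATCH):
    """Return the index where the right flank starts, or None if not found.

    The full flank is tried first, then ever shorter prefixes of it.
    """
    for end in range(len(right_flank), min_match - 1, -1):
        pos = read.find(right_flank[:end], begin)
        if pos != -1:
            return pos
    return None


def extract_insert(read, left_flank, right_flank):
    """Return the sequence between the two flanks, or None for an
    unqualified read (step three: such reads produce no output)."""
    read = read.upper()
    left_end = find_left_end(read, left_flank.upper())
    if left_end is None:
        return None
    right_start = find_right_start(read, right_flank.upper(), left_end)
    if right_start is None:
        return None
    return read[left_end:right_start]
```

A caller would pass each read together with the two flanking sequences configured for the vector in use, keep the returned insert, and skip any read for which the function returns `None`. Exact substring search is used here for simplicity; a script that tolerates mismatches inside the flank would need a different matching routine.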

In a preferred embodiment of the present invention, the format conversion includes:

File → Batch Processing → Batch Export → Save as type → Fasta → Export, which finally yields the converted .fa file.
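The converted file is in FASTA format, so a batch script first has to split it into individual records. The sketch below shows one way to do that in Python (chosen for illustration only; the patent's script is Perl and is not disclosed). The function name `read_fasta` is hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Header lines start with ">"; sequence lines that follow a header
    are concatenated until the next header or the end of input.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Because the function is a generator, records are handed on one at a time, so an open file handle can be passed in directly and the size of the input file does not have to fit in memory.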

The beneficial effects of the invention are as follows:

The method for rapid batch removal of vector sequences based on the Perl language can remove the vector from many sequences in one run, with no limit on the amount of data processed at one time.

Drawings

FIG. 1 is a schematic view of the results before removal of the vector.

FIG. 2 is a schematic diagram of the results before removal of the vector.

FIG. 3 is a graph showing the results after removal of the vector by the method of the present invention.

FIG. 4 is a block flow diagram of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which should not be construed as limiting the invention thereto.

Existing analysis is carried out in one of two ways:

One is to locate the sequences at the two ends of the vector insertion site manually in the sequencing text result by sequence search, and then delete the vector sequence by hand;

The other is to use Chromas software: enter the sequences at the two ends of the vector insertion site under Options → Vector Sequences, open the sequencing results one at a time, and remove the vector on Export. This approach cannot be used when there is a mismatch in the middle.

Both methods require each sequencing result to be processed manually, one at a time, which is extremely tedious.

Compared with the prior art, the invention places no limit on the amount of data processed in a single run.
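The contrast with one-at-a-time processing can be sketched as a loop over a directory of result files. This is an illustrative Python sketch, not the patent's Perl script. The `.seq` file pattern, the function name `process_directory`, and the idea of passing the trimming routine in as a `trim` callable are all assumptions made for the example.

```python
from pathlib import Path


def process_directory(in_dir, out_path, trim, pattern="*.seq"):
    """Trim every plain-text read in in_dir and write the qualified
    inserts to out_path in FASTA format.

    trim is any function that takes a read and returns the insert,
    or None for an unqualified read. Returns (kept, skipped) counts.
    """
    kept = skipped = 0
    with open(out_path, "w") as out:
        for path in sorted(Path(in_dir).glob(pattern)):
            # Join wrapped lines and drop whitespace to get one string.
            read = "".join(path.read_text().split()).upper()
            insert = trim(read)
            if insert is None:
                skipped += 1  # unqualified read: nothing is written
                continue
            out.write(f">{path.stem}\n{insert}\n")
            kept += 1
    return kept, skipped
```

Files are handled one after another and each result is written immediately, so the number of files in the directory is bounded only by disk space, not by the script.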

Example 1

Example 1 specifically includes the following steps:

Step one: use a sequencing result file in text format, or use Chromas software to convert the format of a sequencing peak map (chromatogram) file;

Step two: specifically remove the vector sequence by locating and matching the sequences at the two ends of the vector insertion site. The sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result; when they match completely or match over more than 7 bases, the sequence between them is taken by default as the target sequence and is extracted.

Step three: filter out unqualified sequencing results;

the sequences at the two ends of the vector insertion site set in the script are compared and matched against the sequences in the sequencing result; a result that neither matches completely nor matches over more than 7 bases is treated by default as an unqualified sequence and is not output.

Step four: output the vector-free sequences, adjusting the output according to the orientation of the sequences at the two ends of the vector insertion site. As can be seen from FIGS. 1-3, the method of the present invention removes the vector sequences effectively, and the results can be used directly for downstream bioinformatics analysis.
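One plausible reading of the orientation adjustment in step four is that a read sequenced from the opposite strand carries the flanks as reverse complements, and the extracted insert is then reported in the forward orientation. The Python sketch below illustrates that reading only. It is not taken from the patent's Perl script, the function names are hypothetical, and exact flank matching is used to keep the example short.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def insert_in_forward_orientation(read, left_flank, right_flank):
    """Return the insert oriented left flank -> right flank, or None.

    The read is tried as given and then as its reverse complement, so
    a read from either strand yields the same output orientation.
    """
    for candidate in (read, reverse_complement(read)):
        start = candidate.find(left_flank)
        if start == -1:
            continue
        start += len(left_flank)
        end = candidate.find(right_flank, start)
        if end != -1:
            return candidate[start:end]
    return None
```

This only works as intended when the two flanks are not reverse complements of each other; otherwise both orientations match and the strand cannot be told apart from the flanks alone.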

In practical applications, the toolkit used by the method of the present invention contains one Perl script, named as follows: delete-vectoreq.

The code of the invention is written in Perl and can be used on various Unix-like systems such as Linux, macOS, and Ubuntu.

Claims

1. A method for rapid batch removal of vector sequences based on the Perl language, characterized by comprising the following steps:

step three: filtering out unqualified sequencing results;

2. The method for rapid batch removal of vector sequences based on the Perl language as claimed in claim 1, wherein the format conversion comprises: