CN113593645A

CN113593645A - cDNA library gene sequence frame shift judgment method

Info

Publication number: CN113593645A
Application number: CN202110878793.7A
Authority: CN
Inventors: 张萍萍; 公光业; 肖云平; 李晖; 林博; 殷昊; 赵仕兰
Original assignee: Shanghai Oe Biotech Co ltd
Current assignee: Shanghai Oe Biotech Co ltd
Priority date: 2021-08-02
Filing date: 2021-08-02
Publication date: 2021-11-02

Abstract

The invention provides a method for judging frameshifts in cDNA library gene sequences, belonging to the technical field of gene analysis. The method obtains, in batches, the target sequences that match the cDNA sequences to be compared, and detects from the alignment position number whether each cDNA is frameshifted in the vector. The analysis is therefore not limited by maintenance of the source database, network speed and similar factors, and the efficiency of gene alignment analysis is greatly improved.

Description

cDNA library gene sequence frame shift judgment method

Technical Field

The invention relates to the technical field of gene analysis, in particular to a method for judging cDNA library gene sequence frameshifting.

Background

A cDNA library is a type of gene library. It is the collection of clones obtained when all the mRNA transcribed by an organism at a given developmental stage is reverse-transcribed into cDNA fragments, which are ligated into a vector and transferred into recipient cells. Unlike genomic DNA, which contains introns and is therefore difficult to express correctly, cDNA is convenient for cloning and mass amplification; a desired target gene can be screened from a cDNA library and used directly for expression and transgenic research. The construction and screening of cDNA libraries has become an important method in functional genomics and is one of the basic tools for discovering new genes and studying gene function.

Generally, cDNA library construction exploits the poly(A) tail at the 3' end of mRNA: an oligo(dT) primer is used for reverse transcription from the 3' end toward the 5' end to obtain single-stranded cDNA (sscDNA), which is then converted to double-stranded cDNA, ligated to linkers and inserted into the corresponding vector. Because the position at which reverse transcription terminates cannot be precisely controlled, there is no guarantee that the cDNA ligated into the vector encodes the protein in the correct reading frame. Consequently, in gene screening experiments that use a cDNA library (such as yeast two-hybrid cDNA library screening and subtractive library screening), it is not possible to determine directly, once a positive clone has been obtained, whether its cDNA correctly encodes the protein in the vector. The cDNA sequence must first be confirmed by first-generation sequencing or a similar method, and the sequencing results must then be analyzed one by one to determine whether the inserted sequence is frameshifted. This process is cumbersome, labor-intensive and error-prone.

Disclosure of Invention

The invention aims to provide a cDNA library gene sequence frameshift judgment method, which is simple in process and can greatly improve the working efficiency.

In order to achieve the above object, the present invention provides the following technical solutions:

the invention provides a cDNA library gene sequence frameshift judgment method, which comprises the following steps:

1) converting the cDNA sequence to be compared into Fasta format, identifying and removing the linker sequence, and extracting the cDNA sequence to obtain an input sequence;

2) constructing a local database comprising candidate proteins;

3) comparing the input sequence in the step 1) with the local database in the step 2) by using blastx, and taking a gene with the highest matching rate as a target sequence;

4) obtaining the frame shift condition of the cDNA according to the position comparison information of the input sequence in the step 1) and the target sequence in the step 3);

the frame shift of the cDNA comprises:

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus one, the input sequence is not frameshifted;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus two, the input sequence is frameshifted by one position;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three, the input sequence is frameshifted by two positions;

there is no chronological restriction between step 1) and step 2).

Preferably, the step 4) further comprises determining the matching degree of the input sequence and the target sequence, wherein the determining the matching degree of the input sequence and the target sequence comprises comparing the starting and ending positions of the input sequence, the starting and ending positions of the target sequence, gap and mismatch information;

the criteria for judging the matching degree are as follows: when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the input sequence is judged to match the target sequence completely;

when the start position of the input sequence is 1 and its end position is less than the total sequence length, and the start position of the target sequence is 1 and its end position is less than the total sequence length, with 0 mismatches and 0 gaps, the 5' end of the input sequence is judged to match the 5' end of the target sequence completely;

when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is not 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the 3' end of the input sequence is judged to match the 3' end of the target sequence completely;

when the start position of the input sequence is not 1 and its end position is less than the total sequence length, and the start position of the target sequence is not 1 and its end position is less than the total sequence length, with N mismatches and N gaps (N being an integer greater than or equal to 0), the input sequence and the target sequence are judged to match incompletely.

Preferably, in step 3), the threshold value of the alignment is 1e-5 or 1e-10.

Preferably, in step 1), the software for converting the cDNA sequences to be aligned into Fasta format includes sequence processing software seqtk.

Preferably, in step 1), the software used for identifying and removing the linker sequence includes a substr function of awk.

Preferably, in step 1), before converting the cDNA sequences to be aligned into Fasta format, the method further comprises converting the cDNA sequences to be aligned into a single-line format (one line per sequence) for display.

The invention provides a cDNA library gene sequence frameshift judgment method, which comprises the following steps: converting the cDNA sequence to be compared into Fasta format, identifying and removing the linker sequence, and extracting the cDNA sequence to obtain an input sequence; constructing a local database comprising candidate proteins; comparing the input sequence with the local database, and taking the gene with the highest matching rate as the target sequence; and judging the frameshift condition of the cDNA according to the position alignment information of the input sequence and the target sequence. The frameshift condition is judged as follows: when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus one, the input sequence is not frameshifted; when it is a multiple of three plus two, the input sequence is frameshifted by one position; when it is a multiple of three, the input sequence is frameshifted by two positions. The method obtains, in batches, the target sequences that match the cDNA sequences to be compared, and detects from the position number whether each cDNA is frameshifted in the vector. The analysis is therefore not limited by maintenance of the source database, network speed and similar factors, and the efficiency of gene alignment analysis is greatly improved.

Drawings

FIG. 1 shows the results of rapid batch alignment analysis of the best matched gene in cDNA database;

FIG. 2 is a diagram of the overall framework of the method for performing rapid alignment of gene data and analyzing whether a gene is frameshifted according to example 2 of the present invention;

FIG. 3 shows the results of rapid batch alignment analysis of the best matched gene in the database and analysis of the frameshift of cDNA in the library vector.

Detailed Description

The invention converts the cDNA sequence to be compared into Fasta format, identifies and removes the linker sequence, extracts the cDNA sequence and obtains the input sequence.

In the present invention, the software used to convert the cDNA sequences to be aligned to Fasta format preferably comprises the sequence processing software seqtk.

In the present invention, the software used to identify and remove linker sequences preferably includes the substr function of awk.

In the present invention, before converting the cDNA sequences to be aligned into Fasta format, it is preferable to display each cDNA sequence to be aligned on a single line. In the present invention, the software used to convert the cDNA sequences to be aligned into this single-line format includes the sequence processing software seqtk.

In a specific implementation of the invention, the sequence processing software seqtk is used to put every cDNA sequence to be compared on a single line, which prevents a linker that spans a line break from going unmatched, and to convert the seq-format gene sequences to be processed into Fasta format in batches.
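The single-line conversion described above can be illustrated with a short sketch. The source performs this step with seqtk; the Python function below is only a stand-in for illustration, and the name `unwrap_fasta` is hypothetical.

```python
def unwrap_fasta(text: str) -> str:
    """Return FASTA text in which every sequence occupies exactly one line."""
    records = []                      # (header, sequence) pairs in input order
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # ignore blank lines
        if line.startswith(">"):
            if header is not None:    # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)       # sequence line; joined later
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


wrapped = ">clone1\nACGT\nTTGA\n>clone2\nGGCC\n"
print(unwrap_fasta(wrapped))
```

With each sequence on one line, a linker can be searched for with a plain substring or pattern match without having to account for line breaks inside it.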

The method can identify any linker sequence and extract the sequence downstream of the linker in batches. Specifically, the substr function of awk is used to extract, according to the specified linker sequence, everything from the first base after the linker onward; that is, only the linker and the sequence upstream of it are removed, leaving the complete cDNA sequence. The extraction start position is Y = X + L, where X is the position of the first base of the linker sequence and L is the length of the linker sequence.

For example:

1) the linker sequence is ACAAGTTTTGTACAAAAAGTTGGX (SEQ ID NO.1; X is a non-fixed base and may be any one of A, T, C or G), and the length of the linker sequence is 24 (including X);

2) if the linker position (i.e. the position of the first base of the linker within the whole sequence) obtained with awk is 80, the start site of the sequence to be extracted is 80 + 24 = 104. Position 104 of the whole sequence is the first base of the coding sequence (CDS) actually required.
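The position arithmetic Y = X + L can be checked with a minimal sketch. The source uses awk's substr for this step; the Python version below is an illustrative substitute, and the function names and the regular-expression handling of the non-fixed base X are assumptions.

```python
import re

# SEQ ID NO.1 from the text; the trailing X stands for any one of A, T, C or G.
LINKER = "ACAAGTTTTGTACAAAAAGTTGGX"


def linker_regex(linker: str) -> "re.Pattern":
    """Turn the linker into a pattern in which X matches any single base."""
    return re.compile(linker.replace("X", "[ATCG]"))


def extract_insert(sequence: str, linker: str = LINKER):
    """Return (Y, insert), or None if the linker is not found.

    X is the 1-based position of the first linker base, L the linker length,
    and Y = X + L the 1-based position of the first base after the linker.
    """
    match = linker_regex(linker).search(sequence.upper())
    if match is None:
        return None
    x = match.start() + 1            # convert 0-based index to 1-based position
    y = x + len(linker)              # Y = X + L
    return y, sequence[y - 1:]       # everything from position Y onward


# Reproduce the worked example: linker starting at position 80, length 24.
read = "G" * 79 + "ACAAGTTTTGTACAAAAAGTTGGA" + "ATGGCC"
print(extract_insert(read))
```

In this constructed read the linker begins at position 80, so the extracted sequence starts at position 104, in agreement with the example in the text.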

The present invention constructs a local database comprising candidate proteins.

In the present invention, the local database is preferably built with makeblastdb from the BLAST suite. The amino acid sequences of the candidate proteins are selected according to the alignment requirement. The candidate proteins are not particularly limited and may be the proteins of a species, a protein family, or a single protein of interest. The data source of the local database is likewise not particularly limited: it may be a public protein database such as NCBI or UniProt, custom protein sequences, or corrected and modified protein sequences. In a specific implementation, a customized local database can be constructed for accurate comparison.
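As a rough sketch of how such a database might be built, the snippet below only assembles a makeblastdb command line for a protein FASTA file. The file name `candidates.faa` and database name `candidates_db` are hypothetical, and running the command assumes BLAST+ is installed.

```python
import shlex
from typing import List


def makeblastdb_command(protein_fasta: str, db_name: str) -> List[str]:
    """Build the argument list for a protein BLAST database."""
    return [
        "makeblastdb",
        "-in", protein_fasta,   # candidate protein sequences in FASTA format
        "-dbtype", "prot",      # protein database, as required for blastx
        "-out", db_name,        # base name of the database files
    ]


cmd = makeblastdb_command("candidates.faa", "candidates_db")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Keeping the command as an argument list avoids shell quoting problems when file names contain spaces.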

After obtaining the local database and the input sequence, the invention uses blastx to compare the input sequence with the local database, and the output sequence is the target sequence.

In a specific implementation of the invention, the threshold is set to 1e-5 when the species under study is uncommon and few of its genes are recorded in the database, and to 1e-10 when the species is common, its genome and transcriptome have been deeply sequenced, and many of its genes are recorded in the database. A more relaxed threshold yields more alignment results; a more stringent threshold yields fewer. Hits that do not meet the threshold are not shown in the output.
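The threshold choice can be sketched as follows, assuming that the threshold corresponds to the E-value cutoff of blastx (the text lists E_value among the alignment parameters but does not name the option). The function names, file names and the `well_characterized` flag are hypothetical; the command is only assembled, not run.

```python
from typing import List


def evalue_threshold(well_characterized: bool) -> float:
    """1e-10 for common, deeply sequenced species; 1e-5 for uncommon ones."""
    return 1e-10 if well_characterized else 1e-5


def blastx_command(query: str, db: str, out: str,
                   well_characterized: bool) -> List[str]:
    """Build a blastx argument list with tabular output."""
    return [
        "blastx",
        "-query", query,                                  # trimmed cDNA input sequences
        "-db", db,                                        # local protein database
        "-evalue", str(evalue_threshold(well_characterized)),
        "-outfmt", "6",                                   # tab-separated hit table
        "-out", out,
    ]


cmd = blastx_command("inserts.fa", "candidates_db", "hits.tsv",
                     well_characterized=False)
print(" ".join(cmd))
```

A stricter cutoff such as 1e-10 trims weak hits when the database is large, whereas 1e-5 retains more candidates when few genes of the species are recorded.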

In the present invention, the alignment parameters include Identity, Gap, Align_length and E_value. Hits are scored according to these parameters, and the sequence with the highest score is the output sequence, i.e. the target sequence.
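The text does not give the scoring formula, so the sketch below uses an assumed ranking purely for illustration: lower E-value first, then higher identity, longer alignment and fewer gap openings. It parses the default 12-column tabular BLAST output; the function names are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hits(tabular: str) -> List[dict]:
    """Parse tab-separated hit lines into dictionaries with typed values."""
    hits = []
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def rank_key(hit: dict):
    """Assumed ranking (not from the source): smaller tuple means better hit."""
    return (hit["evalue"], -hit["pident"], -hit["length"], hit["gapopen"])


def best_hits(hits: List[dict]) -> Dict[str, dict]:
    """Keep the best-ranked hit for each input sequence."""
    best: Dict[str, dict] = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or rank_key(hit) < rank_key(current):
            best[hit["qseqid"]] = hit
    return best


table = ("clone1\tP1\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-60\t250\n"
         "clone1\tP2\t100.0\t150\t0\t0\t1\t450\t1\t150\t1e-80\t310\n")
print(best_hits(parse_hits(table))["clone1"]["sseqid"])
```

Any other scoring rule can be substituted by changing `rank_key` alone, since selection per input sequence is independent of how hits are ordered.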

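After the target sequence is obtained, the frameshift condition of the cDNA is judged according to the position alignment information of the input sequence and the target sequence.

The frameshift condition of the cDNA is judged as follows:

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus one, the input sequence is not frameshifted;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus two, the input sequence is frameshifted by one position;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three, the input sequence is frameshifted by two positions.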
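The three-way rule reduces to a remainder calculation, sketched below. It assumes that the start position is the 1-based position on the input sequence at which the alignment begins; the function name `frameshift` is hypothetical.

```python
def frameshift(query_start: int) -> int:
    """Return 0, 1 or 2: the number of positions by which the insert is shifted.

    Start positions 1, 4, 7, ... (multiple of three plus one) give 0;
    positions 2, 5, 8, ... (multiple of three plus two) give 1;
    positions 3, 6, 9, ... (multiple of three) give 2.
    """
    if query_start < 1:
        raise ValueError("start position must be 1 or greater")
    return (query_start - 1) % 3


for start in (1, 2, 3, 4):
    print(start, frameshift(start))
```

The returned value uses the same 0, 1 and 2 coding as the frameshift column described for FIG. 3.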

In the present invention, it is preferable to further determine the matching degree between the input sequence and the target sequence according to their position alignment information. The matching degree is determined by comparing the start and end positions of the input sequence, the start and end positions of the target sequence, and the gap and mismatch information, using the following criteria:

when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the input sequence is judged to match the target sequence completely;

when the start position of the input sequence is 1 and its end position is less than the total sequence length, and the start position of the target sequence is 1 and its end position is less than the total sequence length, with 0 mismatches and 0 gaps, the 5' end of the input sequence is judged to match the 5' end of the target sequence completely;

when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is not 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the 3' end of the input sequence is judged to match the 3' end of the target sequence completely;

when the start position of the input sequence is not 1 and its end position is less than the total sequence length, and the start position of the target sequence is not 1 and its end position is less than the total sequence length, with N mismatches and N gaps (N being an integer greater than or equal to 0), the input sequence and the target sequence are judged to match incompletely, and the specific matching rate is output according to a sequence similarity algorithm.
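The four criteria can be expressed as a small classifier. This is a sketch under assumptions: positions and lengths are taken to be in the same coordinate system, any alignment not covered by the first three criteria is treated as incomplete (the text does not enumerate every combination), and the `Alignment` type and label strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    q_start: int      # start position on the input sequence
    q_end: int        # end position on the input sequence
    q_len: int        # total length of the input sequence
    s_start: int      # start position on the target sequence
    s_end: int        # end position on the target sequence
    s_len: int        # total length of the target sequence
    mismatches: int
    gaps: int


def classify_match(a: Alignment) -> str:
    """Classify an alignment as complete, 5' end, 3' end or incomplete."""
    clean = a.mismatches == 0 and a.gaps == 0
    q_full = a.q_start == 1 and a.q_end == a.q_len
    s_full = a.s_start == 1 and a.s_end == a.s_len
    if clean and q_full and s_full:
        return "complete"
    if (clean and a.q_start == 1 and a.q_end < a.q_len
            and a.s_start == 1 and a.s_end < a.s_len):
        return "5' end"
    if clean and q_full and a.s_start != 1 and a.s_end == a.s_len:
        return "3' end"
    return "incomplete"   # assumption: fallback for all remaining cases


print(classify_match(Alignment(1, 300, 300, 1, 300, 300, 0, 0)))
```

An incomplete match would then be passed to whatever similarity calculation is used to report the specific matching rate.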

The technical solution of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

The cDNA sequences were aligned as follows:

(1) uploading the sequence to be compared and the gene database to a host;

(2) converting the format of the sequence to be compared into a Fasta format;

(3) comparing the sequence to be compared with a reference gene or protein sequence in a local database;

(4) outputting the target gene that matches the gene to be compared, according to the matching score between each candidate gene and the gene to be compared. The alignment results are shown in FIG. 1.

Example 2

The cDNA sequences were aligned according to the following procedure, and the flow chart is shown in FIG. 2:

(1) uploading the sequence to be compared and the gene database to a host;

(2) converting the format of the sequence to be compared into a Fasta format;

(3) setting the display mode of the sequences to be compared to one line per sequence;

(4) setting a library adaptor sequence;

(5) locating the adaptor in each sequence to be compared and deleting the adaptor sequence together with the sequence upstream of it;

(6) translating the cDNA sequence from which the library adaptor has been removed into an amino acid sequence, comparing the amino acid sequence against the local protein database, and analyzing the matching rate;

(7) outputting the gene with the highest comparison score as an optimal result, wherein the optimal result is a target gene;

(8) calculating, according to the set frameshift judgment rule, whether the sequence to be compared is frameshifted when expressed in the vector;

(9) outputting the frameshift condition of the cDNA to be compared in the vector, together with the detailed information of the matched target gene.

Comparative example 1

The procedure is consistent with Example 2, differing in the steps of setting the sequence display format, setting the library adaptor sequence, removing the adaptor, and calculating the position of the first matched base between the sequence to be compared and the target gene.

Comparative example 2

On the basis of the rapid batch comparison of Example 1, this example adds the sequence display format setting, the library linker setting and removal of the linker sequence, and the calculation of the position of the first matched base between the sequence to be compared and the target gene, so that whether the cDNA is frameshifted in the vector is analyzed at the same time. For a different cDNA library vector, the same effect is achieved simply by replacing the linker sequence in the script, which is simple and quick. The comparison result is shown in FIG. 3, where the frameshift column (highlighted in yellow) shows the frameshift determination. A value of 0 in the frameshift column indicates that the cDNA is not frameshifted in the vector, i.e. the inserted gene is correctly expressed; a value of 1 or 2 indicates that the cDNA is frameshifted in the vector and needs to be reconstructed.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

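1. A method for judging cDNA library gene sequence frameshift, comprising the following steps:

1) converting the cDNA sequence to be compared into Fasta format, identifying and removing the linker sequence, and extracting the cDNA sequence to obtain an input sequence;

2) constructing a local database comprising candidate proteins;

3) comparing the input sequence in step 1) with the local database in step 2) by using blastx, and taking the gene with the highest matching rate as a target sequence;

4) obtaining the frameshift condition of the cDNA according to the position alignment information of the input sequence in step 1) and the target sequence in step 3);

the frameshift condition of the cDNA comprises:

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus one, the input sequence is not frameshifted;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three plus two, the input sequence is frameshifted by one position;

when the start position of the input sequence in the alignment with the target sequence is a multiple of three, the input sequence is frameshifted by two positions;

there is no chronological restriction between step 1) and step 2).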

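2. The method of claim 1, wherein step 4) further comprises determining the degree of match between the input sequence and the target sequence, wherein the determining the degree of match between the input sequence and the target sequence comprises comparing the start and end positions of the input sequence, the start and end positions of the target sequence, gap, and mismatch information;

the criteria for judging the degree of match comprise: when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the input sequence matches the target sequence completely;

when the start position of the input sequence is 1 and its end position is less than the total sequence length, and the start position of the target sequence is 1 and its end position is less than the total sequence length, with 0 mismatches and 0 gaps, the 5' end of the input sequence matches the 5' end of the target sequence completely;

when the start position of the input sequence is 1 and its end position equals the total sequence length, and the start position of the target sequence is not 1 and its end position equals the total sequence length, with 0 mismatches and 0 gaps, the 3' end of the input sequence matches the 3' end of the target sequence completely;

when the start position of the input sequence is not 1 and its end position is less than the total sequence length, and the start position of the target sequence is not 1 and its end position is less than the total sequence length, with N mismatches and N gaps (N being an integer greater than or equal to 0), the input sequence and the target sequence match incompletely.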

3. The method of claim 1, wherein in step 3), the threshold value of the alignment is 1e-5 or 1e-10.

4. The method according to claim 1, wherein in step 1), the software used to convert the cDNA sequences to be aligned into Fasta format comprises the sequence processing software seqtk.

5. The method of claim 1, wherein in step 1), the software used to identify and remove the linker sequence comprises a substr function of awk.

6. The method according to claim 1, wherein step 1) further comprises displaying each cDNA sequence to be aligned on a single line before converting the cDNA sequences to be aligned into Fasta format.