CN112634988A

CN112634988A - Python language-based gene variation detection method and system

Info

Publication number: CN112634988A
Application number: CN202110016893.9A
Authority: CN
Inventors: 吕云云; 李燕平
Original assignee: Neijiang Normal University
Current assignee: Neijiang Normal University
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2021-04-09
Anticipated expiration: 2041-01-07
Also published as: CN112634988B

Abstract

The invention provides a Python language-based gene variation detection method and system. The method comprises the following steps: S1: determining Query protein sequences and genome sequence information of a plurality of species; S2: performing Blast comparison analysis and integrating the results to obtain the comparison region information of each Query protein sequence; S3: creating and initializing a list and a set, comparing the Query protein sequences with the set in turn, and writing them into the list to obtain the optimal comparison result information of each Query protein sequence; S4: removing the sequences in the Query protein sequences that were not compared, to obtain updated Query protein sequences; S5: comparing the updated Query protein sequences with the sequence comparison library of the next species, and repeating steps S2-S4 until the optimal comparison result information of the Query protein sequences in each species is obtained; S6: setting an extraction length base number, and extracting and recording the number information and variation condition of each exon of the Query protein sequences in different species. The method can shorten the comparison time, improve the effective comparison rate and complete the gene variation detection.

Description

Python language-based gene variation detection method and system

Technical Field

The invention belongs to the technical field of gene detection, and particularly relates to a gene variation detection method and system based on Python language.

Background

A genome sequence contains hundreds of millions of base pairs, and biodiversity is reflected not only in phenotypic diversity but also in differences in the order of the base pairs. The phenotypic changes of a species are closely related to the characteristics of its molecular sequences. With the progress of sequencing technology, the genome sequence of a species can be determined within hours, and analyzing the variation characteristics of the genome sequences of different species has become important for revealing the phenotypes and genetic relationships of species. Various sequence alignment methods have been proposed, such as Blast, Exonerate, Genewise, Blat and Fasta, each with a different emphasis.

The Blast-based sequence comparison method can quickly and effectively find the similar parts of a Query sequence and a Target sequence, and automatically screens out the segments whose score is larger than a threshold according to a scoring matrix. However, the Blast comparison result contains all comparison information and is highly redundant, so that a long time is needed to identify and screen the effective information.

Sequence comparison based on Exonerate can accurately obtain the similar parts of a Query sequence and a Target sequence and can predict accurate gene structure characteristics. However, the speed of Exonerate comparison decreases exponentially as the Target sequence becomes longer, so that comparing directly against a whole genome sequence takes too long to be of practical use.

Disclosure of Invention

An object of the present invention is to provide a Python language-based genetic variation detection method capable of detecting genetic variations efficiently and accurately.

In order to achieve the purpose, the technical scheme of the invention is as follows: a gene variation detection method based on Python language comprises the following steps:

S1: determining the Query protein sequences required for comparison and the genome sequence information of a plurality of species to be compared, and constructing a sequence comparison library according to the genome sequence information;

S2: performing Blast comparison analysis on the Query protein sequences and the sequence comparison library of one species, and, according to the comparison analysis results, integrating the regions where each Query protein sequence is compared to adjacent parts of the corresponding target genome in the sequence comparison library, to obtain the comparison region information of each Query protein sequence in the genome in the sequence comparison library;

S3: creating and initializing a list and a set, and comparing each Query protein in the Query protein sequences with the set in turn; when the current Query protein does not exist in the set, writing the current Query protein into the set, recording its comparison score as a first comparison score, and storing the first comparison score and the comparison region information in the list; when the current Query protein already exists in the set, calculating a second comparison score for the current Query protein being compared and, if the first comparison score stored in the list is smaller than the second comparison score, removing all information of the current Query protein from the list and writing the second comparison score and the comparison region information of the current Query protein being compared into the list, and otherwise performing no operation; and continuing until the best comparison result information of each Query protein has been written into the list;

S4: comparing the list obtained in step S3 with the Query protein sequences, and removing the Query proteins that were not compared, to obtain updated Query protein sequences;

S5: comparing the updated Query protein sequences with the sequence comparison library of the next species, and repeating steps S2-S4 until the optimal comparison result information of the Query protein sequences in each species is obtained;

S6: setting an extraction length base number, running the optimal comparison result information of each species and the corresponding Query protein sequences through Exonerate, and extracting and recording the number information and variation condition of each exon of the Query protein sequences in different species according to the Exonerate results.

Further, the comparison region information includes a start position and an end position of the comparison region, a comparison rate, and similarity information in the comparison region.

Further, the first comparison score is the product of the comparison rate and the similarity of the current Query protein stored in the set, and the second comparison score is the product of the comparison rate and the similarity of the current Query protein being compared.

Further, the step of Blast alignment analysis of Query protein sequence and sequence alignment library of a species in the step S2 specifically includes:

checking, by using parallel, whether multithreading is supported; if so, segmenting the data of the Query protein sequences and the sequence comparison library of one species, and performing multithreaded Blast comparison analysis;

otherwise, performing single-thread Blast comparison analysis of the Query protein sequences and the sequence comparison library of the species.
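
As an illustration of the multithreading check and data segmentation described above, the following is a minimal sketch only. It assumes that the Query proteins are already held in memory as (header, sequence) records and uses Python's os.cpu_count() as a stand-in for the parallel-based check; the function names split_fasta_records and plan_blast_jobs are hypothetical and do not appear in the original text. The sketch only plans the jobs and does not invoke Blast.

```python
import os


def split_fasta_records(records, n_chunks):
    """Distribute (header, sequence) records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    # Drop empty chunks, which occur when there are fewer records than chunks.
    return [chunk for chunk in chunks if chunk]


def plan_blast_jobs(records, max_workers=None):
    """Return one list of records per planned Blast job.

    Assumption: the number of usable workers is approximated by
    os.cpu_count(), which may return None when it cannot be determined.
    """
    available = os.cpu_count() or 1
    workers = min(max_workers or available, available)
    workers = max(1, min(workers, len(records)))
    if workers > 1:
        # Multithreading is available: segment the Query data.
        return split_fasta_records(records, workers)
    # Single-thread fallback: one job containing every Query protein.
    return [list(records)]
```

Each returned chunk would then be written to its own file and passed to a separate Blast process; how the jobs are launched is left open here.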

Further, the extraction length base is set to 50000bp in the step S6.

A second object of the invention is to provide a Python language-based gene variation detection system for detecting the variation information in gene comparisons.

In order to achieve the purpose, the technical scheme of the invention is as follows: a Python language-based gene variation detection system comprises:

the data storage module is used for storing a sequence alignment database constructed by the Query protein sequences required by alignment and the genome sequence information of a plurality of species to be aligned;

the Blast comparison module is connected with the data storage module and is used for respectively comparing and analyzing the Query protein sequences and the sequence comparison library of each species through Blast, and respectively comparing each Query protein sequence to the adjacent region of the corresponding target genome in the sequence comparison library according to the comparison and analysis result to integrate so as to obtain the comparison region information of each Query protein sequence in the genome in the sequence comparison library of each species;

the optimization module is connected with the Blast comparison module, is provided with an initialized list and set, and is used for comparing each Query protein in the Query protein sequences with the set in turn; when the current Query protein does not exist in the set, writing the current Query protein into the set, recording its comparison score as a first comparison score, and storing the first comparison score and the comparison region information in the list; when the current Query protein already exists in the set, calculating a second comparison score for the current Query protein being compared and, if the first comparison score stored in the list is smaller than the second comparison score, removing all information of the current Query protein from the list and writing the second comparison score and the comparison region information of the current Query protein being compared into the list, and otherwise performing no operation; continuing until the best comparison result information of each Query protein has been written into the list; and removing from the Query protein sequences the sequences that have no comparison in the finally obtained list, to obtain updated Query protein sequences;

and the Exonerate comparison module is connected with the optimization module and with the data storage module, and is used for setting an extraction length base number, running the optimal comparison result information and the corresponding Query protein sequences through Exonerate, and extracting and recording the number information and variation condition of each exon of the Query protein sequences in different species according to the Exonerate results.

Further, the comparison region information obtained by the Blast comparison module includes a start position and an end position of the comparison region, a comparison rate and similarity information in the comparison region.

Further, the first comparison score of the optimization module is a product of the comparison rate and the similarity of the current Query protein stored in the collection, and the second comparison score is a product of the comparison rate and the similarity of the current Query protein being compared.

Further, the Blast comparison module is also provided with a thread unit for checking whether the Blast comparison module supports multithreading; if so, the data of the Query protein sequences and the sequence comparison library of one species are segmented, and multithreaded Blast comparison analysis is performed.

Furthermore, the extraction length base number is set to 50000bp in the Exonerate comparison module.

Compared with the prior art, the invention has the following advantages:

The invention provides a Python language-based gene variation detection method and system that address the high redundancy of comparison information and the low comparison speed of current gene-to-genome comparison. The method shortens the comparison time and improves the effective comparison rate, while accurately integrating the comparison results. The gene comparison and variation information is integrated into the result and presented reliably in table form, so that genes with specific variations in certain species can be selected easily, which greatly facilitates the study of gene function and subsequent functional verification.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive exercise.

FIG. 1 is a schematic structural diagram of an embodiment of a Python language-based genetic variation detection system according to the present invention;

FIG. 2 is a flowchart of an embodiment of a Python language-based genetic variation detection method according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The examples are given to better illustrate the invention, but the invention is not limited to the examples. Insubstantial modifications and adaptations made to the embodiments by those skilled in the art in light of the above teachings therefore remain within the scope of the invention.

Example 1

Referring to fig. 1, a schematic structural diagram of a Python language-based genetic variation detection system according to the present invention is shown, specifically, the system includes:

the data storage module 1 is used for storing a sequence alignment library constructed by the Query protein sequences required for alignment and the genome sequence information of a plurality of species to be aligned;

a Blast comparison module 2 connected with the data storage module 1 and used for respectively comparing and analyzing the Query protein sequences and the sequence comparison library of each species through Blast, and respectively comparing each Query protein sequence to the adjacent region of the corresponding target genome in the sequence comparison library according to the comparison and analysis result to integrate, so as to obtain the comparison region information of each Query protein sequence in the genome in the sequence comparison library of each species;

further, the Blast comparison module 2 is also provided with a first thread unit for calculating whether the Blast comparison module supports multithreading, and if so, segmenting and analyzing data of the Query protein sequence and a sequence comparison library of one species, and performing multithreading Blast comparison analysis;

the comparison region information obtained by the Blast comparison module 2 in this embodiment includes the start position and the end position of the comparison region, the comparison rate, and the similarity information in the comparison region.

The optimization module 3 is connected with the Blast comparison module 2, is provided with an initialized list and set, and is used for comparing each Query protein in the Query protein sequences with the set in turn; when the current Query protein does not exist in the set, writing the current Query protein into the set, recording its comparison score as a first comparison score, and storing the first comparison score and the comparison region information in the list; when the current Query protein already exists in the set, calculating a second comparison score for the current Query protein being compared and, if the first comparison score stored in the list is smaller than the second comparison score, removing all information of the current Query protein from the list and writing the second comparison score and the comparison region information of the current Query protein being compared into the list, and otherwise performing no operation; and continuing until the best comparison result information of each Query protein has been written into the list;

specifically, the first comparison score and the second comparison score of the optimization module 3 are products of the comparison rate and the similarity of the Query proteins, for example, the first comparison score is a product of the comparison rate and the similarity of the current Query protein stored in the set, and the second comparison score is a product of the comparison rate and the similarity of the current Query protein.

The Exonerate comparison module 4 is connected with the optimization module 3 and with the data storage module 1, and is used for setting an extraction length base number, running the optimal comparison result information and the corresponding Query protein sequences through Exonerate, and extracting and recording the number information and variation condition of each exon of the Query protein sequences in different species according to the Exonerate results.

Preferably, the extraction length base number is set to 50000bp in the Exonerate comparison module.

Example 2

Based on the system of example 1, this example provides a Python language-based gene mutation detection method, and referring to fig. 2, the method includes the following steps:

in the embodiment, complete, high-quality Query protein sequences with few errors are preferably selected;

in the embodiment, the operating environment is first analyzed to determine whether a multithreaded program can be executed; specifically, parallel computation is performed using parallel, or multithreaded computation is performed by calling the compute nodes of a Linux server, and whether multithreading is supported is checked; if multithreading is supported, the data of the Query protein sequences and the sequence comparison library of one species are segmented, and multithreaded Blast comparison analysis is performed; otherwise, single-thread Blast comparison analysis of the Query protein sequences and the sequence comparison library of the species is performed;

then, Solar is used to integrate the regions where a gene is compared to adjacent parts of the target genome; during integration, the interval between neighboring comparison regions on the target genome is not more than 10000bp, and the comparison region of the protein in the genome is finally obtained, the comparison region information comprising the start position and the end position of the comparison region, the comparison rate and the similarity information in the comparison region; the comparison rate and the similarity are both derived from the Solar integration result: the comparison rate is defined as the ratio of the length of the compared region of the protein to the full length of the protein, and is 1 if the comparison region covers the full length; the similarity is defined as the proportion of identical sites to total sites in the comparison region; the comparison region of each gene in the target genome is stored for the subsequent extraction of the nucleotide sequence information of the region;
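
The integration and the two quantities defined above can be illustrated by the following sketch. It is not the Solar program and makes no claim about Solar's actual algorithm; it only groups the hits of one Query protein on one target sequence when neighboring hits lie within 10000bp, and then computes a comparison rate and a similarity. The Hit fields, the function names, and the reading of the comparison rate as the covered fraction of the protein length are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_GAP = 10000  # bp; interval limit taken from the description


@dataclass
class Hit:
    query: str
    query_len: int  # full length of the Query protein (residues)
    q_start: int    # 1-based inclusive protein coordinates
    q_end: int
    t_start: int    # coordinates on the target genome sequence
    t_end: int
    identical: int  # identical sites in this hit
    aligned: int    # aligned sites in this hit


def merge_adjacent(hits, max_gap=MAX_GAP):
    """Group hits (same Query, same target sequence) into nearby regions."""
    regions = []
    for hit in sorted(hits, key=lambda h: h.t_start):
        if regions and hit.t_start - max(h.t_end for h in regions[-1]) <= max_gap:
            regions[-1].append(hit)
        else:
            regions.append([hit])
    return regions


def summarize(region):
    """Return start, end, comparison rate and similarity for one region."""
    first = region[0]
    covered = set()
    for hit in region:
        covered.update(range(hit.q_start, hit.q_end + 1))
    return {
        "query": first.query,
        "start": min(h.t_start for h in region),
        "end": max(h.t_end for h in region),
        # Assumed reading: covered protein positions / protein length.
        "rate": len(covered) / first.query_len,
        # Identical sites / total aligned sites in the region.
        "similarity": sum(h.identical for h in region) / sum(h.aligned for h in region),
    }
```

In this sketch the strand and the target sequence name are assumed to have been used to pre-group the hits before merge_adjacent is called.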

S3: creating and initializing a list and a set, and comparing each Query protein in the Query protein sequences with the set in turn; when the current Query protein does not exist in the set, writing the current Query protein into the set, recording its comparison score as a first comparison score, and storing the first comparison score and the comparison region information in the list; when the current Query protein already exists in the set, calculating a second comparison score for the current Query protein being compared and, if the first comparison score stored in the list is smaller than the second comparison score, removing all information of the current Query protein from the list and writing the second comparison score and the comparison region information of the current Query protein being compared into the list, and otherwise performing no operation; and continuing until the optimal comparison result information of each Query protein has been written into the list;

after the integration in step S2, the Blast comparison results still contain considerable redundant information. This step keeps, for each Query protein, only the record with the highest comparison score and discards the rest. The specific operations are as follows. A list and a set are created and initialized in a Python environment, with n rows to be processed, and the comparison score is defined as the product of the comparison rate and the similarity. Following the order of the Blast comparison results from step S2, each Query protein is compared with the set in turn, from the 1st row to the nth row. If the current Query protein is not in the set, it is written into the set, and its comparison score is recorded as the first comparison score and stored in the list together with its comparison information (i.e., the comparison region information, which is the comparison result of the gene in the genome after integration by the Solar program). If the current Query protein is already in the set when a given row is processed, a comparison region for that Query protein has already been written into the list; that is, one Query protein has two pieces of comparison information, which constitute Blast comparison redundancy. In that case, the second comparison score of the Query protein in the current row is calculated and compared with the first comparison score stored in the list. If the first comparison score is smaller than the second comparison score, all information of that Query protein is removed from the list, the second comparison score and the comparison region information of the row being compared are written into the list, and the first comparison score is updated to the second comparison score; if not, the row is skipped. The loop continues until the best comparison result information of each Query protein has been written into the list. With this algorithm, the optimal comparison result information of each gene can be obtained in linear time;

further, the first comparison score and the second comparison score are products of the comparison rate and the similarity of the Query proteins, for example, the first comparison score is a product of the comparison rate and the similarity of the current Query protein stored in the set, and the second comparison score is a product of the comparison rate and the similarity of the current Query protein.
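
The list-and-set procedure of step S3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the original program: each input row is assumed to be a tuple (query_id, comparison rate, similarity, region), the function name select_best_hits is hypothetical, and the position dictionary is an addition that lets the stored record be replaced without scanning the list, so that the whole pass stays linear in the number of rows.

```python
def select_best_hits(rows):
    """Keep, for each Query protein, the row with the highest comparison score.

    rows: iterable of (query_id, rate, similarity, region) in the order of
    the integrated Blast results.
    """
    seen = set()   # the "set" of the description
    best = []      # the "list" of the description
    position = {}  # helper index (an assumption): query_id -> index in best

    for query_id, rate, similarity, region in rows:
        score = rate * similarity  # comparison score as defined in the text
        if query_id not in seen:
            # First occurrence: record the first comparison score.
            seen.add(query_id)
            position[query_id] = len(best)
            best.append((query_id, score, region))
        elif best[position[query_id]][1] < score:
            # Stored (first) score is smaller than the new (second) score:
            # replace the stored record with the current row.
            best[position[query_id]] = (query_id, score, region)
        # Otherwise no operation is performed.
    return best
```

Replacing the record in place has the same effect as removing the old information and writing the new information, while preserving the order in which the Query proteins were first seen.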

S4: comparing the list obtained in the step S3 with the Query protein sequence, removing sequences which are not compared in the Query protein sequence, and obtaining an updated Query protein sequence;

in this embodiment, the list obtained in step S3 holds the Query protein comparison results from which the Blast redundancy has been removed; in this step, the Query protein sequences are pruned correspondingly: the best comparison result information list of the current species obtained in step S3 is compared with the Query protein sequences, any sequence in the Query protein sequences that has no comparison information in the list is removed, the remaining sequences are used as the updated Query protein sequences, and the best comparison result file is stored in a new list;
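
The pruning of step S4 can be sketched as follows. The sketch assumes that the Query proteins are stored in FASTA format, that the identifier of a record is the first whitespace-delimited token of its header line, and that each entry of the best-result list begins with that identifier; the function name filter_queries is hypothetical.

```python
def filter_queries(fasta_text, best_hits):
    """Return FASTA text containing only the Query proteins found in best_hits.

    best_hits: iterable of tuples whose first element is the Query identifier.
    """
    kept_ids = {entry[0] for entry in best_hits}
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on the header's identifier.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in kept_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```

The returned text would serve as the updated Query protein sequences for the comparison with the next species.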

in the embodiment, each time one species has been compared with the Query protein sequences, redundancy is removed through steps S3-S4 to obtain updated Query protein sequences, which are then compared with the next species, until the optimal comparison result information of the Query protein sequences in each species is obtained;

After the redundancy removal of steps S1-S5, the Blast comparison result information has been integrated with Solar to obtain the optimal comparison region information of each Query protein sequence in the target genome. However, Blast omits comparison information shorter than 12bp (by default the Blast algorithm requires a comparison region longer than 12bp), so such regions are not integrated by the Solar program. In this step, 50000bp is therefore used as the extraction length base number, and the comparison region is lengthened in proportion to (1 - comparison rate). The comparison region is thus lengthened dynamically: the extracted length of the target genome grows linearly as the comparison rate falls, so that the running speed is not greatly affected and effective information is retained to the greatest extent;
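
The lengthening of the comparison region can be illustrated by the following sketch. The exact formula is not fully legible in the text, so the sketch assumes that the added length on each side equals the extraction length base number multiplied by (1 - comparison rate), and that the result is clipped to the bounds of the target sequence; the function name extend_region and the clipping behavior are assumptions.

```python
BASE_EXTENSION = 50000  # bp; extraction length base number from the text


def extend_region(start, end, rate, seq_len, base=BASE_EXTENSION):
    """Lengthen a comparison region on both sides by base * (1 - rate).

    Coordinates are 1-based and inclusive; the result is clipped to
    the range 1..seq_len. The formula is an assumed reading of the text.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("comparison rate must lie between 0 and 1")
    flank = round(base * (1.0 - rate))
    return max(1, start - flank), min(seq_len, end + flank)
```

Under this reading, a fully covered protein (comparison rate 1) gets no extra flank, while a protein with a comparison rate of 0.8 gets 10000bp added on each side.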

the optimal comparison region information with the lengthened region is input into Exonerate, and the corresponding Query protein sequence is extracted at the same time by a function in the program; that is, the optimal comparison result information of each species and the corresponding Query protein sequences are passed to Exonerate for running. Whether the current environment supports parallel computation is checked by the method of step S2; if so, multithreaded Exonerate analysis is performed, otherwise single-thread analysis is performed. The prediction results of all genes of the species are then obtained from the Exonerate run, and a corresponding algorithm, designed according to the format of the Exonerate result file, extracts the number information and variation condition of each exon. From these results, the number of exons and the insertion variation, deletion variation and frameshift mutation information of all exon sequences are summarized and written into a variation summary file for the current species. The same cycle is performed for the next species until the Exonerate analysis results of all species are obtained. Finally, according to the variation information of all species, the number of exons, the number of insertion variations, the number of deletion variations and the frameshift mutation information of each gene in all species are summarized and listed in a general table, completing the comparison of gene variation.
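
The final summarization into a general table can be sketched as follows. The sketch does not parse Exonerate output; it assumes that the per-species results have already been reduced to records with the hypothetical keys species, gene, exons, insertions, deletions and frameshifts, and it writes one tab-separated row per gene. The function name build_summary_table, the column naming and the "NA" placeholder are assumptions.

```python
import csv
import io

FIELDS = ("exons", "insertions", "deletions", "frameshifts")


def build_summary_table(records, species_order):
    """Return a tab-separated table with one row per gene.

    records: iterable of dicts with keys "species", "gene" and FIELDS.
    Genes without a result in a species are filled with "NA".
    """
    per_gene = {}
    for rec in records:
        per_gene.setdefault(rec["gene"], {})[rec["species"]] = rec

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["gene"] + [f"{sp}_{field}" for sp in species_order for field in FIELDS]
    )
    for gene in sorted(per_gene):
        row = [gene]
        for sp in species_order:
            rec = per_gene[gene].get(sp)
            row.extend("NA" if rec is None else rec[field] for field in FIELDS)
        writer.writerow(row)
    return buffer.getvalue()
```

A table of this shape makes it straightforward to pick out genes that carry a given kind of variation in only some of the species.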

The method integrates the Blast comparison results by calling the Solar program, merges the regions where a gene is compared to adjacent parts of the genome, and then filters the redundant parts of the result information, and on this basis meets the need to compare the variation of different genes across different species.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A Python language-based gene variation detection method is characterized by comprising the following steps:

S2: performing Blast comparison analysis on a Query protein sequence and a sequence comparison library of a species, and respectively comparing each Query protein in the Query protein sequence to a corresponding target genome adjacent region in the sequence comparison library according to a comparison analysis result to integrate, so as to obtain comparison region information of each Query protein in a genome in the sequence comparison library;

2. The method of claim 1, wherein the comparison region information comprises a start position and an end position of the comparison region, the comparison ratio and the similarity information in the comparison region.

3. The method of claim 2, wherein the first comparison score is the product of the comparison rate and the similarity of the current Query protein stored in the set, and the second comparison score is the product of the comparison rate and the similarity of the current Query protein being compared.

4. The method of claim 1, wherein the step of Blast alignment analysis of Query protein sequences with a library of sequence alignments of a species in step S2 specifically comprises:

5. The method according to claim 1, wherein the extraction length base is set to 50000bp in step S6.

6. A Python language-based gene variation detection system is characterized by comprising:

an optimization module, connected to the Blast comparison module, provided with an initialized list and set, for sequentially comparing each Query protein in the Query protein sequence with the set, when no current Query protein exists in the set, writing the current Query protein into the set, and recording the corresponding comparison score of the current Query protein as a first comparison score and comparison area information, storing the first comparison score and the comparison area information into the list, if the current Query protein exists in the set, calculating a second comparison score of the current Query protein being compared, when the first comparison score of the current Query protein stored in the list is smaller than the second comparison score, removing all information of the current Query protein existing in the list, writing the second comparison score and the comparison area information of the current Query protein being compared into the list, otherwise, not operating until the list writes the best comparison result information of each Query protein, and removing Query proteins which are not compared with the finally obtained list in the Query protein sequence to obtain an updated Query protein sequence;

7. The system according to claim 6, wherein the information of the comparison region obtained by the Blast comparison module includes a start position, an end position of the comparison region, the comparison rate and the similarity information in the comparison region.

8. The system of claim 7, wherein the first comparison score of the optimization module is the product of the comparison rate and the similarity of the current Query protein stored in the set, and the second comparison score is the product of the comparison rate and the similarity of the current Query protein being compared.

9. The system of claim 6, wherein the Blast alignment module is further provided with a thread unit for calculating whether the Blast alignment module supports multithreading, and if so, the data of the Query protein sequence and the sequence alignment library of one species are segmented and analyzed to perform the multithreading Blast alignment analysis.

10. The system of claim 6, wherein the extraction length base number is set to 50000bp in the Exonerate comparison module.