CN112634988A - Python language-based gene variation detection method and system - Google Patents

Info

Publication number
CN112634988A
Authority
CN
China
Prior art keywords
comparison
query protein
sequence
information
species
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110016893.9A
Other languages
Chinese (zh)
Other versions
CN112634988B (en)
Inventor
吕云云
李燕平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neijiang Normal University
Original Assignee
Neijiang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neijiang Normal University filed Critical Neijiang Normal University
Priority to CN202110016893.9A priority Critical patent/CN112634988B/en
Publication of CN112634988A publication Critical patent/CN112634988A/en
Application granted granted Critical
Publication of CN112634988B publication Critical patent/CN112634988B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Python-based gene variation detection method and system. The method comprises the following steps. S1: determine the Query protein sequences and the genome sequence information of a plurality of species. S2: run a Blast comparison analysis and integrate the results to obtain the comparison region information of each Query protein sequence. S3: create and initialize a list and a set, compare the Query proteins with the set one by one, and write them into the list so that the list holds the best comparison result information of each Query protein sequence. S4: remove the Query protein sequences that were not aligned to obtain updated Query protein sequences. S5: compare the updated Query protein sequences against the sequence comparison library of the next species, repeating steps S2 to S4 until the best comparison result information of the Query protein sequences in every species is obtained. S6: set an extraction length base, and extract and record the exon count and variation of each Query protein sequence in the different species. The method shortens comparison time, raises the effective comparison rate, and completes the gene variation detection.

Description

Python language-based gene variation detection method and system
Technical Field
The invention belongs to the technical field of gene detection, and particularly relates to a gene variation detection method and system based on Python language.
Background
A genome sequence contains hundreds of millions of base pairs, and biodiversity is reflected not only in phenotypic diversity but also in differences in the order of those base pairs. Phenotypic changes in a species are closely tied to the characteristics of its molecular sequences. With advances in sequencing technology, the genome sequence of a species can now be determined within hours, so analyzing how genome sequences vary between species has become important for revealing the phenotypes and genetic relationships of those species. Various sequence comparison tools are available, such as Blast, Exonerate, Genewise, Blat and Fasta, each with a different emphasis.
Blast-based sequence comparison quickly and effectively finds the parts of a Query sequence that are similar to a Target sequence and automatically keeps the segments whose score, computed from a scoring matrix, exceeds a threshold. However, the Blast output contains every comparison found, so the results are highly redundant and identifying and screening the useful information takes a long time.
Exonerate-based sequence comparison accurately locates the parts of the Query sequence that are similar to the Target sequence and can predict accurate gene structure. However, its speed drops exponentially as the Target sequence gets longer, so comparing directly against a whole genome sequence takes too long to be of practical use.
Disclosure of Invention
One object of the present invention is to provide a Python-based gene variation detection method that detects gene variation efficiently and accurately.
To achieve this object, the technical solution of the invention is a Python-based gene variation detection method comprising the following steps:
S1: determine the Query protein sequences required for comparison and the genome sequence information of the plurality of species to be compared, and construct a sequence comparison library from the genome sequence information;
S2: perform Blast comparison analysis of the Query protein sequences against the sequence comparison library of one species and, based on the results, integrate the hits of each Query protein sequence that fall in adjacent regions of the target genome, obtaining the comparison region information of each Query protein sequence in the genome of that library;
S3: create and initialize a list and a set, and compare each Query protein in the Query protein sequences with the set in turn. If the current Query protein is not in the set, write it into the set, record its comparison score as the first comparison score, and store the first comparison score and the comparison region information in the list. If the current Query protein is already in the set, calculate the second comparison score of the hit being compared; when the first comparison score stored in the list is smaller than the second comparison score, remove all information on the current Query protein from the list and write the second comparison score and the comparison region information of the hit being compared into the list; otherwise do nothing. Continue until the list holds the best comparison result information of each Query protein;
S4: compare the list obtained in step S3 with the Query protein sequences and remove the Query proteins that were not aligned, obtaining updated Query protein sequences;
S5: compare the updated Query protein sequences against the sequence comparison library of the next species, repeating steps S2 to S4 until the best comparison result information of the Query protein sequences in every species is obtained;
S6: set an extraction length base, run Exonerate on the best comparison result information of each species together with the corresponding Query protein sequences, and extract and record, from the Exonerate output, the exon count and variation of each Query protein sequence in the different species.
Further, the comparison region information includes the start position and end position of the comparison region, the comparison rate, and the similarity within the comparison region.
Further, the first comparison score is the product of the comparison rate and the similarity of the hit already stored for the current Query protein, and the second comparison score is the product of the comparison rate and the similarity of the hit currently being compared.
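The score definition above can be sketched as a small Python function. This is a minimal illustration rather than the patented implementation; the function name `comparison_score` and the example values are hypothetical, and both inputs are assumed to be fractions between 0 and 1.

```python
def comparison_score(comparison_rate: float, similarity: float) -> float:
    """Score of one hit: comparison rate multiplied by similarity.

    Both arguments are assumed to be fractions in [0, 1]
    (an assumption of this sketch, not a requirement stated in the text).
    """
    for name, value in (("comparison_rate", comparison_rate),
                        ("similarity", similarity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return comparison_rate * similarity


# Hypothetical values: the stored hit versus the hit being compared.
first_score = comparison_score(0.80, 0.90)
second_score = comparison_score(0.95, 0.85)

# The stored hit is replaced only when its score is strictly smaller.
replace_stored_hit = first_score < second_score
```

Because the score is a product, a hit that covers the whole protein with moderate identity can outrank a short, near-identical fragment.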
Further, the Blast comparison analysis of the Query protein sequences against the sequence comparison library of one species in step S2 specifically comprises:
using parallel to determine whether multithreading is supported and, if so, splitting the data of the Query protein sequences and the sequence comparison library of the species and running a multithreaded Blast comparison analysis;
otherwise, running a single-threaded Blast comparison analysis of the Query protein sequences against the sequence comparison library of the species.
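The choice between multithreaded and single-threaded analysis, and the splitting of the query data, might look like the sketch below. It uses `os.cpu_count()` as a stand-in for the `parallel` check named in the text, which is an assumption; the helper names `usable_workers` and `split_into_chunks` are hypothetical.

```python
import os


def usable_workers(requested=None):
    """Number of workers to use: 1 means single-threaded analysis."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def split_into_chunks(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    if not records:
        return []
    n = max(1, min(n_chunks, len(records)))
    return [records[i::n] for i in range(n)]


# Hypothetical query identifiers split for two workers.
chunks = split_into_chunks(["P1", "P2", "P3", "P4", "P5"], 2)
mode = "multithreaded" if usable_workers() > 1 else "single-threaded"
```

Round-robin splitting keeps the chunks close in size without needing to know the sequence lengths in advance.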
Further, the extraction length base in step S6 is set to 50000 bp.
A second object of the invention is to provide a Python-based gene variation detection system for detecting variation information in gene comparisons.
To achieve this object, the technical solution of the invention is a Python-based gene variation detection system comprising:
a data storage module for storing the Query protein sequences required for comparison and the sequence comparison library constructed from the genome sequence information of the plurality of species to be compared;
a Blast comparison module, connected to the data storage module, for running Blast comparison analysis of the Query protein sequences against the sequence comparison library of each species and, based on the results, integrating the hits of each Query protein sequence that fall in adjacent regions of the target genome, so as to obtain the comparison region information of each Query protein sequence in the genome of each species' library;
an optimization module, connected to the Blast comparison module and provided with an initialized list and set, for comparing each Query protein in the Query protein sequences with the set in turn. If the current Query protein is not in the set, the module writes it into the set, records its comparison score as the first comparison score, and stores the first comparison score and the comparison region information in the list. If the current Query protein is already in the set, the module calculates the second comparison score of the hit being compared; when the first comparison score stored in the list is smaller than the second comparison score, it removes all information on the current Query protein from the list and writes the second comparison score and the comparison region information of the hit being compared into the list; otherwise it does nothing. This continues until the list holds the best comparison result information of each Query protein, after which the module removes the Query protein sequences that have no entry in the final list, obtaining updated Query protein sequences;
and an Exonerate comparison module, connected to both the optimization module and the data storage module, for setting an extraction length base, running Exonerate on the best comparison result information together with the corresponding Query protein sequences, and extracting and recording, from the Exonerate output, the exon count and variation of each Query protein sequence in the different species.
Further, the comparison region information obtained by the Blast comparison module includes the start position and end position of the comparison region, the comparison rate, and the similarity within the comparison region.
Further, in the optimization module the first comparison score is the product of the comparison rate and the similarity of the hit already stored for the current Query protein, and the second comparison score is the product of the comparison rate and the similarity of the hit currently being compared.
Further, the Blast comparison module is also provided with a thread unit that determines whether multithreading is supported and, if so, splits the data of the Query protein sequences and the sequence comparison library of one species and runs a multithreaded Blast comparison analysis.
Further, the extraction length base in the Exonerate comparison module is set to 50000 bp.
Compared with the prior art, the invention has the following advantages:
The invention provides a Python-based gene variation detection method and system that address two shortcomings of current gene-to-genome comparison: highly redundant comparison information and slow comparison. The method shortens comparison time and raises the effective comparison rate. It also integrates the comparison results accurately, combines the gene comparison and variation information in the results, and presents them reliably in table form, so that genes with specific variation in particular species are easy to select. This greatly facilitates the study of gene function and subsequent functional verification.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic structural diagram of an embodiment of a Python language-based genetic variation detection system according to the present invention;
FIG. 2 is a flowchart of an embodiment of a Python language-based genetic variation detection method according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present invention.
The examples are given to better illustrate the invention, which is not limited to them. Insubstantial modifications and adaptations that those skilled in the art make to the embodiments in light of the above teachings therefore remain within the scope of the invention.
Example 1
FIG. 1 is a schematic structural diagram of a Python-based gene variation detection system according to the present invention. Specifically, the system includes:
a data storage module 1 for storing the Query protein sequences required for comparison and the sequence comparison library constructed from the genome sequence information of the plurality of species to be compared;
a Blast comparison module 2, connected to the data storage module 1, for running Blast comparison analysis of the Query protein sequences against the sequence comparison library of each species and, based on the results, integrating the hits of each Query protein sequence that fall in adjacent regions of the target genome, so as to obtain the comparison region information of each Query protein sequence in the genome of each species' library.
Further, the Blast comparison module 2 is also provided with a first thread unit that determines whether multithreading is supported and, if so, splits the data of the Query protein sequences and the sequence comparison library of one species and runs a multithreaded Blast comparison analysis.
In this embodiment, the comparison region information obtained by the Blast comparison module 2 includes the start position and end position of the comparison region, the comparison rate, and the similarity within the comparison region.
An optimization module 3, connected to the Blast comparison module 2 and provided with an initialized list and set, for comparing each Query protein in the Query protein sequences with the set in turn. If the current Query protein is not in the set, the module writes it into the set, records its comparison score as the first comparison score, and stores the first comparison score and the comparison region information in the list. If the current Query protein is already in the set, the module calculates the second comparison score of the hit being compared; when the first comparison score stored in the list is smaller than the second comparison score, it removes all information on the current Query protein from the list and writes the second comparison score and the comparison region information of the hit being compared into the list; otherwise it does nothing. This continues until the list holds the best comparison result information of each Query protein.
Specifically, both comparison scores of the optimization module 3 are products of the comparison rate and the similarity of a Query protein hit: the first comparison score is that product for the hit already stored for the current Query protein, and the second comparison score is that product for the hit currently being compared.
An Exonerate comparison module 4, connected to both the optimization module 3 and the data storage module 1, for setting an extraction length base, running Exonerate on the best comparison result information together with the corresponding Query protein sequences, and extracting and recording, from the Exonerate output, the exon count and variation of each Query protein sequence in the different species.
Preferably, the extraction length base in the Exonerate comparison module is set to 50000 bp.
Example 2
Based on the system of Example 1, this example provides a Python-based gene variation detection method. Referring to FIG. 2, the method includes the following steps:
S1: determine the Query protein sequences required for comparison and the genome sequence information of the plurality of species to be compared, and construct a sequence comparison library from the genome sequence information.
In this embodiment, complete, high-quality Query protein sequences with few errors are preferred.
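Building the sequence comparison library and launching the Blast search could be prepared as in the sketch below. The text only says "Blast"; using `makeblastdb` with a nucleotide database and `tblastn` (protein queries against a nucleotide genome) is an assumption of this sketch, as are the E-value default and all file names. The functions only build argument lists and do not run anything.

```python
def makeblastdb_command(genome_fasta, db_name):
    """Argument list that builds a nucleotide BLAST database from a genome."""
    return ["makeblastdb", "-in", genome_fasta,
            "-dbtype", "nucl", "-out", db_name]


def tblastn_command(query_fasta, db_name, out_path, threads=1, evalue=1e-5):
    """Argument list for a protein-vs-genome search with tabular output."""
    return ["tblastn",
            "-query", query_fasta,
            "-db", db_name,
            "-out", out_path,
            "-outfmt", "6",               # tab-separated hit table
            "-evalue", str(evalue),       # assumed cutoff, not from the text
            "-num_threads", str(threads)]


# Hypothetical file names for one species.
build_cmd = makeblastdb_command("species1.genome.fa", "species1_db")
search_cmd = tblastn_command("query.pep.fa", "species1_db",
                             "species1.blast.tsv", threads=4)
```

Each list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.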
S2: perform Blast comparison analysis of the Query protein sequences against the sequence comparison library of one species and, based on the results, integrate the hits of each Query protein sequence that fall in adjacent regions of the target genome, obtaining the comparison region information of each Query protein sequence in the genome of that library.
In this embodiment, the operating environment is analyzed first to determine whether a multithreaded program can be executed. Specifically, the check establishes whether parallel computation is available through parallel, or whether multithreaded computation is possible by calling the compute nodes of a Linux server. If multithreading is supported, the data of the Query protein sequences and the sequence comparison library of the species are split and a multithreaded Blast comparison analysis is run; otherwise a single-threaded Blast comparison analysis of the Query protein sequences against the sequence comparison library of the species is run.
Solar is then used to integrate the hits of each gene that fall in adjacent regions of the target genome; during integration, the interval between neighbouring comparison regions on the target genome may not exceed 10000 bp. This yields the comparison region of each protein in the genome, whose information includes the start position and end position of the comparison region, the comparison rate, and the similarity within the comparison region. The comparison rate and the similarity are both derived from the Solar integration result. The comparison rate is the ratio of the aligned part of the protein to the full length of the protein, and equals 1 when the comparison region covers the full length. The similarity is the proportion of identical sites among all sites in the comparison region. The comparison region of each gene in the target genome is stored so that the nucleotide sequence of the region can be extracted later.
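The integration of neighbouring hits and the two derived measures can be illustrated with the sketch below. It is not the Solar program and does not reproduce its algorithm; it only chains hits of one Query protein on one target sequence when the gap is at most 10000 bp, ignoring strand. The `Hit` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    target: str     # scaffold or chromosome name
    q_start: int    # protein coordinates, 1-based, inclusive
    q_end: int
    t_start: int    # genome coordinates, 1-based, inclusive
    t_end: int
    identical: int  # identical sites in this hit
    aligned: int    # aligned sites in this hit


def covered_length(intervals):
    """Length of the union of closed integer intervals (coordinates >= 1)."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def summarise(chain, query_length):
    """Collapse a chain of hits into one comparison region record."""
    return {
        "query": chain[0].query,
        "target": chain[0].target,
        "start": min(h.t_start for h in chain),
        "end": max(h.t_end for h in chain),
        "rate": covered_length([(h.q_start, h.q_end) for h in chain]) / query_length,
        "similarity": sum(h.identical for h in chain) / sum(h.aligned for h in chain),
    }


def merge_adjacent_hits(hits, query_length, max_gap=10000):
    """Chain hits of one query on one target when gaps are <= max_gap bp."""
    regions, chain = [], []
    for hit in sorted(hits, key=lambda h: h.t_start):
        if chain and hit.t_start - max(h.t_end for h in chain) > max_gap:
            regions.append(summarise(chain, query_length))
            chain = []
        chain.append(hit)
    if chain:
        regions.append(summarise(chain, query_length))
    return regions


# Hypothetical hits of a 100-residue protein on one chromosome.
hits = [
    Hit("P1", "chr1", 1, 50, 1000, 1149, 45, 50),
    Hit("P1", "chr1", 51, 100, 5000, 5149, 40, 50),
    Hit("P1", "chr1", 1, 30, 900000, 900089, 15, 30),
]
regions = merge_adjacent_hits(hits, query_length=100)
```

Here the first two hits lie 3851 bp apart and are merged into one region covering the whole protein, while the distant third hit forms a separate, partial region.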
S3: create and initialize a list and a set, and compare each Query protein in the Query protein sequences with the set in turn. If the current Query protein is not in the set, write it into the set, record its comparison score as the first comparison score, and store the first comparison score and the comparison region information in the list. If the current Query protein is already in the set, calculate the second comparison score of the hit being compared; when the first comparison score stored in the list is smaller than the second comparison score, remove all information on the current Query protein from the list and write the second comparison score and the comparison region information of the hit being compared into the list; otherwise do nothing. Continue until the list holds the best comparison result information of each Query protein.
Even after the integration in step S2, the Blast results still contain considerable redundancy. This step keeps, for each protein, the hit with the highest comparison score and discards the rest. Specifically, a list and a set are created and initialized in the Python environment, and the comparison score is defined as the product of the comparison rate and the similarity. Following the order of the Blast results from step S2, rows 1 to n are processed in turn and the Query protein of each row is compared with the set. If the Query protein is not in the set, it is written into the set, and its comparison score is recorded as the first comparison score and stored in the list together with its comparison information (that is, the comparison region information of the gene in the genome after integration by the Solar program). If the Query protein of the current row is already in the set, a comparison region for it has already been written to the list; the protein then has two sets of comparison information, which is exactly the Blast redundancy. In that case the second comparison score of the hit in the current row is calculated and compared with the first comparison score stored in the list. If the first comparison score is smaller, all information on the Query protein is removed from the list, the second comparison score and the comparison region information of the current hit are written into the list, and the first comparison score is updated to the second comparison score; if not, the row is skipped. The loop continues until the best comparison result information of every Query protein has been written into the list. With this algorithm the best comparison result information of each gene is obtained in linear time.
Further, both comparison scores are products of the comparison rate and the similarity of a Query protein hit: the first comparison score is that product for the hit already stored for the current Query protein, and the second comparison score is that product for the hit currently being compared.
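The list-and-set selection described in step S3 can be sketched as follows. This is one possible reading of the procedure, not the inventors' code; the record layout `(query_id, comparison_rate, similarity, region)` and the function name `best_hits` are hypothetical.

```python
def best_hits(records):
    """Keep the highest-scoring hit per Query protein.

    records: iterable of (query_id, comparison_rate, similarity, region),
    in the order of the integrated Blast results.
    Returns a list of (query_id, score, region).
    """
    seen = set()   # Query proteins already written
    best = []      # best hit recorded so far for each Query protein
    for query_id, rate, similarity, region in records:
        score = rate * similarity
        if query_id not in seen:
            # First hit for this protein: its score is the first comparison score.
            seen.add(query_id)
            best.append((query_id, score, region))
            continue
        # Protein seen before: this score is the second comparison score.
        index = next(i for i, entry in enumerate(best) if entry[0] == query_id)
        if best[index][1] < score:
            best[index] = (query_id, score, region)  # replace the stored hit
    return best


# Hypothetical integrated hits for two proteins.
records = [
    ("P1", 0.80, 0.90, ("chr1", 100, 900)),
    ("P2", 1.00, 0.70, ("chr2", 50, 400)),
    ("P1", 0.95, 0.85, ("chr3", 10, 950)),
    ("P2", 0.50, 0.60, ("chr2", 5000, 5400)),
]
result = best_hits(records)
```

The sketch overwrites the stored entry in place instead of deleting and re-appending it, and it finds the entry by scanning the list. Keeping a dictionary from Query protein to list position would make each lookup constant time, which is one way to keep the whole pass linear.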
S4: compare the list obtained in step S3 with the Query protein sequences and remove the sequences that were not aligned, obtaining updated Query protein sequences.
In this embodiment, the list obtained in step S3 is free of the redundant Blast information for the Query protein sequences, and this step removes the corresponding redundancy from the Query protein sequences themselves. Specifically, the best comparison result list of the current species obtained in step S3 is compared with the Query protein sequences; sequences with no comparison information in the list are removed, the remaining sequences become the updated Query protein sequences, and the best comparison results are stored in a new list.
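Step S4 amounts to filtering the Query FASTA records by the identifiers present in the best-hit list. The sketch below assumes the queries are held as FASTA text and that each list entry starts with the Query identifier; `parse_fasta` and `keep_aligned` are hypothetical helper names.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def keep_aligned(queries, best_list):
    """Drop Query proteins that have no entry in the best-hit list."""
    aligned = {entry[0] for entry in best_list}
    return {name: seq for name, seq in queries.items() if name in aligned}


# Hypothetical input: three proteins, of which P2 produced no hit.
fasta_text = ">P1 example protein\nMKT\nAAL\n>P2\nMSS\n>P3\nMGG\n"
queries = parse_fasta(fasta_text)
updated = keep_aligned(queries, [("P1", 0.81, None), ("P3", 0.50, None)])
```

Dropping unaligned proteins before the next species is what shrinks the search in later rounds.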
S5: compare the updated Query protein sequences against the sequence comparison library of the next species, repeating steps S2 to S4 until the best comparison result information of the Query protein sequences in every species is obtained.
In this embodiment, after the Query protein sequences have been compared against one species, redundancy is removed through steps S3 and S4 to obtain the updated Query protein sequences, which are then compared against the next species. This continues until the best comparison result information of the Query protein sequences in every species has been obtained.
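The species loop of step S5 can be sketched with the per-species work passed in as a callable. `align_species` is a hypothetical placeholder for steps S2 and S3, and `fake_align` below exists only to make the example self-contained.

```python
def run_all_species(queries, species_names, align_species):
    """Run steps S2 to S4 for each species in turn.

    align_species(queries, species) must return a list of
    (query_id, score, region) best hits; it is a hypothetical callable.
    Returns (best hits per species, Query proteins left after the last species).
    """
    results = {}
    current = dict(queries)
    for species in species_names:
        best = align_species(current, species)
        results[species] = best
        aligned = {entry[0] for entry in best}
        # S4: only proteins that aligned move on to the next species.
        current = {q: seq for q, seq in current.items() if q in aligned}
    return results, current


# Stand-in aligner with fixed, made-up results.
FAKE_HITS = {"sp1": ["P1", "P2"], "sp2": ["P1", "P3"]}


def fake_align(queries, species):
    return [(q, 1.0, None) for q in FAKE_HITS[species] if q in queries]


per_species, remaining = run_all_species(
    {"P1": "MKT", "P2": "MSS", "P3": "MGG"}, ["sp1", "sp2"], fake_align)
```

In this example P3 is dropped after the first species, so it is never searched in the second one even though a hit would exist there.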
S6: set an extraction length base, run Exonerate on the best comparison result information of each species together with the corresponding Query protein sequences, and extract and record, from the Exonerate output, the exon count and variation of each Query protein sequence in the different species.
Steps S1 to S5 remove redundancy and integrate the Blast results with Solar to obtain the best comparison region of each Query protein sequence in the target genome. Blast, however, omits alignments shorter than 12 bp (by default the Blast algorithm only reports comparison regions longer than 12 bp), so such alignments are never integrated by the Solar program. This step therefore uses 50000 bp as the extraction length base and lengthens each comparison region by the base multiplied by (1 - comparison rate). The region is thus extended dynamically: the extracted length of the target genome grows linearly as the comparison rate falls, which has little effect on running speed while retaining as much useful information as possible.
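The dynamic extension can be written out as a short function. The sketch reads the passage as adding 50000 × (1 − comparison rate) bp; applying that amount on each side of the region and clamping to the target sequence are assumptions, and `extend_region` is a hypothetical name.

```python
def extend_region(start, end, rate, target_length, base=50000):
    """Widen a comparison region in proportion to the unaligned fraction.

    start, end: 1-based inclusive coordinates on the target sequence.
    rate: comparison rate as a fraction in [0, 1].
    The flank base * (1 - rate) is added on both sides (an assumption of
    this sketch) and the result is clamped to [1, target_length].
    """
    flank = int(round(base * (1.0 - rate)))
    return max(1, start - flank), min(target_length, end + flank)


# Hypothetical regions on a 1 Mbp scaffold.
partial = extend_region(120000, 125000, 0.8, 1_000_000)
complete = extend_region(120000, 125000, 1.0, 1_000_000)
near_edge = extend_region(500, 2000, 0.5, 10_000)
```

A fully covered protein (rate 1) is extracted without extra flank, while a half-covered one gets 25000 bp of additional sequence per side, limited by the ends of the scaffold.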
The best comparison region information, with the lengthened regions, is passed to Exonerate, and the corresponding Query protein sequences are extracted by a function in the program. Before the best comparison result information of each species and the corresponding Query protein sequences are run, the current environment is checked for parallel computation using the method of step S2; if it is supported, Exonerate is run with multiple threads, otherwise with a single thread. Exonerate then yields the prediction results for all genes of the species. An algorithm designed around the format of the Exonerate result file extracts the exon count and variation of each gene, summarizing the number of exons and the insertion, deletion and frameshift mutations across all exon sequences, and writes them to a variation summary file for the current species. The same loop is run for the next species until the Exonerate results of all species are available. Finally, from the variation information of all species, the number of exons, insertions, deletions and frameshift mutations of each gene in every species is collected into one overall table, completing the comparison of gene variation.
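The final summary table might be assembled as in the sketch below. It starts from per-species records that are assumed to have been parsed already from the Exonerate result files; the record keys, the column naming and the `NA` placeholder are all hypothetical.

```python
FIELDS = ("exons", "insertions", "deletions", "frameshifts")


def build_summary(records, species_order):
    """Collect per-gene counts for every species into rows of one table.

    records: iterable of dicts with keys 'species', 'gene' and the FIELDS.
    Returns a list of rows; the first row is the header.
    """
    per_gene = {}
    for rec in records:
        per_gene.setdefault(rec["gene"], {})[rec["species"]] = rec
    rows = [["gene"] + [f"{sp}_{field}" for sp in species_order for field in FIELDS]]
    for gene in sorted(per_gene):
        row = [gene]
        for sp in species_order:
            rec = per_gene[gene].get(sp)
            # Genes without a result in a species are marked NA.
            row.extend(str(rec[field]) if rec else "NA" for field in FIELDS)
        rows.append(row)
    return rows


# Made-up counts for two genes in two species.
records = [
    {"species": "sp1", "gene": "G1", "exons": 5, "insertions": 1, "deletions": 0, "frameshifts": 0},
    {"species": "sp2", "gene": "G1", "exons": 4, "insertions": 0, "deletions": 2, "frameshifts": 1},
    {"species": "sp1", "gene": "G2", "exons": 3, "insertions": 0, "deletions": 0, "frameshifts": 0},
]
rows = build_summary(records, ["sp1", "sp2"])
table_text = "\n".join("\t".join(row) for row in rows)
```

With one row per gene and one column group per species, a gene whose exon count or frameshift count differs in a single species stands out when the table is scanned or sorted.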
The method integrates the Blast alignment results by calling the Solar program, which merges the alignments of a gene that fall in adjacent regions of the genome; it then filters the redundant part of the result information, and thereby provides a procedure that meets the need to compare the variation of different genes across different species.
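The redundancy filter can be sketched with the list and set that the method uses, scoring each alignment as the product of its alignment rate and similarity. The record layout (a dictionary with `query`, `start`, `end`, `align_rate` and `similarity` keys) and the function names are assumptions made for illustration, not the format produced by Solar.

```python
def best_hits(hits):
    """Keep, for every Query protein, only its highest-scoring alignment."""
    seen = set()   # Query proteins already encountered
    best = []      # best alignment kept so far for each Query protein
    for hit in hits:
        score = hit["align_rate"] * hit["similarity"]
        if hit["query"] not in seen:
            # First alignment of this protein: record it as the first score.
            seen.add(hit["query"])
            best.append({**hit, "score": score})
            continue
        # Protein seen before: compare the stored score with the new one.
        idx = next(i for i, kept in enumerate(best) if kept["query"] == hit["query"])
        if best[idx]["score"] < score:
            best[idx] = {**hit, "score": score}  # replace the stored record
        # Otherwise nothing is done and the stored record stays.
    return best


def drop_unaligned(queries, best):
    """Remove Query proteins that have no alignment in the final list."""
    aligned = {kept["query"] for kept in best}
    return [query for query in queries if query in aligned]
```

The stored record is replaced in place here rather than removed and appended again, which keeps the same content in the list but preserves the original order. A dictionary keyed by Query protein would avoid the linear search; the list and set are kept only to mirror the description.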
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A Python language-based gene variation detection method, characterized by comprising the following steps:
S1: determining the Query protein sequences required for alignment and the genome sequence information of a plurality of species to be aligned, and constructing a sequence alignment library from the genome sequence information;
S2: performing Blast alignment analysis of the Query protein sequences against the sequence alignment library of one species and, according to the alignment result, integrating the alignments of each Query protein that fall in adjacent regions of the corresponding target genome in the sequence alignment library, so as to obtain the alignment region information of each Query protein in the genome of the sequence alignment library;
S3: creating and initializing a list and a set, and checking each Query protein of the Query protein sequences against the set in turn; when the current Query protein is not in the set, writing it into the set, recording its alignment score as a first alignment score together with its alignment region information, and storing the first alignment score and the alignment region information in the list; when the current Query protein is already in the set, calculating a second alignment score for the alignment currently being examined and, if the first alignment score stored in the list for the current Query protein is smaller than the second alignment score, removing all information of the current Query protein from the list and writing the second alignment score and the alignment region information of the alignment currently being examined into the list, otherwise performing no operation; and continuing until the list holds the best alignment result information of each Query protein;
S4: comparing the list obtained in step S3 with the Query protein sequences, and removing the Query proteins that were not aligned, to obtain updated Query protein sequences;
S5: aligning the updated Query protein sequences against the sequence alignment library of the next species, and repeating steps S2-S4 until the optimal alignment result information of the Query protein sequences in each species is obtained;
S6: setting an extraction length base, running Exonerate on the optimal alignment result information of each species together with the corresponding Query protein sequences, and extracting and recording, from the Exonerate output, the number and the variation status of each exon of the Query protein sequences in the different species.
2. The method of claim 1, wherein the alignment region information comprises the start position and the end position of the alignment region, and the alignment rate and the similarity within the alignment region.
3. The method of claim 2, wherein the first alignment score is the product of the alignment rate and the similarity of the current Query protein already stored in the set, and the second alignment score is the product of the alignment rate and the similarity of the current Query protein being aligned.
4. The method of claim 1, wherein the Blast alignment analysis of the Query protein sequences against the sequence alignment library of one species in step S2 specifically comprises:
determining, by using parallel, whether multithreading is supported; if so, splitting the data of the Query protein sequences and of the sequence alignment library of the species, and performing multi-threaded Blast alignment analysis;
otherwise, performing single-threaded Blast alignment analysis of the Query protein sequences against the sequence alignment library of the species.
5. The method according to claim 1, wherein the extraction length base is set to 50000 bp in step S6.
6. A Python language-based gene variation detection system, characterized by comprising:
a data storage module for storing the Query protein sequences required for alignment and the sequence alignment libraries constructed from the genome sequence information of a plurality of species to be aligned;
a Blast alignment module, connected with the data storage module, for performing Blast alignment analysis of the Query protein sequences against the sequence alignment library of each species and, according to the alignment result, integrating the alignments of each Query protein that fall in adjacent regions of the corresponding target genome in the sequence alignment library, so as to obtain the alignment region information of each Query protein in the genome of the sequence alignment library of each species;
an optimization module, connected with the Blast alignment module and provided with an initialized list and set, for checking each Query protein of the Query protein sequences against the set in turn; when the current Query protein is not in the set, writing it into the set, recording its alignment score as a first alignment score together with its alignment region information, and storing the first alignment score and the alignment region information in the list; when the current Query protein is already in the set, calculating a second alignment score for the alignment currently being examined and, if the first alignment score stored in the list for the current Query protein is smaller than the second alignment score, removing all information of the current Query protein from the list and writing the second alignment score and the alignment region information of the alignment currently being examined into the list, otherwise performing no operation, until the list holds the best alignment result information of each Query protein; and for removing from the Query protein sequences the Query proteins that do not appear in the finally obtained list, to obtain updated Query protein sequences;
and an Exonerate alignment module, connected with the optimization module and with the data storage module, for setting an extraction length base, running Exonerate on the optimal alignment result information together with the corresponding Query protein sequences, and extracting and recording, from the Exonerate output, the number and the variation status of each exon of the Query protein sequences in the different species.
7. The system according to claim 6, wherein the alignment region information obtained by the Blast alignment module comprises the start position and the end position of the alignment region, and the alignment rate and the similarity within the alignment region.
8. The system of claim 7, wherein, in the optimization module, the first alignment score is the product of the alignment rate and the similarity of the current Query protein already stored in the set, and the second alignment score is the product of the alignment rate and the similarity of the current Query protein being aligned.
9. The system of claim 6, wherein the Blast alignment module is further provided with a thread unit for determining whether the Blast alignment module supports multithreading and, if so, splitting the data of the Query protein sequences and of the sequence alignment library of one species and performing multi-threaded Blast alignment analysis.
10. The system of claim 6, wherein the extraction length base is set to 50000 bp in the Exonerate alignment module.
CN202110016893.9A 2021-01-07 2021-01-07 Python language-based gene variation detection method and system Active CN112634988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016893.9A CN112634988B (en) 2021-01-07 2021-01-07 Python language-based gene variation detection method and system

Publications (2)

Publication Number Publication Date
CN112634988A true CN112634988A (en) 2021-04-09
CN112634988B CN112634988B (en) 2021-10-08

Family

ID=75290967

Country Status (1)

Country Link
CN (1) CN112634988B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758724A (en) * 2022-05-23 2022-07-15 内江师范学院 Antibacterial peptide screening method and system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104603283A (en) * 2012-08-01 2015-05-06 深圳华大基因研究院 Method and system to determine biomarkers related to abnormal condition
CN105219765A (en) * 2015-11-09 2016-01-06 中国水产科学研究院 Protein sequence is utilized to build genomic method and apparatus
CN105760711A (en) * 2016-02-02 2016-07-13 江南大学 Method for using KNN calculation and similarity comparison to predict protein subcellular section
CN106682393A (en) * 2016-11-29 2017-05-17 北京荣之联科技股份有限公司 Genomic sequence alignment method and genomic sequence alignment device
CN107679616A (en) * 2017-10-20 2018-02-09 江南大学 A kind of residue interactive network alignment algorithm SI MAGNA of calling sequence information
CN108009405A (en) * 2017-12-26 2018-05-08 重庆佰诺吉生物科技有限公司 A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter
US20180240032A1 (en) * 2017-02-23 2018-08-23 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on a quantum processing platform
CN108604260A (en) * 2016-01-11 2018-09-28 艾迪科基因组公司 For scene or the genomics architecture of DNA based on cloud and RNA processing and analysis
US20190206512A1 (en) * 2017-07-21 2019-07-04 James Lu Genomic services platform supporting multiple application providers
CN110600078A (en) * 2019-08-23 2019-12-20 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
WO2020023882A1 (en) * 2018-07-27 2020-01-30 Myriad Women's Health, Inc. Method for detecting genetic variation in highly homologous sequences by independent alignment and pairing of sequence reads
CN110993023A (en) * 2019-11-29 2020-04-10 北京优迅医学检验实验室有限公司 Detection method and detection device for complex mutation
CN111145833A (en) * 2019-12-16 2020-05-12 南京理工大学 Deep multi-sequence alignment method for protein complex
CN111276189A (en) * 2020-02-26 2020-06-12 广州市金域转化医学研究院有限公司 Chromosome balance translocation detection and analysis system based on NGS and application thereof
CN111304308A (en) * 2020-03-02 2020-06-19 北京泛生子基因科技有限公司 Method for auditing detection result of high-throughput sequencing gene variation
CN111445949A (en) * 2020-03-27 2020-07-24 武汉古奥基因科技有限公司 Method for annotating genome of high-altitude polyploid fish by using nanopore sequencing data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIRAG JAIN et al.: "Accelerating Sequence Alignment to Graphs", 《BIORXIV》 *
SVEN WARRIS et al.: "pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment", 《PLOS ONE》 *
牟少华 et al.: "Whole-genome resequencing analysis of culm shape and culm color variation in four Moso bamboo forms", 《华北农学报》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant