CN111402956A - Sequence comparison method, device, equipment and medium - Google Patents
Sequence comparison method, device, equipment and medium
- Publication number
- CN111402956A (CN202010130211.2A)
- Authority
- CN
- China
- Prior art keywords
- target
- reference sub
- sequence
- segment
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a sequence comparison method, a device, equipment and a medium, wherein the method comprises the following steps: segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared; segmenting a target reference sequence obtained in advance to obtain a reference sub-segment; comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score; screening a target reference sub-segment from the reference sub-segments according to the target score; and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence. Therefore, invalid matching positions can be filtered, computing resources are saved, the sequence comparison performance is improved, and the workload of subsequent sequence expansion is reduced.
Description
Technical Field
The present application relates to the field of genetic testing technologies, and in particular, to a method, an apparatus, a device, and a medium for sequence comparison.
Background
With the rapid development of gene detection technology, techniques that extract an individual's genes, compare the gene sequences, predict the likelihood of various diseases, pinpoint the genes responsible for pathological changes, and enable early prevention and treatment have matured. The human genome contains about 3 billion base pairs, and completing the gene sequence alignment for one person on a general-purpose computer software platform takes several days. Sequence alignment mainly comprises two stages: seed finding and extension. To improve alignment accuracy, as many as possible of the positions at which the seeds of the sequence read to be aligned occur in the reference sequence must be found. In the prior art, a large number of invalid positions are compared, which wastes a large amount of computing resources and greatly reduces the performance of the overall sequence alignment.
Disclosure of Invention
In view of this, an object of the present application is to provide a method, an apparatus, a device, and a medium for sequence alignment, which can filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension. The specific scheme is as follows:
In a first aspect, the present application discloses a sequence alignment method, comprising:
segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
segmenting a target reference sequence obtained in advance to obtain a reference sub-segment;
comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
screening a target reference sub-segment from the reference sub-segments according to the target score;
and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
Optionally, the comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score includes:
comparing all target seeds in the sequence read to be compared with each of the reference sub-segments in turn according to the preset rule, to obtain a target score for each reference sub-segment.
Optionally, comparing all target seeds in the sequence read to be compared with any reference sub-segment according to a preset rule to obtain a target score, including:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment, and judging whether a target condition occurs in the comparison process;
and if the target condition occurs in the comparison process, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
Optionally, the determining whether the target condition occurs in the comparison process includes:
judging whether, in the comparison process, all target seeds in the sequence read to be compared hit on the reference sub-segment but the hit positions on the reference sub-segment are discontinuous;
and/or judging whether, in the comparison process, a first target seed among all the target seeds in the sequence read to be compared hits on the reference sub-segment while a second target seed among all the target seeds does not hit on the reference sub-segment.
Optionally, comparing all target seeds in the sequence read to be compared with any reference sub-segment according to a preset rule to obtain a target score, including:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment;
and if the target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
Optionally, the screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
judging whether the target score is greater than or equal to a preset score threshold value;
and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
Optionally, the screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
normalizing the target score;
judging whether the normalized target score is greater than or equal to a preset score threshold value or not;
and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
In a second aspect, the present application discloses a sequence alignment apparatus comprising:
the first segmentation module is used for segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
the second segmentation module is used for segmenting a target reference sequence obtained in advance to obtain a reference sub-segment;
the comparison module is used for comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
the fragment screening module is used for screening target reference sub-fragments from the reference sub-fragments according to the target scores;
a position determining module, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be aligned in the target reference sequence.
In a third aspect, the present application discloses a sequence alignment device comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is used for executing the computer program to realize the sequence alignment method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the sequence alignment method disclosed above.
It can be seen that, in the present application, a sequence read to be compared is first segmented to obtain the corresponding target seeds, and a pre-obtained target reference sequence is segmented to obtain reference sub-segments. All target seeds in the sequence read to be compared are then compared with the reference sub-segments according to a preset rule to obtain target scores, target reference sub-segments are screened out from the reference sub-segments according to the target scores, and the position of each target reference sub-segment in the target reference sequence is determined as an exact matching position of the sequence read to be compared in the target reference sequence. In this way, invalid matching positions are filtered out, computing resources are saved, sequence alignment performance is improved, and the workload of subsequent sequence extension is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a method of sequence alignment disclosed herein;
FIG. 2 is a flow chart of a specific sequence alignment method disclosed herein;
FIG. 3 is a schematic diagram of the process of determining exact matching positions in sequence alignment disclosed in the present application;
FIG. 4 is a schematic structural diagram of a sequence alignment apparatus disclosed herein;
FIG. 5 is a block diagram of a sequence alignment device disclosed herein;
FIG. 6 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, a large number of invalid positions exist in gene sequence comparison, a large number of computing resources are wasted, and the performance of the whole sequence comparison is greatly reduced. In view of this, the present application provides a sequence alignment method, which can filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension.
Referring to FIG. 1, the present application discloses a sequence alignment method, which includes:
Step S11: segmenting the sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared.
In a specific implementation, in order to perform an accurate comparison, the sequence read to be compared must first be segmented to obtain the corresponding target seeds. A seed is a subsequence that is matched exactly during sequence alignment and is then extended. For example, if the sequence read to be compared contains 100 bases and every 20 bases form one segment, the first target seed covers bases 0 to 19, the second target seed covers bases 1 to 20, the third target seed covers bases 2 to 21, and so on.
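The sliding-window segmentation in this example can be sketched as follows. This is a minimal illustration only: the function name `extract_seeds`, its parameters, and the sample read are hypothetical and do not come from the application; the seed length of 20 and the step of 1 base are taken from the example above.

```python
from typing import List, Tuple


def extract_seeds(read: str, seed_len: int = 20, step: int = 1) -> List[Tuple[int, str]]:
    """Return (offset, seed) pairs for every window of seed_len bases in the read."""
    if seed_len <= 0 or step <= 0:
        raise ValueError("seed_len and step must be positive")
    # The last window starts at len(read) - seed_len; a read shorter than
    # seed_len yields no seeds at all.
    return [
        (start, read[start:start + seed_len])
        for start in range(0, len(read) - seed_len + 1, step)
    ]


read = "ACGT" * 25              # hypothetical 100-base read
seeds = extract_seeds(read)
print(len(seeds))               # 81 windows, at offsets 0 to 80
print(seeds[0][0], seeds[1][0], seeds[2][0])   # 0 1 2
```

Keeping the offset alongside each seed makes it possible to check later whether consecutive seeds hit consecutive positions on a reference sub-segment.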
Step S12: segmenting the pre-obtained target reference sequence to obtain reference sub-segments.
It will be appreciated that, after the sequence read to be compared has been segmented, the target reference sequence must also be segmented to obtain reference sub-segments for comparison with the target seeds. Because the target reference sequence is generally long, it is divided into segments for comparison, and the length of each reference sub-segment is generally greater than or equal to the length of the sequence read to be compared.
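The application does not state how the reference is cut, so the following is only one possible sketch. The helper `split_reference`, the fixed `segment_len`, and the optional `stride` (which allows overlapping windows) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def split_reference(reference: str, segment_len: int,
                    stride: Optional[int] = None) -> List[Tuple[int, str]]:
    """Cut the reference into (start, sub_segment) pairs.

    stride defaults to segment_len, giving non-overlapping windows; a smaller
    stride gives overlapping windows. Both choices are assumptions, since the
    source only says sub-segments are at least as long as the read.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    stride = segment_len if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    # The final window may be shorter than segment_len (the tail of the reference).
    return [
        (start, reference[start:start + segment_len])
        for start in range(0, len(reference), stride)
    ]


reference = "ACGTTGCA" * 50     # hypothetical 400-base reference
segments = split_reference(reference, segment_len=150)
print([start for start, _ in segments])   # [0, 150, 300]
```

With non-overlapping windows a read that straddles a window boundary cannot fit inside a single sub-segment, which is why the sketch also accepts a smaller stride.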
Step S13: comparing all the target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score.
After the reference sub-segments are obtained, all the target seeds in the sequence read to be compared are compared with the reference sub-segments according to a preset rule to obtain target scores. Specifically, all target seeds are compared with each reference sub-segment in turn, giving one target score per reference sub-segment. For example, if the sequence read to be compared yields 2 target seeds and the target reference sequence yields 2 reference sub-segments, the first target seed and then the second target seed are compared with the first reference sub-segment according to the preset rule to obtain the target score for the first reference sub-segment; the first target seed and then the second target seed are then compared with the second reference sub-segment to obtain the target score for the second reference sub-segment.
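The order of comparisons in the two-seed, two-sub-segment example can be sketched as a nested loop. The application does not define how a hit is located, so this sketch assumes that a seed "hits" when it occurs as an exact substring and records the first such offset; `locate_seeds`, `compare_all`, and the sample strings are hypothetical.

```python
from typing import List, Optional


def locate_seeds(seeds: List[str], segment: str) -> List[Optional[int]]:
    """For each seed, the first exact-match offset in the segment, or None on a miss."""
    hits: List[Optional[int]] = []
    for seed in seeds:
        pos = segment.find(seed)          # str.find returns -1 when absent
        hits.append(pos if pos >= 0 else None)
    return hits


def compare_all(seeds: List[str], segments: List[str]) -> List[List[Optional[int]]]:
    """Compare every seed with one sub-segment before moving on to the next."""
    return [locate_seeds(seeds, segment) for segment in segments]


seeds = ["ACG", "CGT"]                    # two seeds of a hypothetical read "ACGT"
segments = ["TTACGTTT", "ACGAAAAA"]       # two hypothetical reference sub-segments
hits = compare_all(seeds, segments)
print(hits)   # [[2, 3], [0, None]]
```

Each inner list is the raw material from which one target score per reference sub-segment would be computed under whichever preset rule is in use.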
Step S14: screening out target reference sub-segments from the reference sub-segments according to the target scores.
In a specific implementation process, after the target score is obtained, a target reference sub-segment needs to be screened from the reference sub-segments according to the target score, so as to determine an exact match of the sequence read to be aligned in the target reference sequence.
In a first specific embodiment, screening a target reference sub-segment from the reference sub-segments according to the target score includes: judging whether the target score is greater than or equal to a preset score threshold; and if so, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
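A minimal sketch of this threshold test is given below. The scores are the five values quoted in the FIG. 3 example of this description; the function name `screen_segments` and the threshold of 80 are hypothetical, since the application does not give a threshold value.

```python
from typing import List


def screen_segments(scores: List[float], threshold: float) -> List[int]:
    """Return the indices of reference sub-segments whose score is >= threshold."""
    return [index for index, score in enumerate(scores) if score >= threshold]


scores = [100, 86, 93, 65, 58]                  # target scores from the FIG. 3 example
kept = screen_segments(scores, threshold=80)    # the threshold of 80 is an assumption
print(kept)   # [0, 1, 2]
```

Only the retained indices would go on to the extension stage, which is where the saving in computing resources comes from.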
In a second specific embodiment, the screening a target reference sub-segment from the reference sub-segments according to the target score includes: normalizing the target score; judging whether the normalized target score is greater than or equal to a preset score threshold value or not; and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
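The application does not say how the target score is normalized. One plausible reading, shown here purely as an assumption, is to divide each score by the highest score attainable, so that the threshold becomes a fraction between 0 and 1. The names `normalize_scores` and `screen_normalized` and the threshold of 0.9 are hypothetical.

```python
from typing import List


def normalize_scores(scores: List[float], max_score: float) -> List[float]:
    """Scale each score by the maximum attainable score (assumed normalization)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return [score / max_score for score in scores]


def screen_normalized(scores: List[float], max_score: float, threshold: float) -> List[int]:
    """Indices of sub-segments whose normalized score is >= threshold."""
    normalized = normalize_scores(scores, max_score)
    return [index for index, value in enumerate(normalized) if value >= threshold]


scores = [100, 86, 93, 65, 58]     # target scores from the FIG. 3 example
kept = screen_normalized(scores, max_score=100, threshold=0.9)
print(kept)   # [0, 2]
```

A normalized threshold of this kind would not need retuning when reads yield different numbers of seeds, although the source does not state that this is the motivation.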
Step S15: determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
It will be appreciated that after the target reference sub-segment is determined, the position of the target reference sub-segment in the target reference sequence may be determined as the exact matching position of the sequence read to be aligned in the target reference sequence.
It can be seen that, by scoring each reference sub-segment against all target seeds of the sequence read to be compared and retaining only the reference sub-segments selected according to the target score, this embodiment filters out invalid matching positions, saves computing resources, improves sequence alignment performance, and reduces the workload of subsequent sequence extension.
Referring to FIG. 2, the embodiment of the present application discloses a specific sequence alignment method, which comprises:
Step S21: segmenting the sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared.
Step S22: segmenting the pre-obtained target reference sequence to obtain reference sub-segments.
Step S23: comparing all the target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score.
In a specific implementation, comparing the target seeds with the reference sub-segments according to a preset rule to obtain a target score includes: comparing all the target seeds in the sequence read to be compared with each reference sub-segment in turn according to the preset rule, to obtain a target score for each reference sub-segment.
In a first specific embodiment, comparing all target seeds in the sequence read to be compared with any one reference sub-segment according to a preset rule to obtain a target score includes: initializing the target score; comparing all target seeds in the sequence read to be compared with the reference sub-segment, and judging whether a target condition occurs in the comparison process; and, if the target condition occurs, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment. Judging whether the target condition occurs includes: judging whether all target seeds in the sequence read to be compared hit on the reference sub-segment but the hit positions on the reference sub-segment are discontinuous; and/or judging whether a first target seed among all the target seeds hits on the reference sub-segment while a second target seed among all the target seeds does not. Here the first target seeds and the second target seeds together make up all target seeds in the sequence read to be compared. The latter condition covers two cases: the first target seeds hit on the reference sub-segment at continuous positions while the second target seeds miss; and the first target seeds hit on the reference sub-segment at discontinuous positions while the second target seeds miss.
Specifically, the target score is first initialized, that is, the same initial target score is used in every case, on the assumption that all target seeds hit on the corresponding reference sub-segment and that the hit positions on that reference sub-segment are continuous. All target seeds in the sequence read to be compared are then compared with the reference sub-segment. If all target seeds hit on the reference sub-segment continuously, the initialized target score is taken as the final target score. If all target seeds hit but the hit positions are discontinuous, or if some of the target seeds hit on the reference sub-segment and others do not, the corresponding penalty score is subtracted. For example, when all target seeds hit on the reference sub-segment but the hit positions are discontinuous, 7 points are subtracted from the initialized target score for each extra position that appears among the hit positions. When some target seeds hit on the reference sub-segment and others do not, 7 points are subtracted from the initialized target score for each target seed that misses.
Referring to FIG. 3, a schematic diagram of the process of determining exact matching positions in sequence alignment is shown. First, the sequence read to be compared (read) is segmented to obtain the target seeds, shown as 1 to 8 in the figure, and the target reference sequence (Reference) is segmented to obtain reference sub-segments R_segment 1 to R_segment n. The initialized target score, corresponding to all target seeds hitting continuously on a reference sub-segment, is set to 100. Target seeds 0-8 hit on R_segment 1 at continuous positions, so its target score is 100. All target seeds hit on R_segment 2, but two extra positions appear among the hit positions, so two penalties of 7 points, 14 points in total, are deducted, giving a target score of 86. One target seed misses on R_segment 3, giving a target score of 93. Likewise, R_segment n-1 obtains a target score of 65 and R_segment n obtains a target score of 58.
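The penalty arithmetic of this example can be sketched as follows. The initial score of 100 and the penalty of 7 come from the example; everything else is an assumption made for illustration: the function name `penalty_score`, the representation of hits as a list of positions with `None` for a miss, a step of one base between consecutive seeds, and the decision to ignore hits that land earlier than expected.

```python
from typing import List, Optional


def penalty_score(hits: List[Optional[int]], init_score: int = 100,
                  penalty: int = 7, step: int = 1) -> int:
    """Start from init_score and deduct a penalty per extra position or missed seed."""
    score = init_score
    prev_index: Optional[int] = None
    prev_pos: Optional[int] = None
    for index, pos in enumerate(hits):
        if pos is None:
            score -= penalty                     # one deduction per missed seed
            continue
        if prev_index is not None and prev_pos is not None:
            # Where this seed would land if the hits were perfectly continuous.
            expected = prev_pos + step * (index - prev_index)
            extra = pos - expected
            if extra > 0:                        # assumption: only forward gaps count
                score -= penalty * extra         # one deduction per extra position
        prev_index, prev_pos = index, pos
    return score


continuous = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_extra = [0, 1, 2, 3, 6, 7, 8, 9, 10]         # two extra positions after the 4th seed
one_miss = [0, 1, 2, None, 4, 5, 6, 7, 8]        # the 4th seed does not hit
print(penalty_score(continuous), penalty_score(two_extra), penalty_score(one_miss))
# 100 86 93
```

The three results match the scores given for R_segment 1, R_segment 2, and R_segment 3 in the example.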
In a second specific embodiment, comparing all target seeds in the sequence read to be compared with any one reference sub-segment according to a preset rule to obtain a target score includes: initializing the target score; comparing all target seeds in the sequence read to be compared with the reference sub-segment; and, if a target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
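A reward-based variant might look like the sketch below. The application gives neither the initial value nor the size of the reward, so the defaults of 0 and 1, the function name `reward_score`, and the hit-list representation (`None` for a miss) are all assumptions.

```python
from typing import List, Optional


def reward_score(hits: List[Optional[int]], init_score: int = 0, reward: int = 1) -> int:
    """Start from init_score and add a reward for every seed that hits."""
    score = init_score
    for pos in hits:
        if pos is not None:        # this seed was found on the reference sub-segment
            score += reward
    return score


hits = [2, 3, None, 5]             # hypothetical: three of four seeds hit
print(reward_score(hits))          # 3
print(reward_score(hits, reward=10))   # 30
```

Because this score grows with the number of seeds, it pairs naturally with a normalized threshold when reads differ in length.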
Step S24: judging whether the target score is greater than or equal to a preset score threshold.
After the target score is obtained, determining a target reference sub-segment from the reference sub-segments according to the target score. Specifically, it may be determined whether the target score is greater than or equal to a preset score threshold; and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
Step S25: if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
Step S26: determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
Referring to FIG. 4, the present application discloses a sequence alignment apparatus, including:
the first segmentation module 11 is configured to segment the to-be-aligned sequence read to obtain a target seed corresponding to the to-be-aligned sequence read;
a second segmentation module 12, configured to segment a pre-obtained target reference sequence to obtain a reference sub-segment;
a comparison module 13, configured to compare all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule, so as to obtain a target score;
a fragment screening module 14, configured to screen a target reference sub-fragment from the reference sub-fragments according to the target score;
a position determining module 15, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be aligned in the target reference sequence.
It can be seen that, with these modules, the apparatus segments the sequence read to be compared and the target reference sequence, scores each reference sub-segment against all target seeds according to the preset rule, and retains only the reference sub-segments selected according to the target score. Invalid matching positions are thereby filtered out, computing resources are saved, sequence alignment performance is improved, and the workload of subsequent sequence extension is reduced.
Further, referring to FIG. 5, the present application further discloses a sequence alignment device, including: a processor 21 and a memory 22.
Wherein the memory 22 is used for storing a computer program; the processor 21 is configured to execute the computer program to implement the sequence alignment method disclosed in the foregoing embodiments.
For the specific process of the sequence comparison method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
Further, referring to FIG. 6, a schematic structural diagram of an electronic device 20 provided in the embodiment of the present application is shown, where the electronic device 20 may specifically include, but is not limited to, a tablet computer, a notebook computer, or a desktop computer.
In general, the electronic device 20 in the present embodiment includes: a processor 21 and a memory 22.
The processor 21 may include a main processor and a coprocessor. The main processor, also referred to as a Central Processing Unit (CPU), processes data in the wake-up state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the content to be displayed on a display screen. In some embodiments, the processor 21 may further include an AI (artificial intelligence) processor for handling computing operations related to machine learning.
The memory 22 may include one or more computer-readable storage media, which may be non-transitory. The memory 22 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 22 is used to store at least the following computer program 221, which is loaded and executed by the processor 21 to implement the steps of the sequence alignment method disclosed in any of the foregoing embodiments.
In some embodiments, the electronic device 20 may further include a display 23, an input/output interface 24, a communication interface 25, a sensor 26, a power supply 27, and a communication bus 28.
Those skilled in the art will appreciate that the configuration shown in FIG. 6 does not limit the electronic device 20, which may include more or fewer components than those shown.
Further, an embodiment of the present application also discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared; segmenting a target reference sequence obtained in advance to obtain a reference sub-segment; comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score; screening a target reference sub-segment from the reference sub-segments according to the target score; and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
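The five steps above can be sketched end to end as follows. This is only an illustrative sketch, not the implementation of the present application: the fixed-length, non-overlapping segmentation, the helper names `split_fixed`, `score_sub_segment`, and `candidate_positions`, the seed-counting score used as the preset rule, and all parameter values are assumptions made for the example.

```python
from typing import List, Tuple


def split_fixed(sequence: str, length: int) -> List[Tuple[int, str]]:
    """Cut a sequence into consecutive, non-overlapping pieces of `length`.

    Returns (start_offset, piece) pairs; a shorter trailing piece is dropped.
    The segmentation scheme is an assumption of this sketch.
    """
    return [
        (start, sequence[start:start + length])
        for start in range(0, len(sequence) - length + 1, length)
    ]


def score_sub_segment(seeds: List[str], sub_segment: str) -> int:
    """Placeholder preset rule: count how many seeds occur in the sub-segment."""
    return sum(1 for seed in seeds if seed in sub_segment)


def candidate_positions(
    read: str, reference: str, seed_len: int, sub_len: int, threshold: int
) -> List[int]:
    """Return start offsets of reference sub-segments whose score passes the threshold."""
    # Step 1: segment the read into target seeds.
    seeds = [seed for _, seed in split_fixed(read, seed_len)]
    positions = []
    # Step 2: segment the reference into reference sub-segments.
    for start, sub_segment in split_fixed(reference, sub_len):
        # Steps 3 and 4: score the sub-segment and screen it by the score.
        if score_sub_segment(seeds, sub_segment) >= threshold:
            # Step 5: the sub-segment's offset is reported as a matching position.
            positions.append(start)
    return positions


reference = "TTTTTTTT" + "ACGTACGG" + "CCCCCCCC"
print(candidate_positions("ACGTACGG", reference, seed_len=4, sub_len=8, threshold=2))
```

With these toy inputs only the middle sub-segment contains both seeds, so only its offset is kept. Non-overlapping sub-segments would miss a read that straddles a boundary; a practical segmentation would presumably need to handle that case.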
It can be seen that, as in the method embodiments, the sequence read to be compared and the pre-obtained target reference sequence are each segmented to obtain target seeds and reference sub-segments, all target seeds in the read are compared with the reference sub-segments according to a preset rule to obtain a target score, target reference sub-segments are screened out according to the target score, and their positions in the target reference sequence are determined as exact matching positions of the read. Invalid matching positions can thus be filtered out, which saves computing resources, improves sequence alignment performance, and reduces the workload of the subsequent sequence extension.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: comparing all the target seeds in the sequence read to be compared with any one of the reference sub-segments according to a preset rule to obtain a target score.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all target seeds in the sequence read to be compared with the reference sub-segment, and judging whether a target condition occurs during the comparison; and if the target condition occurs during the comparison, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: judging whether, during the comparison, all target seeds in the sequence read to be compared hit the reference sub-segment but their hit positions on the reference sub-segment are discontinuous; and/or judging whether, during the comparison, a first target seed among the target seeds in the sequence read to be compared hits the reference sub-segment while a second target seed among them does not.
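The penalty-based rule and its two target conditions can be illustrated with the following sketch. It is not the implementation of the present application: the function names, the initial score of 100, the penalty values, the use of exact substring search to decide whether a seed hits, and the choice to charge one penalty per missing seed (including when no seed hits at all) are all assumptions made for the example.

```python
from typing import List, Optional


def find_hits(seeds: List[str], sub_segment: str) -> List[Optional[int]]:
    """Return the first hit offset of each seed in the sub-segment, or None if it misses."""
    hits: List[Optional[int]] = []
    for seed in seeds:
        pos = sub_segment.find(seed)  # exact match is an assumption of this sketch
        hits.append(pos if pos >= 0 else None)
    return hits


def penalty_score(
    seeds: List[str],
    sub_segment: str,
    initial: int = 100,      # hypothetical starting score
    gap_penalty: int = 10,   # hypothetical penalty for discontinuous hits
    miss_penalty: int = 20,  # hypothetical penalty per seed that does not hit
) -> int:
    """Start from `initial` and subtract a penalty when a target condition occurs."""
    hits = find_hits(seeds, sub_segment)
    score = initial
    missed = sum(1 for hit in hits if hit is None)
    if missed == 0:
        # Condition 1: every seed hits, but the hits are not back to back.
        contiguous = all(
            hits[i + 1] == hits[i] + len(seeds[i]) for i in range(len(hits) - 1)
        )
        if not contiguous:
            score -= gap_penalty
    else:
        # Condition 2: at least one seed does not hit the sub-segment.
        score -= miss_penalty * missed
    return score


seeds = ["ACGT", "ACGG"]
print(penalty_score(seeds, "ACGTACGG"))    # both seeds hit, back to back
print(penalty_score(seeds, "ACGTTTACGG"))  # both seeds hit, with a gap between them
print(penalty_score(seeds, "ACGTTTTT"))    # the second seed does not hit
```

Because `str.find` reports only the first occurrence of a seed, a sub-segment with repeated seeds could be judged discontinuous even when a contiguous placement exists; the sketch accepts that simplification.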
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all target seeds in the sequence read to be compared with the reference sub-segment; and if a target seed hits the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
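A reward-based variant of the rule might look like the sketch below. The function name `reward_score`, the initial score of 0, the reward of 1 per hit, and exact substring matching as the hit test are assumptions made for illustration only.

```python
from typing import List


def reward_score(
    seeds: List[str],
    sub_segment: str,
    initial: int = 0,  # hypothetical starting score
    reward: int = 1,   # hypothetical reward per seed that hits
) -> int:
    """Start from `initial` and add `reward` for every seed found in the sub-segment."""
    score = initial
    for seed in seeds:
        if seed in sub_segment:  # exact match is an assumption of this sketch
            score += reward
    return score


seeds = ["ACGT", "ACGG"]
print(reward_score(seeds, "ACGTACGG"))  # both seeds hit
print(reward_score(seeds, "ACGTTTTT"))  # only the first seed hits
```

Under these assumptions the score is bounded by `initial + reward * len(seeds)`, which gives a natural maximum if the score is later normalized.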
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: judging whether the target score is greater than or equal to a preset score threshold value; and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: normalizing the target score; judging whether the normalized target score is greater than or equal to a preset score threshold value or not; and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
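The screening step with normalization can be sketched as follows. The text does not state how the target score is normalized, so dividing by a maximum achievable score to map it into the range 0 to 1 is an assumption of this example, as are the function name `screen_by_normalized_score` and the sample values.

```python
from typing import Dict, List


def screen_by_normalized_score(
    scores: Dict[int, float], max_score: float, threshold: float
) -> List[int]:
    """Keep the sub-segment positions whose normalized score reaches the threshold.

    `scores` maps the start offset of each reference sub-segment to its raw
    target score; `max_score` is the largest score the rule can produce.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return sorted(
        position
        for position, raw in scores.items()
        if raw / max_score >= threshold  # ">=" mirrors "greater than or equal to"
    )


raw_scores = {0: 1, 8: 4, 16: 3}
print(screen_by_normalized_score(raw_scores, max_score=4, threshold=0.75))
```

Normalizing before thresholding lets one preset threshold be reused for reads that yield different numbers of seeds; without normalization the same comparison would be applied to the raw score directly.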
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), internal memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The sequence alignment method, device, apparatus, and medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the above embodiments is only intended to help in understanding the method and core ideas of the present application. Meanwhile, a person skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. A method of sequence alignment, comprising:
segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
segmenting a target reference sequence obtained in advance to obtain a reference sub-segment;
comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
screening a target reference sub-segment from the reference sub-segments according to the target score;
and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
2. The sequence alignment method of claim 1, wherein the aligning all target seeds in the sequence read to be aligned with the reference sub-segments according to a preset rule to obtain a target score comprises:
comparing all the target seeds in the sequence read to be compared with any one of the reference sub-segments according to a preset rule to obtain a target score.
3. The sequence alignment method of claim 2, wherein the step of comparing all target seeds in the sequence read to be aligned with any reference sub-segment according to a preset rule to obtain a target score comprises:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment, and judging whether a target condition occurs in the comparison process;
and if the target condition occurs in the comparison process, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
4. The method of claim 3, wherein the determining whether the target condition occurs during the alignment process comprises:
judging whether, in the comparison process, all target seeds in the sequence read to be compared hit the reference sub-segment but their hit positions on the reference sub-segment are discontinuous;
and/or judging whether, in the comparison process, a first target seed among the target seeds in the sequence read to be compared hits the reference sub-segment while a second target seed among them does not.
5. The sequence alignment method of claim 2, wherein the step of comparing all target seeds in the sequence read to be aligned with any reference sub-segment according to a preset rule to obtain a target score comprises:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment;
and if a target seed hits the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
6. The method of sequence alignment according to claim 1, wherein said screening said reference sub-segments for a target reference sub-segment according to said target score comprises:
judging whether the target score is greater than or equal to a preset score threshold value;
and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
7. The method of sequence alignment according to claim 1, wherein said screening said reference sub-segments for a target reference sub-segment according to said target score comprises:
normalizing the target score;
judging whether the normalized target score is greater than or equal to a preset score threshold value or not;
and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
8. A sequence alignment apparatus, comprising:
a first segmentation module, configured to segment a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
a second segmentation module, configured to segment a target reference sequence obtained in advance to obtain a reference sub-segment;
a comparison module, configured to compare all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
a segment screening module, configured to screen a target reference sub-segment from the reference sub-segments according to the target score;
a position determining module, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be compared in the target reference sequence.
9. A sequence alignment apparatus, comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor for executing the computer program to implement the sequence alignment method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of sequence alignment of any of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010130211.2A CN111402956A (en) | 2020-02-28 | 2020-02-28 | Sequence comparison method, device, equipment and medium |
PCT/CN2020/126350 WO2021169387A1 (en) | 2020-02-28 | 2020-11-04 | Sequence alignment method, apparatus and device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010130211.2A CN111402956A (en) | 2020-02-28 | 2020-02-28 | Sequence comparison method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111402956A true CN111402956A (en) | 2020-07-10 |
Family
ID=71430385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010130211.2A Withdrawn CN111402956A (en) | 2020-02-28 | 2020-02-28 | Sequence comparison method, device, equipment and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111402956A (en) |
WO (1) | WO2021169387A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021169387A1 (en) * | 2020-02-28 | 2021-09-02 | 苏州浪潮智能科技有限公司 | Sequence alignment method, apparatus and device, and medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101508817B1 (en) * | 2012-10-29 | 2015-04-08 | 삼성에스디에스 주식회사 | System and method for aligning genome sequence |
CN106682393B (en) * | 2016-11-29 | 2019-05-17 | 北京荣之联科技股份有限公司 | Genome sequence comparison method and device |
CN108985008B (en) * | 2018-06-29 | 2022-03-08 | 郑州云海信息技术有限公司 | Method and system for rapidly comparing gene data |
CN109887547B (en) * | 2019-03-06 | 2020-10-02 | 苏州浪潮智能科技有限公司 | Gene sequence comparison filtering acceleration processing method, system and device |
CN110379461A (en) * | 2019-06-28 | 2019-10-25 | 苏州浪潮智能科技有限公司 | A kind of gene data comparison method, device, equipment and medium |
CN110517728B (en) * | 2019-08-29 | 2022-04-29 | 苏州浪潮智能科技有限公司 | Gene sequence comparison method and device |
CN110797085B (en) * | 2019-10-25 | 2022-07-08 | 浪潮(北京)电子信息产业有限公司 | Method, system, equipment and storage medium for inquiring gene data |
CN111402956A (en) * | 2020-02-28 | 2020-07-10 | 苏州浪潮智能科技有限公司 | Sequence comparison method, device, equipment and medium |
2020
- 2020-02-28: CN application CN202010130211.2A, published as CN111402956A (not active, Withdrawn)
- 2020-11-04: WO application PCT/CN2020/126350, published as WO2021169387A1 (active, Application Filing)
Also Published As
Publication number | Publication date |
---|---|
WO2021169387A1 (en) | 2021-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103946855B (en) | For detection towards the methods, devices and systems returning programming attack | |
US7987473B1 (en) | Accelerated class check | |
US9026739B2 (en) | Multimode prefetcher | |
US8516205B2 (en) | Method and apparatus for providing efficient context classification | |
US20090282225A1 (en) | Store queue | |
US8990486B2 (en) | Hardware and file system agnostic mechanism for achieving capsule support | |
US20110010506A1 (en) | Data prefetcher with multi-level table for predicting stride patterns | |
JP2000276381A (en) | Method for estimating task execution time | |
CN111402956A (en) | Sequence comparison method, device, equipment and medium | |
US20160092115A1 (en) | Implementing storage policies regarding use of memory regions | |
JP7262520B2 (en) | Methods, apparatus, apparatus and computer readable storage media for executing instructions | |
CN114168318A (en) | Training method of storage release model, storage release method and equipment | |
US20150370565A1 (en) | Data processing device and method, and processor unit of same | |
US20050268040A1 (en) | Cache system having branch target address cache | |
US9507725B2 (en) | Store forwarding for data caches | |
US20210165654A1 (en) | Eliminating execution of instructions that produce a constant result | |
US10862485B1 (en) | Lookup table index for a processor | |
CN111381881A (en) | AHB (advanced high-performance bus) interface-based low-power-consumption instruction caching method and device | |
CN114840258B (en) | Multi-level hybrid algorithm filtering type branch prediction method and prediction system | |
US20160283338A1 (en) | Boot operations in memory devices | |
AU2017438670B2 (en) | Simulation device, simulation method, and simulation program | |
CN113688785A (en) | Multi-supervision-based face recognition method and device, computer equipment and storage medium | |
US5903915A (en) | Cache detection using timing differences | |
JP7173308B2 (en) | DETECTION DEVICE, DETECTION METHOD AND DETECTION PROGRAM | |
US9342319B1 (en) | Accelerated class check |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20200710 |