CN111402956A - Sequence comparison method, device, equipment and medium - Google Patents

Sequence comparison method, device, equipment and medium

Info

Publication number
CN111402956A
Authority
CN
China
Prior art keywords
target
reference sub
sequence
segment
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010130211.2A
Other languages
Chinese (zh)
Inventor
尹云峰
任智新
金良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010130211.2A
Publication of CN111402956A
Priority to PCT/CN2020/126350 (WO2021169387A1)
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a sequence comparison method, device, equipment and medium. The method comprises the following steps: segmenting a sequence read to be compared to obtain the target seeds corresponding to that read; segmenting a target reference sequence obtained in advance to obtain reference sub-segments; comparing all target seeds in the read with the reference sub-segments according to a preset rule to obtain a target score; screening a target reference sub-segment from the reference sub-segments according to the target score; and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read in the target reference sequence. In this way, invalid matching positions are filtered out, computing resources are saved, sequence comparison performance is improved, and the workload of subsequent sequence extension is reduced.

Description

Sequence comparison method, device, equipment and medium
Technical Field
The present application relates to the field of genetic testing technologies, and in particular, to a method, an apparatus, a device, and a medium for sequence comparison.
Background
With the rapid development of gene detection technology, techniques that extract an individual's genes and compare gene sequences, in order to predict the likelihood of various diseases, pinpoint the genes responsible for an individual's pathological changes, and enable early prevention and treatment, have matured. The human genome currently comprises about 3 billion base pairs, and completing the gene sequence alignment for one person on a general-purpose software processing platform takes several days. Sequence alignment mainly comprises two stages: seed finding and extension. To improve alignment accuracy, as many positions as possible where the seeds of the read to be aligned occur in the reference sequence must be found. In the prior art, a large number of invalid positions are compared, which wastes a large amount of computing resources and greatly reduces the performance of the whole sequence alignment.
Disclosure of Invention
In view of this, an object of the present application is to provide a method, an apparatus, a device, and a medium for sequence alignment, which can filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension. The specific scheme is as follows:
In a first aspect, the present application discloses a sequence alignment method, comprising:
segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
segmenting a target reference sequence obtained in advance to obtain a reference sub-segment;
comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
screening a target reference sub-segment from the reference sub-segments according to the target score;
and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
Optionally, comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score includes:
comparing all target seeds in the sequence read to be compared with each of the reference sub-segments, respectively, according to the preset rule to obtain a target score for each reference sub-segment.
Optionally, comparing all target seeds in the sequence read to be compared with any reference sub-segment according to a preset rule to obtain a target score, including:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment, and judging whether a target condition occurs in the comparison process;
and if the target condition occurs in the comparison process, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
Optionally, the judging whether the target condition occurs in the comparison process includes:
judging whether, in the comparison process, all target seeds in the sequence read to be compared hit on the reference sub-segment but the hit positions on the reference sub-segment are discontinuous;
and/or judging whether, in the comparison process, a first target seed among the target seeds in the sequence read to be compared hits on the reference sub-segment while a second target seed among them does not hit on the reference sub-segment.
Optionally, comparing all target seeds in the sequence read to be compared with any reference sub-segment according to a preset rule to obtain a target score, including:
initializing the target score;
comparing all target seeds in the sequence read to be compared with the reference sub-segment;
and if the target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
Optionally, the screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
judging whether the target score is greater than or equal to a preset score threshold value;
and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
Optionally, the screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
normalizing the target score;
judging whether the normalized target score is greater than or equal to a preset score threshold value or not;
and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
In a second aspect, the present application discloses a sequence alignment apparatus comprising:
the first segmentation module is used for segmenting a sequence read to be compared to obtain a target seed corresponding to the sequence read to be compared;
the second segmentation module is used for segmenting a target reference sequence obtained in advance to obtain a reference sub-segment;
the comparison module is used for comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score;
the segment screening module is used for screening a target reference sub-segment from the reference sub-segments according to the target score;
a position determining module, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be aligned in the target reference sequence.
In a third aspect, the present application discloses a sequence alignment device comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is used for executing the computer program to realize the sequence alignment method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the sequence alignment method disclosed above.
It can be seen that, according to the present application, the sequence read to be compared is first segmented to obtain its target seeds, and the pre-obtained target reference sequence is segmented to obtain reference sub-segments. All target seeds in the read are then compared with the reference sub-segments according to a preset rule to obtain a target score, a target reference sub-segment is screened out from the reference sub-segments according to that score, and the position of the target reference sub-segment in the target reference sequence is determined as the exact matching position of the read in the target reference sequence. In this way, invalid matching positions are filtered out, computing resources are saved, sequence comparison performance is improved, and the workload of subsequent sequence extension is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a method of sequence alignment disclosed herein;
FIG. 2 is a flow chart of a specific sequence alignment method disclosed herein;
FIG. 3 is a schematic diagram of the process of determining exact matching positions in sequence alignment disclosed herein;
FIG. 4 is a schematic diagram of a sequence alignment apparatus disclosed herein;
FIG. 5 is a block diagram of a sequence alignment device disclosed herein;
FIG. 6 is a block diagram of an electronic device disclosed herein.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, a large number of invalid positions exist in gene sequence comparison, a large number of computing resources are wasted, and the performance of the whole sequence comparison is greatly reduced. In view of this, the present application provides a sequence alignment method, which can filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension.
Referring to fig. 1, the present application discloses a sequence alignment method, which includes:
Step S11: segmenting the sequence read to be compared to obtain the target seeds corresponding to that read.
In a specific implementation, in order to perform an accurate comparison, the sequence read to be compared must first be segmented to obtain its target seeds. A seed is a short sequence that matches exactly in the sequence alignment and is waiting to be extended. For example, if the read to be aligned contains 100 bases and every 20 bases form one target seed, the first target seed covers bases 0 to 19, the second covers bases 1 to 20, the third covers bases 2 to 21, and so on.
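The sliding-window segmentation in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_seeds` and the `step` parameter are assumptions introduced here, and the step of one base is inferred from the example (seeds starting at bases 0, 1, 2, ...).

```python
from typing import List, Tuple


def split_into_seeds(read: str, seed_len: int = 20, step: int = 1) -> List[Tuple[int, str]]:
    """Return (start_offset, seed) pairs taken with a sliding window over the read."""
    if seed_len <= 0 or step <= 0:
        raise ValueError("seed_len and step must be positive")
    # The last window starts where a full seed_len bases still fit in the read.
    return [(i, read[i:i + seed_len]) for i in range(0, len(read) - seed_len + 1, step)]


# A hypothetical 100-base read, used only for illustration.
read = "ACGT" * 25
seeds = split_into_seeds(read)
print(len(seeds))                     # number of 20-base seeds in a 100-base read
print(seeds[1][0], seeds[1][0] + 19)  # first and last base index of the second seed
```

With a 100-base read and 20-base seeds, this produces 81 overlapping seeds, and the second seed spans bases 1 to 20 as in the example.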
Step S12: segmenting the pre-obtained target reference sequence to obtain reference sub-segments.
After the sequence read to be aligned has been segmented, the target reference sequence must also be segmented, because it is generally much longer than the read, to obtain reference sub-segments against which the target seeds are compared. The length of a reference sub-segment is generally greater than or equal to the length of the read to be aligned.
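One possible way to segment the reference is sketched below. The source does not say whether reference sub-segments overlap, so the window length `seg_len`, the stride `step`, and the function name `split_reference` are all assumptions made for illustration. Choosing a stride smaller than the window length makes neighbouring sub-segments overlap, so a read that straddles a boundary is still fully contained in some sub-segment.

```python
from typing import List, Tuple


def split_reference(reference: str, seg_len: int, step: int) -> List[Tuple[int, str]]:
    """Return (start_offset, sub_segment) pairs covering the whole reference."""
    if seg_len <= 0 or step <= 0:
        raise ValueError("seg_len and step must be positive")
    segments = []
    start = 0
    while start < len(reference):
        segments.append((start, reference[start:start + seg_len]))
        # Stop once the current window reaches the end of the reference.
        if start + seg_len >= len(reference):
            break
        start += step
    return segments


# Hypothetical 400-base reference, 150-base sub-segments, 100-base stride.
reference = "ACGTTGCA" * 50
segments = split_reference(reference, seg_len=150, step=100)
print([start for start, _ in segments])
```

Keeping the start offset with each sub-segment is what later allows a selected sub-segment to be mapped back to a position in the target reference sequence.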
Step S13: comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score.
After the reference sub-segments are obtained, all target seeds in the read are compared with each reference sub-segment, respectively, according to the preset rule, so that a target score is obtained for each reference sub-segment. For example, if the read yields 2 target seeds and the target reference sequence yields 2 reference sub-segments, the first target seed and then the second target seed are compared with the first reference sub-segment according to the preset rule to obtain the target score of the first reference sub-segment; the first target seed and then the second target seed are then compared with the second reference sub-segment to obtain the target score of the second reference sub-segment.
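The two-seed, two-sub-segment example can be sketched as a nested loop. The source does not specify how a hit is detected, so this sketch assumes an exact substring search with Python's `str.find`; the helper name `seed_hits` and the toy sequences are hypothetical.

```python
from typing import List, Optional


def seed_hits(seeds: List[str], segment: str) -> List[Optional[int]]:
    """For each seed, return the first offset at which it occurs in the segment, or None."""
    hits: List[Optional[int]] = []
    for seed in seeds:
        pos = segment.find(seed)  # str.find returns -1 when the seed is absent
        hits.append(pos if pos != -1 else None)
    return hits


# Two toy seeds and two toy reference sub-segments, for illustration only.
seeds = ["ACGT", "CGTA"]
segments = ["TTACGTAGG", "GGGACGTCC"]

# Every seed is compared with the first sub-segment, then with the second.
hit_table = [seed_hits(seeds, segment) for segment in segments]
print(hit_table)
```

Each row of `hit_table` belongs to one reference sub-segment and is the raw material from which that sub-segment's target score would be computed.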
Step S14: screening a target reference sub-segment from the reference sub-segments according to the target score.
In a specific implementation, after the target score is obtained, a target reference sub-segment is screened from the reference sub-segments according to that score, so as to determine the exact matching position of the read to be aligned in the target reference sequence.
In a first specific embodiment, screening a target reference sub-segment from the reference sub-segments according to the target score includes: judging whether the target score is greater than or equal to a preset score threshold; and, if so, determining the reference sub-segment corresponding to that target score as a target reference sub-segment.
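A minimal sketch of this threshold test follows. The score values and the threshold of 90 are illustrative assumptions, since the source does not give a concrete threshold; `screen_segments` is a hypothetical name.

```python
from typing import Dict, List


def screen_segments(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the reference sub-segments whose target score is >= the preset threshold."""
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical target scores per reference sub-segment.
scores = {"R_segment 1": 100, "R_segment 2": 86, "R_segment 3": 93}
targets = screen_segments(scores, threshold=90)
print(targets)
```

Note that the comparison is inclusive: a sub-segment whose score equals the threshold is kept.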
In a second specific embodiment, the screening a target reference sub-segment from the reference sub-segments according to the target score includes: normalizing the target score; judging whether the normalized target score is greater than or equal to a preset score threshold value or not; and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
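The normalized variant can be sketched in the same way. The source does not state how the target score is normalized; this sketch assumes division by the maximum attainable score (`max_score`), which maps scores into the range 0 to 1. The function name and all numeric values are illustrative.

```python
from typing import Dict, List


def screen_normalized(scores: Dict[str, float], max_score: float, threshold: float) -> List[str]:
    """Normalize each score by max_score (an assumed scheme) and keep those >= threshold."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    selected = []
    for name, score in scores.items():
        normalized = score / max_score
        if normalized >= threshold:
            selected.append(name)
    return selected


# Hypothetical scores; the threshold is now expressed as a fraction of the maximum.
scores = {"R_segment 1": 100, "R_segment 2": 86, "R_segment 3": 93}
print(screen_normalized(scores, max_score=100, threshold=0.9))
```

A normalized threshold has the practical advantage that it does not need to be re-tuned when the initial score or the number of seeds changes.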
Step S15: determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
Once the target reference sub-segment has been determined, its position in the target reference sequence can be taken as the exact matching position of the read to be aligned in the target reference sequence.
It can be seen that, in this embodiment, both the sequence read to be compared and the pre-obtained target reference sequence are segmented, all target seeds in the read are compared with the reference sub-segments according to a preset rule to obtain a target score, and only the reference sub-segments screened out by that score are taken as exact matching positions of the read in the target reference sequence. Invalid matching positions are thereby filtered out, computing resources are saved, sequence comparison performance is improved, and the workload of subsequent sequence extension is reduced.
Referring to FIG. 2, an embodiment of the present application discloses a specific sequence alignment method, which includes:
Step S21: segmenting the sequence read to be compared to obtain the target seeds corresponding to that read.
Step S22: segmenting the pre-obtained target reference sequence to obtain reference sub-segments.
Step S23: comparing all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule to obtain a target score.
In a specific implementation, this comparison includes comparing all target seeds in the read with each reference sub-segment, respectively, according to the preset rule to obtain a target score for each reference sub-segment.
In a first specific embodiment, comparing all target seeds in the sequence read to be compared with any one reference sub-segment according to the preset rule to obtain a target score includes: initializing the target score; comparing all target seeds in the read with the reference sub-segment and judging whether a target condition occurs during the comparison; and, if the target condition occurs, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
Judging whether the target condition occurs during the comparison includes: judging whether all target seeds in the read hit on the reference sub-segment but the hit positions on the reference sub-segment are discontinuous; and/or judging whether a first target seed among the target seeds hits on the reference sub-segment while a second target seed among them does not. Here the first target seeds and the second target seeds together make up all target seeds in the read. The second condition covers two cases: the first target seeds hit on the reference sub-segment at continuous positions while the second target seeds do not hit; or the first target seeds hit on the reference sub-segment at discontinuous positions while the second target seeds do not hit.
Specifically, initializing the target score means assigning the same initial target score on the assumption that all target seeds hit on the corresponding reference sub-segment and that the hit positions on that sub-segment are continuous. All target seeds in the read are then compared with the reference sub-segment. If all target seeds hit on the reference sub-segment continuously, the initialized target score is taken as the final target score. If all target seeds hit but the hit positions are discontinuous, or if some of the target seeds hit and others do not, the corresponding penalty score is subtracted. For example, when all target seeds hit but the hit positions are discontinuous, 7 points are subtracted from the initialized target score for each extra position that appears among the hit positions. When some target seeds hit and others do not, 7 points are subtracted from the initialized target score for each target seed that does not hit.
FIG. 3 shows the process of determining exact matching positions in sequence alignment. First, the sequence read to be aligned (read) is segmented to obtain the target seeds, shown as 1 to 8 in the figure, and the target reference sequence (Reference) is segmented to obtain the reference sub-segments R_segment 1 to R_segment n. The initialized target score, which corresponds to all target seeds in the read hitting continuously on a reference sub-segment, is set to 100. Target seeds 0-8 hit on R_segment 1 at continuous positions, giving a target score of 100. All target seeds hit on R_segment 2, but two extra positions appear among the hit positions, so two 7-point penalties (14 points) are deducted, giving a target score of 86. On R_segment 3 one target seed misses, giving a target score of 93. Likewise, R_segment n-1 obtains a target score of 65 and R_segment n obtains a target score of 58.
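The penalty scheme in this example can be sketched as follows, using the values given in the text (initial score 100, 7 points per penalty). How discontinuity is measured is not spelled out in the source, so this sketch makes two assumptions: consecutive seeds are expected to hit one base apart (matching the sliding-window seeds described earlier), and only surplus positions between neighbouring hits are penalized. The function name `penalty_score` and the hit offsets are hypothetical.

```python
from typing import List, Optional


def penalty_score(hits: List[Optional[int]], initial: int = 100,
                  penalty: int = 7, step: int = 1) -> int:
    """Start from `initial`; subtract `penalty` per missed seed and per extra position."""
    score = initial
    prev = None  # (seed index, hit offset) of the most recent seed that hit
    for idx, pos in enumerate(hits):
        if pos is None:
            score -= penalty  # this seed did not hit on the sub-segment
            continue
        if prev is not None:
            expected_gap = (idx - prev[0]) * step
            extra = (pos - prev[1]) - expected_gap
            if extra > 0:
                score -= penalty * extra  # surplus positions between two hits
        prev = (idx, pos)
    return score


# Hit offsets of nine seeds on three hypothetical reference sub-segments.
continuous = [10, 11, 12, 13, 14, 15, 16, 17, 18]
two_extra = [10, 11, 12, 15, 16, 17, 18, 19, 20]
one_miss = [10, 11, 12, None, 14, 15, 16, 17, 18]
print(penalty_score(continuous), penalty_score(two_extra), penalty_score(one_miss))
```

The three inputs correspond to the R_segment 1, R_segment 2 and R_segment 3 cases described above and yield 100, 86 and 93 respectively.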
In a second specific embodiment, comparing all target seeds in the sequence read to be compared with any one reference sub-segment according to the preset rule to obtain a target score includes: initializing the target score; comparing all target seeds in the read with the reference sub-segment; and, if a target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
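The reward-based variant can be sketched in a few lines. The source gives no numeric values for this embodiment, so the initial score of 0, the reward of 1 point per hit, and the exact-substring hit test are all assumptions; `reward_score` is a hypothetical name.

```python
from typing import List


def reward_score(seeds: List[str], segment: str, initial: int = 0, reward: int = 1) -> int:
    """Start from `initial` and add `reward` for every seed found in the segment."""
    score = initial
    for seed in seeds:
        if seed in segment:  # exact substring match is assumed to count as a hit
            score += reward
    return score


# Toy seeds and sub-segments, for illustration only.
seeds = ["ACGT", "CGTA", "GTAG"]
print(reward_score(seeds, "TTACGTAGG"))
print(reward_score(seeds, "GGGACGTCC"))
```

Compared with the penalty scheme, this variant rewards coverage only and does not look at whether the hit positions are continuous.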
Step S24: judging whether the target score is greater than or equal to a preset score threshold.
After the target score is obtained, the target reference sub-segment is determined from the reference sub-segments according to the target score, by checking the score against the preset score threshold.
Step S25: if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
Step S26: determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
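Steps S21 to S26 can be strung together in one compact sketch. It is an illustration under stated assumptions rather than the disclosed implementation: seeds slide by one base, reference sub-segments are fixed-length windows taken every `seg_step` bases, the score is a simple count of seeds that hit (one point each), and the function name `exact_match_positions` and all sequences are hypothetical.

```python
from typing import List


def exact_match_positions(read: str, reference: str, seed_len: int,
                          seg_len: int, seg_step: int, threshold: int) -> List[int]:
    """Return start offsets of reference sub-segments whose score reaches the threshold."""
    # S21: segment the read into overlapping seeds.
    seeds = [read[i:i + seed_len] for i in range(len(read) - seed_len + 1)]
    positions = []
    # S22: segment the reference into windows of seg_len bases every seg_step bases.
    for start in range(0, max(len(reference) - seg_len, 0) + 1, seg_step):
        segment = reference[start:start + seg_len]
        # S23: score the sub-segment, here one point per seed that hits.
        score = sum(1 for seed in seeds if seed in segment)
        # S24/S25: keep the sub-segment if its score reaches the threshold.
        if score >= threshold:
            # S26: report its position in the reference.
            positions.append(start)
    return positions


# A hypothetical 20-base read embedded in an 80-base reference.
read = "ACGTACGGTCAAGCTTGACC"
reference = "T" * 30 + read + "T" * 30
print(exact_match_positions(read, reference, seed_len=8,
                            seg_len=40, seg_step=20, threshold=13))
```

Here only the sub-segment starting at offset 20 contains all 13 seeds, so the two neighbouring sub-segments, which each contain just 3 seeds, are filtered out before any extension work would be done on them.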
Referring to fig. 4, the present application discloses a sequence alignment apparatus, including:
a first segmentation module 11, configured to segment the sequence read to be aligned to obtain the target seeds corresponding to that read;
a second segmentation module 12, configured to segment a pre-obtained target reference sequence to obtain reference sub-segments;
a comparison module 13, configured to compare all target seeds in the sequence read to be compared with the reference sub-segments according to a preset rule, so as to obtain a target score;
a segment screening module 14, configured to screen a target reference sub-segment from the reference sub-segments according to the target score;
a position determining module 15, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be aligned in the target reference sequence.
It can be seen that, with this apparatus, both the sequence read to be compared and the pre-obtained target reference sequence are segmented, all target seeds in the read are compared with the reference sub-segments according to a preset rule to obtain a target score, and only the reference sub-segments screened out by that score are taken as exact matching positions of the read in the target reference sequence. Invalid matching positions are thereby filtered out, computing resources are saved, sequence comparison performance is improved, and the workload of subsequent sequence extension is reduced.
Further, referring to FIG. 5, the present application also discloses a sequence alignment device, including: a processor 21 and a memory 22.
Wherein the memory 22 is used for storing a computer program; the processor 21 is configured to execute the computer program to implement the sequence alignment method disclosed in the foregoing embodiments.
For the specific process of the sequence comparison method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
Further, referring to fig. 6, a schematic structural diagram of an electronic device 20 provided in the embodiment of the present application is shown, where the electronic device 20 may specifically include, but is not limited to, a tablet computer, a notebook computer, or a desktop computer.
In general, the electronic device 20 in the present embodiment includes: a processor 21 and a memory 22.
The processor 21 may also include a main processor, which is a processor for processing data in a wake-up state and is also referred to as a Central Processing Unit (CPU), and a coprocessor, which is a low power consumption processor for processing data in a standby state, the processor 21 may be integrated with a GPU (graphics processing unit) for rendering and rendering images to be displayed on a display screen, in some embodiments, the processor 21 may include an AI (intelligent processor) for learning about AI operations.
The memory 22 may include one or more computer-readable storage media, which may be non-transitory. The memory 22 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 22 is used to store at least the following computer program 221, which, when loaded and executed by the processor 21, implements the steps of the sequence alignment method disclosed in any of the foregoing embodiments.
In some embodiments, the electronic device 20 may further include a display 23, an input/output interface 24, a communication interface 25, a sensor 26, a power supply 27, and a communication bus 28.
Those skilled in the art will appreciate that the configuration shown in FIG. 6 does not limit the electronic device 20, which may include more or fewer components than those shown.
Further, an embodiment of the present application also discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
segmenting a sequence read to be aligned to obtain target seeds corresponding to the read; segmenting a pre-obtained target reference sequence to obtain reference sub-segments; comparing all target seeds in the read with the reference sub-segments according to a preset rule to obtain target scores; screening target reference sub-segments from the reference sub-segments according to the target scores; and determining the position of each target reference sub-segment in the target reference sequence as an exact matching position of the read in the target reference sequence.
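By way of illustration only, the five steps above can be sketched in Python as follows. The application does not fix how the read or the reference is segmented, nor what the preset rule is, so everything concrete here is an assumption of the sketch: non-overlapping fixed-length seeds, fixed-length reference windows, a rule that counts seeds occurring in a window, and the hypothetical names `split_into_seeds`, `split_reference`, `score_sub_segment`, and `candidate_positions`.

```python
def split_into_seeds(read: str, seed_len: int) -> list[str]:
    """Cut the read into consecutive, non-overlapping seeds (assumed scheme)."""
    return [read[i:i + seed_len]
            for i in range(0, len(read) - seed_len + 1, seed_len)]


def split_reference(reference: str, window: int, step: int) -> list[tuple[int, str]]:
    """Cut the reference into (start, sub_segment) windows (assumed scheme)."""
    last_start = max(len(reference) - window, 0)
    return [(start, reference[start:start + window])
            for start in range(0, last_start + 1, step)]


def score_sub_segment(seeds: list[str], sub_segment: str) -> int:
    """Assumed preset rule: one point for every seed found in the sub-segment."""
    return sum(1 for seed in seeds if seed in sub_segment)


def candidate_positions(read: str, reference: str, seed_len: int,
                        window: int, step: int, threshold: int) -> list[int]:
    """Return the start positions of sub-segments whose score passes the threshold."""
    seeds = split_into_seeds(read, seed_len)
    return [start
            for start, sub_segment in split_reference(reference, window, step)
            if score_sub_segment(seeds, sub_segment) >= threshold]


if __name__ == "__main__":
    reference = "TTTTTTTT" + "ACGTACGG" + "TTTTTTTT"
    print(candidate_positions("ACGTACGG", reference,
                              seed_len=4, window=8, step=8, threshold=2))
```

Only the positions returned by `candidate_positions` would be handed to the later extension stage, which is where the saving in computation described in the text would come from.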
It can be seen that, by segmenting the sequence read to be aligned and the target reference sequence, scoring the reference sub-segments against all target seeds according to the preset rule, and determining as exact matching positions only the positions of the target reference sub-segments screened out according to the target scores, invalid matching positions can be filtered out, which saves computing resources, improves sequence alignment performance, and reduces the workload of the subsequent sequence extension.
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following step may be specifically implemented: comparing all target seeds in the sequence read to be aligned with any one of the reference sub-segments according to the preset rule to obtain the target score corresponding to that reference sub-segment.
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all target seeds in the sequence read to be aligned with the reference sub-segment, and judging whether a target condition occurs in the comparison process; and if the target condition occurs in the comparison process, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: judging whether, in the comparison process, all target seeds in the sequence read to be aligned hit on the reference sub-segment but their hit positions on the reference sub-segment are discontinuous; and/or judging whether, in the comparison process, a first target seed among all the target seeds in the sequence read to be aligned hits on the reference sub-segment while a second target seed among them does not.
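The penalty-based scoring and the two target conditions described above may be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the initial score and penalty values are invented for the example, a "hit" is taken to be an exact substring match located with `str.find` (first occurrence only), "discontinuous" is taken to mean that consecutive seeds do not land back to back, and the penalty is charged once per missing seed, including the case where no seed hits at all, which the text does not address. The function name `penalty_score` is hypothetical.

```python
def penalty_score(seeds: list[str], sub_segment: str, initial: int = 100,
                  gap_penalty: int = 10, miss_penalty: int = 20) -> int:
    """Score one reference sub-segment by subtracting penalties (illustrative values)."""
    # Position of the first occurrence of each seed; -1 means the seed does not hit.
    hits = [sub_segment.find(seed) for seed in seeds]
    score = initial  # initialize the target score

    misses = sum(1 for h in hits if h < 0)
    if misses == 0:
        # First condition: every seed hits, but the hit positions are discontinuous,
        # i.e. some seed does not start exactly where the previous seed ends.
        contiguous = all(b - a == len(seed)
                         for a, b, seed in zip(hits, hits[1:], seeds))
        if not contiguous:
            score -= gap_penalty
    else:
        # Second condition: some seed hits while another does not.
        # Charging per missing seed is an assumption of this sketch.
        score -= miss_penalty * misses
    return score


if __name__ == "__main__":
    seeds = ["ACGT", "ACGG"]
    for segment in ("ACGTACGG", "ACGTTTACGG", "ACGTTTTT"):
        print(segment, penalty_score(seeds, segment))
```

Because `str.find` reports only the first occurrence, a seed that repeats within the sub-segment could be judged discontinuous when a later occurrence would have been contiguous; a fuller treatment would need to consider all occurrences.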
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all target seeds in the sequence read to be aligned with the reference sub-segment; and if a target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
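The reward-based variant can be sketched in the same way. The initial score of 0, the reward of 1 per hit, the use of exact substring matching as the hit test, and the name `reward_score` are all assumptions made for the example; the application leaves these values open.

```python
def reward_score(seeds: list[str], sub_segment: str,
                 initial: int = 0, reward: int = 1) -> int:
    """Score one reference sub-segment by adding a reward for every seed that hits."""
    score = initial  # initialize the target score
    for seed in seeds:
        if seed in sub_segment:  # assumed hit test: exact substring match
            score += reward
    return score


if __name__ == "__main__":
    seeds = ["ACGT", "ACGG", "TTAA"]
    print(reward_score(seeds, "ACGTACGGCCCC"))
```

With these defaults the score is simply the number of seeds that hit, so its maximum equals the number of seeds in the read.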
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: judging whether the target score is greater than or equal to a preset score threshold; and if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
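A minimal sketch of this threshold screening is given below. It assumes, purely for illustration, that the target scores are held in a dictionary keyed by the start position of each reference sub-segment in the target reference sequence; the name `screen_sub_segments` is hypothetical.

```python
def screen_sub_segments(scores: dict[int, float], threshold: float) -> list[int]:
    """Keep the start positions whose target score is >= the preset threshold."""
    return sorted(position for position, score in scores.items()
                  if score >= threshold)


if __name__ == "__main__":
    scores = {0: 1, 8: 4, 16: 3}  # start position -> target score
    print(screen_sub_segments(scores, threshold=3))
```

The comparison is inclusive, matching the "greater than or equal to" wording of the text.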
In this embodiment, when the computer program stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: normalizing the target score; judging whether the normalized target score is greater than or equal to a preset score threshold; and if the normalized target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
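The application does not state how the target score is normalized. One plausible choice, assumed here only for illustration, is to divide by the maximum attainable score so that a single threshold between 0 and 1 can be applied to reads that yield different numbers of seeds. The names `normalize_score` and `screen_normalized` are hypothetical.

```python
def normalize_score(score: float, max_score: float) -> float:
    """Scale a target score by the maximum attainable score (assumed normalization)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return score / max_score


def screen_normalized(scores: dict[int, float], max_score: float,
                      threshold: float) -> list[int]:
    """Keep the start positions whose normalized score is >= the preset threshold."""
    return sorted(position for position, score in scores.items()
                  if normalize_score(score, max_score) >= threshold)


if __name__ == "__main__":
    scores = {0: 1, 8: 4, 16: 3}  # start position -> target score
    print(screen_normalized(scores, max_score=4, threshold=0.75))
```

Under a reward scheme of one point per hit, `max_score` would be the number of seeds in the read, so a threshold of 0.75 would keep sub-segments on which at least three quarters of the seeds hit.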
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The sequence alignment method, apparatus, device, and medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of sequence alignment, comprising:
segmenting a sequence read to be aligned to obtain target seeds corresponding to the sequence read to be aligned;
segmenting a pre-obtained target reference sequence to obtain reference sub-segments;
comparing all target seeds in the sequence read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
screening a target reference sub-segment from the reference sub-segments according to the target score;
and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the sequence read to be aligned in the target reference sequence.
2. The sequence alignment method of claim 1, wherein the comparing all target seeds in the sequence read to be aligned with the reference sub-segments according to a preset rule to obtain a target score comprises:
comparing all target seeds in the sequence read to be aligned with any one of the reference sub-segments according to the preset rule to obtain a target score.
3. The sequence alignment method of claim 2, wherein the comparing all target seeds in the sequence read to be aligned with any one of the reference sub-segments according to the preset rule to obtain a target score comprises:
initializing the target score;
comparing all target seeds in the sequence read to be aligned with the reference sub-segment, and judging whether a target condition occurs in the comparison process;
and if the target condition occurs in the comparison process, subtracting the corresponding penalty score from the target score to obtain the target score corresponding to the reference sub-segment.
4. The sequence alignment method of claim 3, wherein the judging whether a target condition occurs in the comparison process comprises:
judging whether, in the comparison process, all target seeds in the sequence read to be aligned hit on the reference sub-segment but their hit positions on the reference sub-segment are discontinuous;
and/or judging whether, in the comparison process, a first target seed among all the target seeds in the sequence read to be aligned hits on the reference sub-segment while a second target seed among them does not.
5. The sequence alignment method of claim 2, wherein the comparing all target seeds in the sequence read to be aligned with any one of the reference sub-segments according to the preset rule to obtain a target score comprises:
initializing the target score;
comparing all target seeds in the sequence read to be aligned with the reference sub-segment;
and if a target seed hits on the reference sub-segment, adding the corresponding reward score to the target score to obtain the target score corresponding to the reference sub-segment.
6. The method of sequence alignment according to claim 1, wherein said screening said reference sub-segments for a target reference sub-segment according to said target score comprises:
judging whether the target score is greater than or equal to a preset score threshold value;
and if the target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the target score as a target reference sub-segment.
7. The method of sequence alignment according to claim 1, wherein said screening said reference sub-segments for a target reference sub-segment according to said target score comprises:
normalizing the target score;
judging whether the normalized target score is greater than or equal to a preset score threshold value;
and if the normalized target score is greater than or equal to a preset score threshold value, determining the reference sub-segment corresponding to the normalized target score as a target reference sub-segment.
8. A sequence alignment apparatus, comprising:
a first segmentation module, configured to segment a sequence read to be aligned to obtain target seeds corresponding to the sequence read to be aligned;
a second segmentation module, configured to segment a pre-obtained target reference sequence to obtain reference sub-segments;
a comparison module, configured to compare all target seeds in the sequence read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
a segment screening module, configured to screen a target reference sub-segment from the reference sub-segments according to the target score;
a position determining module, configured to determine a position of the target reference sub-segment in the target reference sequence as an exact matching position of the sequence read to be aligned in the target reference sequence.
9. A sequence alignment device, comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
and the processor is used for executing the computer program to implement the sequence alignment method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of sequence alignment of any of claims 1 to 7.
CN202010130211.2A 2020-02-28 2020-02-28 Sequence comparison method, device, equipment and medium Withdrawn CN111402956A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010130211.2A CN111402956A (en) 2020-02-28 2020-02-28 Sequence comparison method, device, equipment and medium
PCT/CN2020/126350 WO2021169387A1 (en) 2020-02-28 2020-11-04 Sequence alignment method, apparatus and device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010130211.2A CN111402956A (en) 2020-02-28 2020-02-28 Sequence comparison method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN111402956A true CN111402956A (en) 2020-07-10

Family

ID=71430385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010130211.2A Withdrawn CN111402956A (en) 2020-02-28 2020-02-28 Sequence comparison method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN111402956A (en)
WO (1) WO2021169387A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021169387A1 (en) * 2020-02-28 2021-09-02 苏州浪潮智能科技有限公司 Sequence alignment method, apparatus and device, and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101508817B1 (en) * 2012-10-29 2015-04-08 삼성에스디에스 주식회사 System and method for aligning genome sequence
CN106682393B (en) * 2016-11-29 2019-05-17 北京荣之联科技股份有限公司 Genome sequence comparison method and device
CN108985008B (en) * 2018-06-29 2022-03-08 郑州云海信息技术有限公司 Method and system for rapidly comparing gene data
CN109887547B (en) * 2019-03-06 2020-10-02 苏州浪潮智能科技有限公司 Gene sequence comparison filtering acceleration processing method, system and device
CN110379461A (en) * 2019-06-28 2019-10-25 苏州浪潮智能科技有限公司 A kind of gene data comparison method, device, equipment and medium
CN110517728B (en) * 2019-08-29 2022-04-29 苏州浪潮智能科技有限公司 Gene sequence comparison method and device
CN110797085B (en) * 2019-10-25 2022-07-08 浪潮(北京)电子信息产业有限公司 Method, system, equipment and storage medium for inquiring gene data
CN111402956A (en) * 2020-02-28 2020-07-10 苏州浪潮智能科技有限公司 Sequence comparison method, device, equipment and medium

Also Published As

Publication number Publication date
WO2021169387A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
CN103946855B (en) For detection towards the methods, devices and systems returning programming attack
US7987473B1 (en) Accelerated class check
US9026739B2 (en) Multimode prefetcher
US8516205B2 (en) Method and apparatus for providing efficient context classification
US20090282225A1 (en) Store queue
US8990486B2 (en) Hardware and file system agnostic mechanism for achieving capsule support
US20110010506A1 (en) Data prefetcher with multi-level table for predicting stride patterns
JP2000276381A (en) Method for estimating task execution time
CN111402956A (en) Sequence comparison method, device, equipment and medium
US20160092115A1 (en) Implementing storage policies regarding use of memory regions
JP7262520B2 (en) Methods, apparatus, apparatus and computer readable storage media for executing instructions
CN114168318A (en) Training method of storage release model, storage release method and equipment
US20150370565A1 (en) Data processing device and method, and processor unit of same
US20050268040A1 (en) Cache system having branch target address cache
US9507725B2 (en) Store forwarding for data caches
US20210165654A1 (en) Eliminating execution of instructions that produce a constant result
US10862485B1 (en) Lookup table index for a processor
CN111381881A (en) AHB (advanced high-performance bus) interface-based low-power-consumption instruction caching method and device
CN114840258B (en) Multi-level hybrid algorithm filtering type branch prediction method and prediction system
US20160283338A1 (en) Boot operations in memory devices
AU2017438670B2 (en) Simulation device, simulation method, and simulation program
CN113688785A (en) Multi-supervision-based face recognition method and device, computer equipment and storage medium
US5903915A (en) Cache detection using timing differences
JP7173308B2 (en) DETECTION DEVICE, DETECTION METHOD AND DETECTION PROGRAM
US9342319B1 (en) Accelerated class check

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200710