CN110648718B - Mutation detection method and device, storage medium and electronic equipment - Google Patents

Mutation detection method and device, storage medium and electronic equipment

Info

Publication number
CN110648718B
Authority
CN
China
Prior art keywords
basic data
data
probability
sequenced
activation region
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911192425.6A
Other languages
Chinese (zh)
Other versions
CN110648718A (en)
Inventor
刘兵
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Original Assignee
Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Application filed by Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Priority to CN201911192425.6A
Publication of CN110648718A
Application granted
Publication of CN110648718B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20 Probabilistic models

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Probability & Statistics with Applications (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present disclosure relates to the field of data processing technologies, and in particular to a variation detection method and apparatus, a computer-readable storage medium, and an electronic device. The method includes: adding basic data to a probability calculation queue; acquiring a basic data group from the probability calculation queue, calculating in parallel, based on each basic data in the group, the probability value that each fragment to be sequenced in each activation region is a haplotype, and adding the basic data and the corresponding probability values to a probability output queue; and acquiring the basic data and the corresponding probability values from the probability output queue, and calculating variation information in the activation region according to the basic data and the probability values. With this technical scheme, the probability values corresponding to multiple basic data can be calculated simultaneously in parallel, which avoids the slow probability calculation caused by computing the probability values one after another and thereby increases the speed of variation detection.

Description

Mutation detection method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a variation detection method and apparatus, a computer-readable storage medium, and an electronic device.
Background
During disease treatment, the same medication often produces widely varying effects, largely because of genetic differences between individuals. To better tailor treatment to different individuals, researchers continue to investigate how to perform mutation detection on large amounts of genetic data.
Variation detection refers to detecting genetic variation information at the individual or population level of a species by means of high-throughput sequencing. The GATK (Genome Analysis Toolkit) HaplotypeCaller tool is a commonly used mutation detection tool that reduces false positives caused by sequencing errors and offers high accuracy. However, when there are many samples, mutation detection is slow and inefficient because the calculation must be performed for each activation region one by one.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a mutation detection method and apparatus, a computer-readable storage medium, and an electronic device, so as to overcome the problems of slow speed and low efficiency of mutation detection at least to some extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a mutation detection method, including:
adding basic data to a probability calculation queue, wherein the basic data comprises the fragments to be sequenced in an activation region and a haplotype;
acquiring a basic data group from the probability calculation queue, calculating in parallel, based on each basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype, and adding the basic data and the corresponding probability values to a probability output queue;
and acquiring the basic data and the corresponding probability values from the probability output queue, and calculating variation information in the activation region according to the basic data and the probability values.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, before adding the base data to the probability calculation queue, the method further includes:
and generating at least one piece of basic data according to the fragments to be sequenced and preset genetic data.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the generating at least one basic data according to the fragments to be sequenced and preset genetic data includes:
comparing the fragment to be sequenced with the preset genetic data to obtain comparison data, and identifying at least one activation region according to the comparison data;
determining the haplotypes in each activation region according to the fragments to be sequenced in each activation region and the preset genetic data;
and generating basic data according to the fragments to be sequenced and the haplotypes in the activation regions.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the determining the haplotype in each activation region according to the to-be-sequenced fragment in each activation region and the preset genetic data includes:
and locally assembling the fragments to be sequenced in each activation region and the preset genetic data, and determining the haplotypes in the activation regions according to the local assembly result.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the generating at least one piece of basic data according to the segment to be sequenced and the preset genetic data is performed in a multi-thread manner.
In an exemplary embodiment of the disclosure, based on the above scheme, the calculating a probability value that each of the to-be-sequenced fragments in each of the activation regions is the haplotype based on each of the basis data in the basis data set includes:
and respectively inputting the basic data into a preset model to calculate the probability value of the haplotype of the segment to be sequenced in the activation region corresponding to the basic data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset model includes a pair hidden Markov model (pair-HMM).
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the calculating variant information in the activation region according to the base data and the probability value includes:
and counting the probability value that the fragment to be sequenced is the haplotype to determine the variation information of each variation data point on the fragment to be sequenced in the activation region.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, a Bayesian statistical method is used when performing statistics on the probability values that the fragments to be sequenced are the haplotype.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the steps of calculating in parallel, based on each basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype, and adding the basic data and the corresponding probability values to a probability output queue, are implemented in parallel by a field programmable gate array (FPGA).
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the calculating variant information in the activation region according to the base data and the probability value is performed in a multi-thread manner.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the maximum number of pieces of basic data included in the basic data group is determined according to a preset number.
According to a second aspect of the present disclosure, there is provided a variation detecting apparatus comprising:
a data generation module, configured to add basic data to the probability calculation queue, wherein the basic data comprises the fragments to be sequenced in an activation region and a haplotype;
a probability calculation module, configured to obtain a group of basic data groups in a probability calculation queue, calculate, in a parallel manner, probability values that the segments to be sequenced in the activation regions are the haplotypes based on the basic data in the basic data groups, and add the basic data and the corresponding probability values to a probability output queue;
and the variation calculation module is used for acquiring the basic data and the corresponding probability value in the probability output queue and calculating variation information in the activation region according to the basic data and the probability value.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the mutation detection method as described in the first aspect of the embodiments above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the mutation detection method as described in the first aspect of the embodiments above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the mutation detection method provided by an embodiment of the present disclosure, basic data is added to a probability calculation queue, a group of basic data groups is obtained in the probability calculation queue, a probability value that each to-be-sequenced fragment in each activation region is the haplotype is calculated in a parallel manner based on each basic data in the basic data groups, the basic data and the corresponding probability value are added to a probability output queue, and finally, the basic data and the corresponding probability value in the probability output queue are read, and mutation information in the activation region is calculated according to the basic data and the probability value. By adding the basic data into the probability calculation queue, the probability values corresponding to the basic data can be simultaneously acquired and calculated in a parallel mode, the problem of low probability value calculation speed caused by sequential calculation of the probability values is avoided, and the speed of mutation detection is further increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow chart of a mutation detection method in an exemplary embodiment of the present disclosure;
FIG. 2 schematically shows a flowchart of a method for generating at least one basic data from a segment to be sequenced and preset genetic data in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flowchart of the variation detection method when the preset model is a pair-HMM, according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a variation detecting apparatus according to an exemplary embodiment of the disclosure;
FIG. 5 schematically illustrates a structural diagram of a computer system suitable for use with an electronic device that implements an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a computer-readable storage medium according to some embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In related-art mutation detection, the processes of generating the basic data, calculating the probability values, and calculating the variation information are generally performed one after another in a serial manner. When calculating the probability values, the probability that each fragment to be sequenced is each haplotype must be computed, and this computation accounts for a large proportion of the whole mutation detection process; the time spent calculating probability values therefore strongly affects the mutation detection speed. In the serial manner, each probability calculation is limited to the data of one activation region, i.e., one basic data, so mutation detection is slow. For these reasons, the speed of mutation detection can be increased by increasing the amount of data handled in each execution.
In the present exemplary embodiment, first, a mutation detection method is provided, which can be applied to a mutation detection process of genetic data. Referring to fig. 1, the mutation detection method may include the following steps:
S110, adding basic data to a probability calculation queue;
S120, reading a basic data group from the probability calculation queue, calculating in parallel, based on each basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype, and adding the basic data and the corresponding probability values to a probability output queue;
S130, reading the basic data and the corresponding probability values from the probability output queue, and calculating variation information in the activation region according to the basic data and the probability values.
According to the mutation detection method provided in the exemplary embodiment, the probability values corresponding to a plurality of pieces of basic data can be simultaneously acquired and calculated in a parallel manner by adding the pieces of basic data to the probability calculation queue, so that the problem of low probability value calculation speed caused by sequentially calculating the probability values is avoided, and the mutation detection speed is further increased.
Hereinafter, each step of the variation detection method in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.
Step S110, adding the basic data to the probability calculation queue.
In an example embodiment of the present disclosure, the fragments to be sequenced are test data fragments obtained by randomly breaking the genetic data to be tested, so there may be multiple fragments to be sequenced. In addition, to improve the analysis and detection capability, the test data fragments can be amplified through an amplification technology, and the amplified test data fragments are used as the fragments to be sequenced.
In an example embodiment of the present disclosure, before the adding the base data to the probability calculation queue, the method further comprises: and generating at least one piece of basic data according to the fragments to be sequenced and preset genetic data.
In an example embodiment of the present disclosure, the preset genetic data may include known genetic data of various species, such as genes, gene products, and the like.
In an example embodiment of the present disclosure, the step of generating the basic data according to the fragments to be sequenced and the preset genetic data may be executed in multiple threads or in a single thread, and the present disclosure does not particularly limit this. With multi-threaded execution, several threads add basic data to the probability calculation queue, which avoids the low mutation detection efficiency that results when the probability calculation queue holds little or no basic data.
In an exemplary embodiment of the disclosure, referring to fig. 2, the generating at least one basic data according to the fragments to be sequenced and the preset genetic data includes the following steps S210 to S230:
step S210, comparing the segment to be sequenced with the preset genetic data to obtain comparison data, and identifying at least one activation region according to the comparison data.
In an example embodiment of the present disclosure, the fragment to be sequenced may first be mapped to its matching portion in the preset genetic data; the fragment is then compared with that matching portion to obtain comparison data; finally, the comparison data is traversed to calculate, for each data point in the preset genetic data, a likelihood value that the data point is an activation point, and activation regions are identified according to these likelihood values. For example, data segments whose likelihood values are greater than a certain threshold may be identified as activation regions.
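The thresholding step described above can be sketched as follows. This is a minimal illustration only: the function name `find_activation_regions`, the example likelihood values, and the threshold of 0.5 are assumptions for demonstration and are not taken from the disclosure, which does not specify how the likelihood values are computed.

```python
from typing import List, Tuple


def find_activation_regions(likelihoods: List[float],
                            threshold: float) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of consecutive data
    points whose likelihood value is greater than the threshold."""
    regions = []
    start = None
    for i, value in enumerate(likelihoods):
        if value > threshold:
            if start is None:
                start = i          # a new candidate region begins here
        elif start is not None:
            regions.append((start, i))  # region ended at the previous point
            start = None
    if start is not None:          # region runs to the end of the data
        regions.append((start, len(likelihoods)))
    return regions


# Hypothetical per-data-point likelihood values.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.95, 0.9]
regions = find_activation_regions(scores, threshold=0.5)
print(regions)
```

Half-open ranges are used so that each region can be sliced directly out of the reference sequence.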
Step S220, determining the haplotypes in each activation region according to the fragments to be sequenced in each activation region and the preset genetic data.
Specifically, determining the haplotype in each activation region according to the to-be-sequenced fragment in each activation region and the preset genetic data comprises: and locally assembling the fragments to be sequenced in each activation region and the preset genetic data, and determining the haplotypes in the activation regions according to the local assembly result.
In an exemplary embodiment of the present disclosure, after the activation region is determined, the haplotypes that may appear in the activation region need to be assembled from the fragments to be sequenced in the activation region and the preset genetic data. For example, suppose the part of the preset genetic data in an activation region is TATGAATGTAGGCT, and the activation region covers the following fragments to be sequenced: fragment 1: TATGAATGTGGGCT; fragment 2: TATGCATGTAGGCT; and fragment 3: TATGCATGTGGGCT. Since the 5th data point of these sequences is A or C and the 10th data point is A or G, it can be determined that the variation data points in the activation region are the 5th and 10th data points, and the haplotypes that may occur are those in which the 5th and 10th data points take, respectively, the combinations AA, CA, AG, and CG, while the sequence of the other data points is consistent with the preset genetic data.
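The enumeration in this example can be sketched in a few lines. The sketch assumes that all fragments have the same length as the reference segment and are aligned to it without gaps, and it simply combines the bases observed at each differing position; it is a simplified stand-in for the local assembly described above, not that procedure itself. The helper names `variant_sites` and `candidate_haplotypes` are hypothetical, and positions are counted from 1.

```python
from itertools import product


def variant_sites(reference, fragments):
    """Map each 1-based position where any fragment differs from the
    reference to the sorted list of bases observed at that position."""
    sites = {}
    for pos, ref_base in enumerate(reference, start=1):
        alleles = {ref_base} | {frag[pos - 1] for frag in fragments}
        if len(alleles) > 1:
            sites[pos] = sorted(alleles)
    return sites


def candidate_haplotypes(reference, sites):
    """Build one candidate haplotype per combination of observed bases
    at the variant sites; all other positions follow the reference."""
    positions = sorted(sites)
    haplotypes = []
    for combo in product(*(sites[p] for p in positions)):
        seq = list(reference)
        for p, base in zip(positions, combo):
            seq[p - 1] = base
        haplotypes.append("".join(seq))
    return haplotypes


reference = "TATGAATGTAGGCT"
fragments = ["TATGAATGTGGGCT", "TATGCATGTAGGCT", "TATGCATGTGGGCT"]
sites = variant_sites(reference, fragments)
haplotypes = candidate_haplotypes(reference, sites)
print(sites)
print(haplotypes)
```

Two variant sites with two observed bases each give four candidate haplotypes, one of which is the reference segment itself.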
Step S230, generating basic data according to the fragments to be sequenced and the haplotypes in each activation region.
In an example embodiment of the present disclosure, the basic data is generated from the fragments to be sequenced within the activation region and the haplotype determined in step S220. The basic data need to be generated according to each activation region, and one basic data comprises a segment to be sequenced and a haplotype in one activation region, so that calculation can be performed on each activation region, and the problem of data confusion is avoided.
In addition, after one basic data has been generated and added to the probability calculation queue, generation of the next basic data can start immediately, using the fragments to be sequenced in an activation region that does not yet have basic data together with the preset genetic data, so that basic data is added to the probability calculation queue continuously.
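As an illustration of how basic data might be represented and fed into the probability calculation queue by several threads, consider the sketch below. The `BasicData` class, its field names, the `produce` function, and the placeholder sequences are all hypothetical; the disclosure only states that one basic data holds the fragments to be sequenced and the haplotypes of one activation region.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BasicData:
    """One unit of work: the fragments and haplotypes of one activation region."""
    region_id: int
    fragments: Tuple[str, ...]
    haplotypes: Tuple[str, ...]


def produce(regions, calc_queue):
    """Generate basic data for each assigned region and enqueue it at once,
    so the next region can be processed without waiting for the consumer."""
    for region_id, fragments, haplotypes in regions:
        calc_queue.put(BasicData(region_id, tuple(fragments), tuple(haplotypes)))


# Hypothetical activation regions (id, fragments, haplotypes).
work = [
    (0, ["ACGT"], ["ACGT", "ACTT"]),
    (1, ["GGCA"], ["GGCA", "GGTA"]),
    (2, ["TTAG"], ["TTAG", "TCAG"]),
    (3, ["CATG"], ["CATG", "CACG"]),
]

calc_queue = queue.Queue()  # thread-safe FIFO queue from the standard library
threads = [
    threading.Thread(target=produce, args=(work[i::2], calc_queue))
    for i in range(2)       # two producer threads, each taking every second region
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(calc_queue.qsize())
```

Because each basic data carries its own region identifier, results for different activation regions stay separate even when several threads enqueue concurrently.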
Step S120, a group of basic data groups in a probability calculation queue is obtained, the probability value that each segment to be sequenced in each activation region is the haplotype is calculated on the basis of each basic data in the basic data groups in a parallel mode, and the basic data and the corresponding probability value are added into a probability output queue.
TABLE 1. Probability values of the 4 haplotypes for fragments 1, 2, and 3 to be sequenced (table provided as an image in the original publication)
In an example embodiment of the present disclosure, the basic data obtained in step S110 are added to the probability calculation queue in sequence; then, each time a basic data group is acquired, the probability value that each fragment to be sequenced in the activation region contained in each basic data of the group is each haplotype is calculated in parallel. For example, in the above example, the haplotypes are those in which the 5th and 10th data points take, respectively, the combinations AA, CA, AG, and CG, while the sequence of the other data points is consistent with the preset genetic data. The probability values that fragment 1, fragment 2, and fragment 3 to be sequenced are each of these 4 haplotypes can then be calculated, and the results may be as shown in Table 1.
In an example embodiment of the present disclosure, the calculating, based on each basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype includes: inputting each basic data into a preset model to calculate the probability value that the fragment to be sequenced in the activation region corresponding to that basic data is the haplotype. The preset model may include a pair hidden Markov model (pair-HMM). Other models capable of calculating the probability that a fragment to be sequenced is a haplotype may also be used, and the disclosure is not limited in this respect.
Furthermore, the probability values that the fragments to be sequenced are haplotypes in the activation regions corresponding to the respective basic data can be calculated simultaneously, in parallel, by a field programmable gate array (FPGA) computing platform according to the preset model, thereby accelerating mutation detection. Because the FPGA computing platform achieves its acceleration through multiple parallel computing units, the more computing units take part in the operation, the higher the acceleration efficiency; the acceleration effect is therefore better when more probability values can be computed simultaneously. By putting the basic data into the probability calculation queue and acquiring basic data groups from it, the probability values of several activation regions can be calculated in parallel from the fragments to be sequenced in the different activation regions and the preset genetic data, instead of calculating for each activation region one by one. The computing capability of the FPGA computing platform can thus be fully utilized, which increases the speed and efficiency of mutation detection.
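To make the role of the preset model concrete, the following is a minimal sketch of a pair-HMM forward computation that returns the likelihood of one fragment given one haplotype. It is a simplified illustration under stated assumptions, not the model used in the disclosure: the function name `pair_hmm_likelihood` is hypothetical, and the fixed `base_error`, `gap_open`, and `gap_extend` values are illustrative (production tools typically derive such values from per-base quality scores and work in log space).

```python
def pair_hmm_likelihood(read, hap, base_error=0.01, gap_open=0.001, gap_extend=0.1):
    """Forward algorithm over three states: match (M), insertion in the
    read (I), and deletion from the read (D). Returns P(read | hap)
    under the simplified, fixed-parameter model described above."""
    R, H = len(read), len(hap)
    t_mm = 1.0 - 2.0 * gap_open      # match -> match
    t_gm = 1.0 - gap_extend          # gap (I or D) -> match
    M = [[0.0] * (H + 1) for _ in range(R + 1)]
    I = [[0.0] * (H + 1) for _ in range(R + 1)]
    D = [[0.0] * (H + 1) for _ in range(R + 1)]
    # The read may start at any haplotype position with equal weight.
    for j in range(H + 1):
        D[0][j] = 1.0 / H
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            # Emission: a matching base is likely, a mismatch is a sequencing error.
            emit = 1.0 - base_error if read[i - 1] == hap[j - 1] else base_error / 3.0
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_gm * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            # An inserted read base is emitted uniformly over the four bases.
            I[i][j] = 0.25 * (gap_open * M[i - 1][j] + gap_extend * I[i - 1][j])
            D[i][j] = gap_open * M[i][j - 1] + gap_extend * D[i][j - 1]
    # The read must be fully consumed; it may end at any haplotype position.
    return sum(M[R][j] + I[R][j] for j in range(1, H + 1))


read = "TATGAATGTGGGCT"  # fragment 1 from the example above
haplotypes = ["TATGAATGTAGGCT", "TATGCATGTAGGCT",
              "TATGAATGTGGGCT", "TATGCATGTGGGCT"]
scores = {hap: pair_hmm_likelihood(read, hap) for hap in haplotypes}
best = max(scores, key=scores.get)
print(best)
```

Each call is independent of the others, which is what makes this step a natural candidate for parallel execution across fragments, haplotypes, and activation regions.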
Further, the maximum number of basic data included in a basic data group may be determined according to a preset number, which can be set according to the computing capacity of the parallel computing platform. For example, when the platform used for parallel computing is an FPGA computing platform that allows 10 parallel computations to run simultaneously, the maximum number of basic data in a basic data group may be set to 10. That is, for each calculation, a basic data group containing at most 10 basic data is acquired from the probability calculation queue, so as to fully utilize the computing power of the FPGA computing platform.
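The grouping rule above can be sketched as follows. In this illustration a thread pool stands in for the FPGA computing platform, which is an assumption made purely so the example runs on ordinary hardware; the names `take_group` and `process_group` are hypothetical, and the squaring function is a placeholder for the probability calculation.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def take_group(calc_queue, max_size):
    """Wait for one item, then take whatever else is immediately
    available, up to max_size items in total."""
    group = [calc_queue.get()]
    while len(group) < max_size:
        try:
            group.append(calc_queue.get_nowait())
        except queue.Empty:
            break               # queue drained: run with a partial group
    return group


def process_group(group, compute, workers):
    """Apply compute to every item of the group concurrently and pair
    each item with its result, preserving the group order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(zip(group, pool.map(compute, group)))


calc_queue = queue.Queue()
for n in range(25):             # 25 placeholder work items
    calc_queue.put(n)

sizes = []
results = []
while not calc_queue.empty():
    group = take_group(calc_queue, max_size=10)
    sizes.append(len(group))
    results.extend(process_group(group, lambda x: x * x, workers=10))

print(sizes)
```

Taking a partial group when the queue runs dry keeps the computing stage busy instead of blocking until a full group of 10 has accumulated.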
In addition, after the basic data and the corresponding probability value are added into the probability output queue, another group of basic data groups which are not subjected to probability calculation can be directly read for calculation, and the basic data and the corresponding probability value are added into the probability output queue continuously.
Step S130, reading the basic data and the corresponding probability value in the probability output queue, and calculating variation information in the activation region according to the basic data and the probability value.
In an example embodiment of the present disclosure, the calculating variant information in the activation region according to the base data and the probability value includes: and counting the probability value that the fragment to be sequenced is the haplotype to determine the variation information of each variation data point on the fragment to be sequenced in the activation region.
TABLE 2. Genotype probabilities for the 5th data point of fragments 1, 2, and 3 to be sequenced (table provided as an image in the original publication)
Specifically, when performing statistics on the haplotype probability values, a Bayesian statistical method may be used, and other statistical methods may also be used; this disclosure does not limit the choice. For example, assuming that the fragments to be sequenced in Table 1 are human gene data, the genotype probability of the 5th data point is determined according to the probability values in Table 1, as shown in Table 2.
Then, Bayesian statistics can be applied to these probability values; since the human genome is diploid, the probabilities of the genotypes C/C, C/A, and A/A at the 5th data point can be calculated. The results calculated from the data in Table 2 are 0.0012, 0.00065, and 0.00027, respectively, so the variation information of the 5th data point is determined to be C/C.
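One common way to carry out such a Bayesian calculation for a diploid site is sketched below. It assumes a uniform prior over genotypes and treats each fragment as drawn with equal probability from either of the two chromosome copies; these modelling choices, the function name `genotype_posteriors`, and the per-fragment allele likelihoods are all hypothetical. The numbers are not the contents of Table 2, so the output is not expected to reproduce the values quoted above.

```python
from itertools import combinations_with_replacement


def genotype_posteriors(allele_likelihoods, priors=None):
    """Return the posterior probability of each diploid genotype.

    allele_likelihoods: one dict per fragment, mapping allele -> P(fragment | allele).
    priors: optional dict mapping genotype tuple -> prior; uniform if omitted.
    """
    alleles = sorted(allele_likelihoods[0])
    genotypes = list(combinations_with_replacement(alleles, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    unnormalised = {}
    for g in genotypes:
        likelihood = 1.0
        for per_fragment in allele_likelihoods:
            # A fragment comes from either chromosome copy with probability 1/2.
            likelihood *= 0.5 * per_fragment[g[0]] + 0.5 * per_fragment[g[1]]
        unnormalised[g] = priors[g] * likelihood
    total = sum(unnormalised.values())
    return {g: value / total for g, value in unnormalised.items()}


# Hypothetical likelihoods of alleles A and C at one site for three fragments.
fragment_likelihoods = [
    {"A": 0.10, "C": 0.06},
    {"A": 0.01, "C": 0.20},
    {"A": 0.01, "C": 0.20},
]
posteriors = genotype_posteriors(fragment_likelihoods)
call = max(posteriors, key=posteriors.get)
print(call)
```

The genotype with the largest posterior is reported as the variation information of the site, mirroring the selection of C/C in the example above.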
In an example embodiment of the present disclosure, the process of calculating the variation information in the activation region according to the basic data and the probability values may be executed through multiple threads or through a single thread, which is not particularly limited by the present disclosure. Because the FPGA computing platform computes a basic data group in parallel, multiple basic data and their corresponding probability values can be added to the probability output queue at the same time; calculating the variation information through multiple threads can therefore further accelerate variation detection and improve its efficiency.
In addition, the process of generating the basic data according to the fragments to be sequenced and the preset genetic data in step S110 can be executed through multiple threads. If the calculation were performed for each activation region one by one, the multiple threads generating basic data would call the same computing platform simultaneously and compete for it. Setting up the probability calculation queue avoids this call contention, thereby increasing the speed and efficiency of mutation detection.
The following takes a preset model as a pair hmm model as an example, and details of implementation of the technical solution of the embodiment of the present disclosure are described in detail with reference to fig. 3:
step S310, comparing the fragments to be sequenced with preset genetic data to obtain comparison data, and identifying at least one activation region according to the comparison data;
step S320, assembling haplotypes in the activation region, and generating basic data from the fragments to be sequenced and the haplotypes in the activation region;
step S330, adding the generated basic data to a probability calculation queue;
step S340, acquiring a basic data group from the probability calculation queue, and calculating each piece of basic data in the basic data group in a parallel manner with a pair-HMM model running on an FPGA computing platform, to obtain each piece of basic data and its corresponding probability value;
step S350, adding each piece of basic data in the basic data group and its corresponding probability value to a probability output queue;
and step S360, acquiring the basic data and the corresponding probability value from the probability output queue, and calculating the variation information in the activation region according to the basic data and the corresponding probability value.
Specifically, as shown in fig. 3, once a piece of generated basic data has been added to the probability calculation queue, generation of the next piece of basic data can start immediately. Likewise, once each piece of basic data in a basic data group and its corresponding probability value have been added to the probability output queue, the next basic data group can be acquired and its probability values calculated immediately. In addition, once the variation information in the activation region has been calculated from a piece of basic data and its corresponding probability value, the next piece of basic data can be acquired immediately to calculate variation information. The probability calculation queue and the probability output queue thus decouple the process of generating the basic data, the process of calculating the probabilities, and the process of calculating the variation information, which solves the problems of low variation detection speed and low efficiency caused by processing each piece of basic data one by one.
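The queue-based decoupling of steps S330 to S360 can be sketched in software as three threads connected by two queues. This is only an illustrative sketch: the names `run_pipeline`, `score`, and `call` are hypothetical, `score` stands in for the FPGA pair-HMM computation, and a single thread per stage is assumed for simplicity.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a queue (illustrative convention)


def run_pipeline(items, score, call, batch_size=4):
    """Run generation, probability calculation, and variation calculation
    as three stages connected by two queues."""
    calc_queue = queue.Queue()    # plays the role of the probability calculation queue
    output_queue = queue.Queue()  # plays the role of the probability output queue
    results = []

    def generate():
        # Stage 1: enqueue each piece of basic data without waiting for scoring.
        for item in items:
            calc_queue.put(item)
        calc_queue.put(_DONE)

    def compute():
        # Stage 2: take a group of at most batch_size items and score the group.
        finished = False
        while not finished:
            item = calc_queue.get()  # block until at least one item is available
            if item is _DONE:
                break
            batch = [item]
            while len(batch) < batch_size:
                try:
                    item = calc_queue.get_nowait()
                except queue.Empty:
                    break
                if item is _DONE:
                    finished = True
                    break
                batch.append(item)
            for entry in batch:
                output_queue.put((entry, score(entry)))
        output_queue.put(_DONE)

    def summarize():
        # Stage 3: consume (basic data, probability) pairs as they arrive.
        while True:
            entry = output_queue.get()
            if entry is _DONE:
                break
            results.append(call(*entry))

    threads = [threading.Thread(target=stage) for stage in (generate, compute, summarize)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each stage only blocks on its own queue, the generating stage never waits for the scoring stage to finish an item, which is the effect the two queues are described as providing.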
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Embodiments of the disclosed apparatus are described below, which can be used to perform the above-described variation detection methods of the present disclosure. Referring to fig. 4, the variation detection apparatus 400 includes: a data generation module 410, a probability calculation module 420, and a variation calculation module 430.
The data generation module 410 may be configured to add the basic data to a probability calculation queue, wherein the basic data comprises a fragment to be sequenced in an activation region and a haplotype;
the probability calculation module 420 may be configured to acquire a basic data group from the probability calculation queue, calculate in a parallel manner, based on each piece of basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype, and add the basic data and the corresponding probability values to a probability output queue;
the variation calculation module 430 may be configured to acquire the basic data and the corresponding probability value from the probability output queue, and calculate variation information in the activation region according to the basic data and the probability value.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the data generating module 410 may be configured to generate at least one of the basic data according to the segment to be sequenced and the preset genetic data.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the data generating module 410 may be configured to compare the fragment to be sequenced with the preset genetic data to obtain comparison data, and identify at least one of the activation regions according to the comparison data; determining the haplotypes in each activation region according to the fragments to be sequenced in each activation region and the preset genetic data; and generating basic data according to the fragments to be sequenced and the haplotypes in the activation regions.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the data generating module 410 may be configured to perform local assembly on the segment to be sequenced and the preset genetic data in each of the activation regions, and determine the haplotype in the activation regions according to the result of the local assembly.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the generating at least one piece of basic data according to the segment to be sequenced and the preset genetic data is performed in a multi-thread manner.
In an exemplary embodiment of the disclosure, based on the foregoing solution, the probability calculation module 420 may be configured to input each basic data into a preset model respectively to calculate a probability value that the segment to be sequenced is the haplotype in the activation region corresponding to the basic data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset model includes a pair hidden Markov model (pair-HMM).
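To make the role of the pair hidden Markov model concrete, the following is a simplified textbook-style forward recursion that sums over all alignments of a fragment to a haplotype. The disclosure does not specify the model parameters, so the constant base error rate, the gap-open and gap-extension probabilities, and the uniform start position along the haplotype are all assumptions chosen for illustration.

```python
def pair_hmm_likelihood(read, haplotype, base_error=0.01, gap_open=0.001, gap_extend=0.1):
    """Forward algorithm of a three-state pair-HMM (match, insertion, deletion).

    Returns a likelihood score for the read given the haplotype. All
    parameter values are illustrative assumptions; the haplotype must be
    non-empty.
    """
    m, n = len(read), len(haplotype)
    match_to_match = 1.0 - 2.0 * gap_open
    gap_to_match = 1.0 - gap_extend

    # One (m + 1) x (n + 1) table per hidden state.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    I = [[0.0] * (n + 1) for _ in range(m + 1)]
    D = [[0.0] * (n + 1) for _ in range(m + 1)]

    # The read may start at any haplotype position with equal probability.
    for j in range(n + 1):
        D[0][j] = 1.0 / n

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Emission: matching bases are likely, mismatches share the error rate.
            emit = 1.0 - base_error if read[i - 1] == haplotype[j - 1] else base_error / 3.0
            M[i][j] = emit * (
                match_to_match * M[i - 1][j - 1]
                + gap_to_match * (I[i - 1][j - 1] + D[i - 1][j - 1])
            )
            I[i][j] = gap_open * M[i - 1][j] + gap_extend * I[i - 1][j]
            D[i][j] = gap_open * M[i][j - 1] + gap_extend * D[i][j - 1]

    # The read may end at any haplotype position, in a match or insertion state.
    return sum(M[m][j] + I[m][j] for j in range(1, n + 1))
```

Under these assumptions a fragment identical to the haplotype scores higher than one containing a mismatch, so comparing the scores of one fragment against several candidate haplotypes indicates which haplotype it most plausibly came from.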
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the variation calculating module 430 may be configured to determine variation information of each variation data point on the fragment to be sequenced in the activation region by counting probability values that the fragment to be sequenced is the haplotype.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, a Bayesian statistical method is used when performing statistics on the probability values that the fragments to be sequenced are the haplotypes.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the calculating, in a parallel manner, of the probability values that the fragments to be sequenced in the activation regions are the haplotypes based on the basic data in the basic data group, and the adding of the basic data and the corresponding probability values to a probability output queue, are implemented in parallel by a field-programmable gate array (FPGA).
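As a software stand-in for the parallel computation of one basic data group, the sketch below scores every item of a group concurrently with a thread pool. It is not an FPGA implementation: `score_batch` and `score_fn` are hypothetical names, and in CPython a thread pool does not speed up CPU-bound scoring, so the example only illustrates the group-in, pairs-out interface.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch, score_fn, max_workers=4):
    """Score all items of one basic data group concurrently.

    Returns (item, probability) pairs in the same order as the input, ready
    to be added to an output queue. score_fn is a placeholder for whatever
    model computes the probability of one item.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        probabilities = list(pool.map(score_fn, batch))
    return list(zip(batch, probabilities))
```

`pool.map` preserves input order, so each probability stays paired with the basic data it was computed from.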
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the calculating variant information in the activation region according to the base data and the probability value is performed in a multi-thread manner.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the maximum number of pieces of basic data included in the basic data group is determined according to a preset number.
Since each functional module of the variation detection apparatus in the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the variation detection method, please refer to the embodiment of the variation detection method in the present disclosure for details that are not disclosed in the embodiment of the present disclosure.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above mutation detection method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 500 according to such an embodiment of the present disclosure is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting various system components (including the storage unit 520 and the processing unit 510), and a display unit 540.
The storage unit stores program code that can be executed by the processing unit 510, so that the processing unit 510 performs the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification. For example, the processing unit 510 may execute the steps shown in fig. 1: S110: adding the basic data to a probability calculation queue, wherein the basic data comprises a fragment to be sequenced in an activation region and a haplotype; S120: acquiring a basic data group from the probability calculation queue, calculating in a parallel manner, based on each piece of basic data in the basic data group, the probability value that each fragment to be sequenced in each activation region is the haplotype, and adding the basic data and the corresponding probability values to a probability output queue; S130: acquiring the basic data and the corresponding probability value from the probability output queue, and calculating variation information in the activation region according to the basic data and the probability value.
As another example, the electronic device may implement the steps shown in fig. 2 to 3.
The storage unit 520 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 521 and/or a cache memory unit 522, and may further include a read only memory unit (ROM) 523.
The storage unit 520 may also include a program/utility 524 having a set (at least one) of program modules 525, such program modules 525 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 570 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (14)

1. A mutation detection method, comprising:
adding the basic data into a probability calculation queue; wherein, the basic data comprises a segment to be tested in an activation region and a haplotype;
acquiring a group of basic data groups in a probability calculation queue, inputting the basic data groups into a calculation platform, calculating probability values of the haplotypes of the fragments to be sequenced in the activation regions based on the basic data in the basic data groups in a parallel mode, and adding the basic data and the corresponding probability values into a probability output queue; the maximum quantity of the basic data included in the basic data group is determined according to a preset quantity, and the preset quantity is determined according to the computing capacity of the computing platform;
and acquiring the basic data and the corresponding probability value in the probability output queue, and calculating variation information in the activation region according to the basic data and the probability value.
2. The method of claim 1, wherein prior to said adding the base data to the probability calculation queue, the method further comprises:
and generating at least one piece of basic data according to the fragments to be sequenced and preset genetic data.
3. The method of claim 2, wherein the generating at least one of the basic data according to the fragments to be sequenced and the preset genetic data comprises:
comparing the fragment to be sequenced with the preset genetic data to obtain comparison data, and identifying at least one activation region according to the comparison data;
determining the haplotypes in each activation region according to the fragments to be sequenced in each activation region and the preset genetic data;
and generating basic data according to the fragments to be sequenced and the haplotypes in the activation regions.
4. The method of claim 3, wherein said determining said haplotype in each of said activation regions from said fragments to be sequenced and said predetermined genetic data in each of said activation regions comprises:
and locally assembling the fragments to be sequenced in each activation region and the preset genetic data, and determining the haplotypes in the activation regions according to the local assembly result.
5. The method according to claim 2, wherein the generating at least one of the basic data according to the segment to be sequenced and the preset genetic data is performed in a multi-thread manner.
6. The method of claim 1, wherein said calculating a probability value that each of the fragments to be sequenced in each of the activation regions is the haplotype based on each of the basis data in the set of basis data comprises:
and respectively inputting the basic data into a preset model to calculate the probability value of the haplotype of the segment to be sequenced in the activation region corresponding to the basic data.
7. The method of claim 6, wherein the preset model comprises a pair hidden Markov model (pair-HMM).
8. The method of claim 1, wherein the calculating variant information in the activation region according to the base data and the probability value comprises:
and counting the probability value that the fragment to be sequenced is the haplotype to determine the variation information of each variation data point on the fragment to be sequenced in the activation region.
9. The method of claim 8, wherein the probability value that the to-be-sequenced fragment is the haplotype is counted by a Bayesian statistical method.
10. The method of claim 1, wherein the calculating, in a parallel manner, of the probability values that the fragments to be sequenced in the activation regions are the haplotypes based on the basic data in the basic data group, and the adding of the basic data and the corresponding probability values to a probability output queue, are performed in parallel by a field-programmable gate array (FPGA).
11. The method of claim 1, wherein the calculating variant information in the activation region from the base data and the probability value is performed in a multi-threaded manner.
12. A variation detecting apparatus, comprising:
the data generation module is used for adding the basic data into the probability calculation queue; wherein, the basic data comprises a segment to be tested in an activation region and a haplotype;
a probability calculation module, configured to obtain a group of basic data groups in a probability calculation queue, input the basic data groups into a calculation platform, calculate, in a parallel manner, probability values that each segment to be sequenced in each activation region is the haplotype based on each basic data in the basic data groups, and add the basic data and the corresponding probability values to a probability output queue; the maximum quantity of the basic data included in the basic data group is determined according to a preset quantity, and the preset quantity is determined according to the computing capacity of the computing platform;
and the variation calculation module is used for acquiring the basic data and the corresponding probability value in the probability output queue and calculating variation information in the activation region according to the basic data and the probability value.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the mutation detection method according to any one of claims 1 to 11.
14. An electronic device, comprising:
a processor; and
memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the variation detection method of any of claims 1 to 11.
CN201911192425.6A 2019-11-28 2019-11-28 Mutation detection method and device, storage medium and electronic equipment Active CN110648718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911192425.6A CN110648718B (en) 2019-11-28 2019-11-28 Mutation detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911192425.6A CN110648718B (en) 2019-11-28 2019-11-28 Mutation detection method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110648718A CN110648718A (en) 2020-01-03
CN110648718B true CN110648718B (en) 2020-03-17

Family

ID=69014729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911192425.6A Active CN110648718B (en) 2019-11-28 2019-11-28 Mutation detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110648718B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121897A (en) * 2016-11-29 2018-06-05 华为技术有限公司 A kind of genome mutation detection method and detection device
CN109416928A (en) * 2016-06-07 2019-03-01 伊路米纳有限公司 For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment

Also Published As

Publication number Publication date
CN110648718A (en) 2020-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant