CN110797081A - Activation area identification method and device, storage medium and electronic equipment - Google Patents

Activation area identification method and device, storage medium and electronic equipment

Info

Publication number
CN110797081A
Authority
CN
China
Prior art keywords
data
preset
probability
windows
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910989749.6A
Other languages
Chinese (zh)
Other versions
CN110797081B (en)
Inventor
赵俊涛
蔡怡然
沈一鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Original Assignee
Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Medical Jiyun Medical Data Research Institute Co Ltd filed Critical Nanjing Medical Jiyun Medical Data Research Institute Co Ltd
Priority to CN201910989749.6A priority Critical patent/CN110797081B/en
Publication of CN110797081A publication Critical patent/CN110797081A/en
Application granted granted Critical
Publication of CN110797081B publication Critical patent/CN110797081B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for identifying an activation region, a computer-readable storage medium, and an electronic device, where the method includes: acquiring comparison data of the data to be identified and preset genetic data, and partitioning the comparison data according to a preset rule to acquire partitioned data blocks; traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that a data point in preset genetic data is an activation point; and respectively smoothing all probability values in each window to obtain a probability curve corresponding to each window, and identifying the activation region in each window according to the probability curve. According to the technical scheme of the embodiment of the disclosure, the data blocks are obtained by blocking the comparison data according to the preset rules, and then the data blocks are processed in parallel respectively, so that the identification efficiency of the activation region can be increased, and the problem of limiting the variation detection speed caused by low identification efficiency of the activation region is avoided.

Description

Activation area identification method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an activation region identification method and apparatus, a computer-readable storage medium, and an electronic device.
Background
During disease treatment, the same medication often produces markedly different effects in different patients, largely because of genetic differences between individuals. To treat different individuals in a more targeted way, researchers continue to investigate how to perform mutation detection on large amounts of genetic data.
Current mutation detection typically relies on the Genome Analysis Toolkit (GATK) algorithm. The algorithm first identifies the activation regions, then realigns the sequences to a reference genome, and finally calculates the genotype probability of each data point based on a Bayesian model to identify variations. However, mutation detection by this method is often slow and time-consuming.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to an activation region identification method and apparatus, a computer-readable storage medium, and an electronic device, so as to overcome the problems of low mutation detection speed and high time consumption at least to some extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided an activation region identification method, including:
acquiring comparison data of data to be identified and preset genetic data, and partitioning the comparison data according to a preset rule to acquire partitioned data blocks;
traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that a data point in the preset genetic data is an activation point;
and smoothing all the probability values in the windows respectively to obtain probability curves corresponding to the windows, and identifying the activation regions in the windows according to the probability curves.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset rule includes a chromosome rule and a preset block value;
the blocking the comparison data according to a preset rule to obtain a blocked data block includes:
dividing the comparison data according to the chromosome where the preset genetic data is located to obtain chromosome data corresponding to each chromosome;
and partitioning each chromosome data according to a preset partitioning value to obtain at least one data block.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset block value includes a preset block length or a preset block number.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, traversing each of the data blocks according to a preset window length to calculate a probability value that a data point in the preset genetic data is an activation point includes:
searching a first data point covered by data to be identified in preset genetic data corresponding to each data block through a preset tool;
and traversing each data block by a preset window length from the first data point corresponding to each data block to calculate the probability value corresponding to each data point.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the calculating a probability value corresponding to each data point includes:
calculating the matching degree of all the data to be identified covered on each data point and preset genetic data;
and calculating the average value of the corresponding matching degree of each data point, and configuring the average value as the probability value of each data point as an activation point.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the smoothing all the probability values in each window to obtain a probability curve corresponding to each window includes:
and respectively carrying out smoothing processing on all the probability values in the windows in parallel to obtain probability curves corresponding to the windows.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the smoothing process includes a Gaussian filtering process.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, determining whether there is an activation region in each of the windows according to the probability curve includes:
identifying continuous regions with probability values larger than a preset threshold value in the probability curves corresponding to the windows, and configuring the continuous regions as activation regions in the windows; the continuous region comprises probability values corresponding to at least a preset number of continuous data points.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the traversing of the data blocks according to the preset window length in a parallel manner to calculate the probability value that a data point in the preset genetic data is an activation point is implemented in parallel by a field-programmable gate array (FPGA).
According to a second aspect of the present disclosure, there is provided an activation region identification apparatus including:
the data blocking module is used for acquiring comparison data of the data to be identified and preset genetic data and blocking the comparison data according to a preset rule to acquire blocked data blocks;
the probability calculation module is used for traversing the data blocks in a parallel mode according to the length of a preset window so as to calculate the probability value that the data points in the preset genetic data are the activation points;
and the region identification module is used for respectively smoothing all the probability values in the windows to obtain probability curves corresponding to the windows and judging whether activation regions exist in the windows according to the probability curves.
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the area identification module includes:
and the smoothing unit is used for respectively performing smoothing processing on all the probability values in the windows in parallel so as to obtain the probability curves corresponding to the windows.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the activation region identification method as described in the first aspect of the embodiments above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the activation region identification method as described in the first aspect of the embodiments above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the activation region identification method provided by the embodiments of the present disclosure, comparison data of the data to be identified and preset genetic data is first obtained and blocked according to a preset rule to obtain data blocks; each data block is then traversed in parallel according to a preset window length to calculate the probability value that a data point in the preset genetic data is an activation point; finally, all the probability values in each window are smoothed to obtain a probability curve corresponding to each window, and the activation regions in each window are identified according to the probability curve. Because the comparison data is blocked according to the preset rule and the resulting data blocks are processed in parallel, the identification efficiency of the activation region can be increased, which prevents slow activation region identification from limiting the mutation detection speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 schematically illustrates a flow chart of an activation region identification method in an exemplary embodiment of the present disclosure;
fig. 2 is a flowchart schematically illustrating a method for blocking the comparison data according to a preset rule to obtain a blocked data block in an exemplary embodiment of the present disclosure;
fig. 3 schematically illustrates a flowchart of a method of traversing each of the data blocks by a preset window length to calculate a probability value that a data point in the preset genetic data is an activation point in an exemplary embodiment of the present disclosure;
fig. 4 schematically illustrates a flowchart of a method for calculating probability values corresponding to the data points in an exemplary embodiment of the present disclosure;
fig. 5 schematically illustrates a composition diagram of an activation region recognition apparatus in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a structural diagram of a computer system suitable for use with an electronic device that implements an exemplary embodiment of the present disclosure;
fig. 7 schematically illustrates a schematic diagram of a computer-readable storage medium, according to some embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
When the GATK algorithm identifies activation regions, it can only do so by traversing the entire comparison data in a single thread, and the subsequent steps depend on the result of that identification. The activation region identification step therefore becomes the rate-limiting step of mutation detection with the GATK algorithm. For this reason, the mutation detection speed can be increased by increasing the identification speed of the activation region.
In the present exemplary embodiment, first, an activation region identification method is provided, which can be applied to a mutation detection process for genetic data. Referring to fig. 1, the above-described activation region identification method may include the steps of:
S110, acquiring comparison data of the data to be identified and preset genetic data, and blocking the comparison data according to a preset rule to acquire blocked data blocks;
S120, traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that a data point in the preset genetic data is an activation point;
S130, smoothing all the probability values in the windows respectively to obtain probability curves corresponding to the windows, and identifying the activation regions in the windows according to the probability curves.
According to the activation region identification method provided in this exemplary embodiment, the comparison data is blocked according to the preset rule and the resulting data blocks are processed in parallel, so that the identification efficiency of the activation region can be increased and slow activation region identification no longer limits the mutation detection speed.
Hereinafter, each step of the activation area recognition method in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.
Step S110, acquiring comparison data of the data to be identified and preset genetic data, and blocking the comparison data according to a preset rule to acquire a blocked data block.
In an example embodiment of the present disclosure, the data required for identifying the activation region during mutation detection is the comparison data of the data to be identified and the preset genetic data. The comparison data is generated by overlaying the data to be identified onto the matching part of the preset genetic data; the overlay only places the data on top of the preset genetic data and does not replace the matched part. Through the comparison data, the data to be identified can be mapped to the matching part of the preset genetic data, which facilitates the subsequent calculation and identification of the activation regions in the data to be identified.
In an example embodiment of the present disclosure, the preset genetic data may include known genetic data of various species, such as genes and gene products. The data to be identified is a plurality of test data fragments obtained by randomly fragmenting the genetic data under test, so there is likewise a plurality of data to be identified. In addition, to improve the analysis and detection capability for the test data fragments, the fragments may be amplified through an amplification technique, and the amplified test data fragments are used as the data to be identified.
In an example embodiment of the present disclosure, the preset rule includes a chromosome rule and a preset blocking value. In this case, blocking the comparison data according to the preset rule to obtain blocked data blocks may include, as shown in fig. 2, steps S210 to S220:
Step S210, dividing the comparison data according to the chromosome where the preset genetic data is located to obtain chromosome data corresponding to each chromosome.
In an example embodiment of the present disclosure, the preset genetic data may include genes of a plurality of chromosomes, and the comparison data may be divided into chromosome data corresponding to each chromosome according to the chromosome in which the preset genetic data is located.
Step S220, partitioning each chromosome data according to a preset partitioning value to obtain at least one data block.
In an example embodiment of the present disclosure, the preset blocking value includes a preset block length or a preset number of blocks. Since the chromosome data corresponding to each chromosome may still be very long, it can be divided into at least one data block by setting a preset block length or a preset number of blocks. Specifically, the preset block length and the preset number of blocks may be set according to the species to which the preset genetic data belongs, or according to the user's requirements, which is not limited in this disclosure. For example, if the preset block length is 2000 base pairs, the chromosome data is divided into one data block every 2000 base pairs; if the preset number of blocks is 20, the chromosome data is divided evenly into 20 data blocks.
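The two-stage blocking described above (first by chromosome, then by a preset block length or a preset number of blocks) can be sketched in software as follows. This is only an illustrative sketch: the function names, the `(chromosome, position, payload)` record layout, and the half-open coordinate ranges are assumptions made for the example and are not taken from the disclosure.

```python
from collections import defaultdict


def split_by_chromosome(records):
    """Group comparison records by chromosome name.

    `records` is an iterable of (chromosome, position, payload) tuples;
    this record layout is a hypothetical choice for the sketch.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, payload in records:
        by_chrom[chrom].append((pos, payload))
    return dict(by_chrom)


def block_ranges(chrom_length, block_length=None, block_count=None):
    """Return half-open [start, end) ranges that cover one chromosome.

    Exactly one of `block_length` (preset block length) or
    `block_count` (preset number of blocks) must be given.
    """
    if (block_length is None) == (block_count is None):
        raise ValueError("give exactly one of block_length or block_count")
    if block_length is not None:
        # One block every `block_length` positions; the last may be shorter.
        return [(start, min(start + block_length, chrom_length))
                for start in range(0, chrom_length, block_length)]
    # Evenly spaced boundaries so that exactly `block_count` blocks result.
    bounds = [i * chrom_length // block_count for i in range(block_count + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

For instance, `block_ranges(5000, block_length=2000)` yields three blocks, the last of which is shorter, while `block_ranges(40000, block_count=20)` yields twenty blocks of 2000 positions each.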
In an example embodiment of the present disclosure, when the preset genetic data includes genes of a plurality of chromosomes, the blocking of the chromosome data may also be implemented in parallel by an FPGA (field-programmable gate array) computing platform. Specifically, the data of all chromosomes can be input into the FPGA computing platform, and the blocking step is executed in parallel for each chromosome.
Step S120, traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that the data point in the preset genetic data is the activation point.
In an example embodiment of the present disclosure, the preset window length may be set according to a requirement of mutation detection, or may be consistent with a window length setting of a mutation detection system. Specifically, it can be set to 300 base pairs.
In an example embodiment of the present disclosure, the traversing the data blocks according to the preset window length in a parallel manner to calculate the probability value that the data point in the preset genetic data is the activation point may be implemented in parallel by an FPGA computing platform. Specifically, after the data is partitioned in step S110 to obtain data blocks, the partitioned data are input to the FPGA computing platform, and traversal computation is performed in parallel. In addition, other ways may also be used to implement parallel traversal, which is not specifically limited by this disclosure. The data blocks are traversed simultaneously in a parallel mode, so that the step of traversing the whole comparison data by a single thread is avoided, the speed of traversing the comparison data to identify the activation region is increased, and the problem of limiting the variation detection speed caused by low identification speed is avoided.
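The disclosure performs the parallel traversal on an FPGA computing platform and notes that other ways may also be used. As one such alternative, the following sketch dispatches data blocks to a software worker pool. The function names and the `(start, end)` block representation are assumptions for illustration, and the per-window work is reduced to a placeholder that only records the window ranges. Threads are used to keep the sketch short; in CPython they would not by themselves speed up CPU-bound work, so a process pool or dedicated hardware would be the realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def traverse_block(block, window_length=300):
    """Walk one data block window by window.

    `block` is a hypothetical (start, end) half-open range. A real
    implementation would compute per-point probability values inside
    each window; this placeholder only returns the window ranges.
    """
    start, end = block
    return [(w, min(w + window_length, end))
            for w in range(start, end, window_length)]


def traverse_blocks_in_parallel(blocks, window_length=300, max_workers=4):
    """Traverse all blocks concurrently; results keep the input order."""
    worker = partial(traverse_block, window_length=window_length)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, blocks))
```

Because each block is traversed independently, no block has to wait for a single thread to finish the preceding ones, which is the property the embodiment relies on.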
Further, referring to fig. 3, the traversing each data block according to the preset window length to calculate the probability value that the data point in the preset genetic data is the activation point includes the following steps S310 to S320:
Step S310, searching, through a preset tool, for the first data point covered by the data to be identified in the preset genetic data corresponding to each data block.
Step S320, starting from the first data point corresponding to each data block, traversing each data block by a preset window length to calculate the probability value corresponding to each data point.
In an example embodiment of the present disclosure, the plurality of data to be identified does not necessarily cover the preset genetic data completely. Therefore, when each data block is traversed, the first data point in the data block that is covered by the data to be identified may be located by a preset tool, and traversal then proceeds by the preset window length with that first data point as the starting point, so as to calculate the probability values corresponding to the data points in each window. Specifically, the preset tool may be the samtools mpileup tool used with the GATK algorithm, or another tool capable of locating the first data point. Starting traversal from the first covered data point avoids computing on the data points at the front of the data block that are not covered by any data to be identified, and thus avoids slowing down activation region identification with useless calculation.
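Steps S310 and S320 can be illustrated with a small sketch that locates the first covered data point in a block and starts the window traversal there. The disclosure locates this point with a preset tool; the plain interval scan below is a stand-in for that tool, and the function names and the `(read_start, read_end)` half-open interval layout are assumptions for the example.

```python
def first_covered_point(block_start, block_end, reads):
    """Return the first position in [block_start, block_end) covered by a read.

    `reads` is a list of (read_start, read_end) half-open intervals on
    the reference coordinates (a hypothetical layout). Returns None
    when no read overlaps the block.
    """
    starts = [max(read_start, block_start)
              for read_start, read_end in reads
              if read_start < block_end and read_end > block_start]
    return min(starts) if starts else None


def windows_from_first_point(block_start, block_end, reads, window_length=300):
    """List the windows to traverse, skipping the uncovered front of the block."""
    first = first_covered_point(block_start, block_end, reads)
    if first is None:
        return []  # nothing in this block is covered, so nothing to compute
    return [(w, min(w + window_length, block_end))
            for w in range(first, block_end, window_length)]
```

With a block spanning positions 0 to 1000 and a single read covering 450 to 600, traversal starts at position 450 rather than 0, so the uncovered leading positions are never visited.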
In an example embodiment of the present disclosure, the calculating the probability value corresponding to each data point, as shown in fig. 4, includes the following steps S410 to S420:
Step S410, calculating the matching degree between all the data to be identified covering each data point and the preset genetic data.
In an example embodiment of the present disclosure, multiple pieces of data to be identified may overlap on the same data point of the preset genetic data, so a covered data point may be covered by several pieces of data to be identified at the same time. For a given data point, the matching degree between each piece of data to be identified covering that data point and the preset genetic data it covers needs to be calculated. Furthermore, the calculated matching degrees can be normalized before the subsequent calculation is carried out.
Step S420, calculating an average value of the corresponding matching degrees of each data point, and configuring the average value as a probability value that each data point is an activation point.
In an example embodiment of the present disclosure, an average value of matching degrees of all data to be identified corresponding to the data point may be calculated, and the average value may be configured as a probability value that the data point is an activation point.
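Steps S410 and S420 reduce to averaging, per data point, the matching degrees of all data to be identified that cover the point. The sketch below assumes the matching degrees have already been computed and normalized to the range 0 to 1; how the matching degree itself is scored is not specified in this passage, so it is treated as an input. The function name and the dictionary layout are hypothetical.

```python
def activation_probabilities(scores_by_point):
    """Average the matching degrees covering each data point.

    `scores_by_point` maps a data point position to the list of
    normalized matching degrees (assumed to lie in [0, 1]) of every
    piece of data to be identified that covers that position.
    Positions with no covering data are skipped.
    """
    probabilities = {}
    for position, scores in scores_by_point.items():
        if scores:
            probabilities[position] = sum(scores) / len(scores)
    return probabilities
```

For example, a data point covered by three pieces of data with matching degrees 0.0, 0.5 and 1.0 would be assigned a probability value of 0.5.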
Step S130, performing smoothing processing on all the probability values in each window to obtain a probability curve corresponding to each window, and identifying an activation region in each window according to the probability curve.
In an example embodiment of the present disclosure, smoothing all the probability values in each window to obtain a probability curve corresponding to each window may include: and respectively carrying out smoothing processing on all the probability values in the windows in parallel to obtain probability curves corresponding to the windows.
In an example embodiment of the present disclosure, the smoothing step may also be implemented in parallel by an FPGA computing platform. Specifically, all data points within the preset window length and their corresponding probability values may be input to the FPGA computing platform, and the smoothing is performed in parallel. Further, the smoothing process may include Gaussian filtering, which yields the final probability value of each data point and thus the probability curve corresponding to each window. Parallel smoothing avoids the low smoothing efficiency caused by cyclically accumulating the probability value of each data point in the GATK algorithm.
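Gaussian filtering of the probability values in one window can be sketched as a discrete convolution with a normalized Gaussian kernel. The disclosure does not give the kernel width or the edge handling, so the `sigma` default, the three-sigma kernel radius, and the renormalization at the window edges below are all assumptions chosen for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights for offsets -radius..radius, summing to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(values, sigma=2.0, radius=None):
    """Smooth the probability values of one window into a probability curve.

    Near the window edges only the in-range part of the kernel is used
    and the weights are renormalized, so a constant input stays constant.
    """
    if radius is None:
        radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        weight_sum = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius  # input index paired with this kernel weight
            if 0 <= j < n:
                acc += w * values[j]
                weight_sum += w
        smoothed.append(acc / weight_sum)
    return smoothed
```

Each output point depends only on its neighbours in the input, so the windows (and the points within a window) can be smoothed independently, which is what makes the step amenable to parallel execution.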
In an example embodiment of the present disclosure, determining whether an activation region exists in each of the windows according to the probability curve includes: identifying continuous regions with probability values larger than a preset threshold value in the probability curve corresponding to each window, and configuring the continuous regions as activation regions in the window.
In an example embodiment of the present disclosure, the continuous region includes probability values corresponding to at least a preset number of consecutive data points. The preset threshold and the preset number can be set according to the species, or according to the requirements of mutation detection. For example, in a window of 300 base pairs, when the preset threshold is 0.002 and the preset number is 50, the probability curve corresponding to the current 300 base pairs is searched for a region in which more than 50 consecutive data points have probability values greater than 0.002, and that region is the identified activation region.
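The thresholding rule can be sketched as a single pass over the probability curve that collects runs of consecutive points above the threshold and keeps only runs that are long enough. The function name and the half-open `(start, end)` index pairs are assumptions. The sketch follows the general statement that a region must contain at least the preset number of consecutive points; the worked example in the text says "more than 50", which would correspond to a strict comparison instead.

```python
def find_activation_regions(curve, threshold=0.002, min_points=50):
    """Return half-open (start, end) index ranges of activation regions.

    A region is a run of consecutive points whose probability value is
    strictly greater than `threshold` and whose length is at least
    `min_points`.
    """
    regions = []
    run_start = None
    for i, p in enumerate(curve):
        if p > threshold:
            if run_start is None:
                run_start = i  # a new run begins here
        elif run_start is not None:
            if i - run_start >= min_points:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the window.
    if run_start is not None and len(curve) - run_start >= min_points:
        regions.append((run_start, len(curve)))
    return regions
```

In a 300-point window containing one run of 60 points and one run of 30 points above the threshold, only the 60-point run is reported with the default settings.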
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
In addition, in an exemplary embodiment of the present disclosure, an activation region recognition apparatus is also provided. Referring to fig. 5, the activation area recognition apparatus 500 includes: a data chunking module 510, a probability calculation module 520, and a region identification module 530.
The data blocking module 510 may be configured to obtain comparison data between data to be identified and preset genetic data, and block the comparison data according to a preset rule to obtain a blocked data block;
the probability calculation module 520 may be configured to traverse the data blocks according to a preset window length in a parallel manner to calculate a probability value that a data point in the preset genetic data is an activation point;
the region identification module 530 may be configured to perform smoothing on all the probability values in each window to obtain a probability curve corresponding to each window, and determine whether an activation region exists in each window according to the probability curve.
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the region identification module includes a smoothing unit 531, which is configured to perform smoothing on all the probability values in each of the windows in parallel, respectively, so as to obtain a probability curve corresponding to the window.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data partitioning module 510 may be configured to partition the comparison data according to a chromosome in which preset genetic data is located to obtain chromosome data corresponding to each chromosome; and partitioning each chromosome data according to a preset partitioning value to obtain at least one data block.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset block value includes a preset block length or a preset block number.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the probability calculation module 520 may be configured to search, by using a preset tool, for the first data point covered by data to be identified in the preset genetic data corresponding to each data block; and to traverse each data block by a preset window length from the first data point corresponding to each data block to calculate the probability value corresponding to each data point.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the probability calculation module 520 may be configured to calculate a matching degree between all the data to be identified covered on each of the data points and preset genetic data; and calculating the average value of the corresponding matching degree of each data point, and configuring the average value as the probability value of each data point as an activation point.
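The averaging step can be illustrated with a minimal sketch. The disclosure does not define how the matching degree is scored, so this example assumes a matching degree of 1.0 when a read base equals the reference base and 0.0 otherwise, assumes gap-free reads given as `(start, sequence)`, and assigns 0.0 to uncovered positions; all of these choices, and the name `activation_probabilities`, are illustrative assumptions.

```python
def activation_probabilities(reference, reads):
    """Average the per-read matching degree at every reference position.

    reference: string of reference bases.
    reads: list of (start, sequence) pairs aligned without gaps (assumption).
    Returns one probability per reference position; positions covered by
    no read are given 0.0 (assumption).
    """
    totals = [0.0] * len(reference)
    counts = [0] * len(reference)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            pos = start + offset
            if 0 <= pos < len(reference):
                # Assumed matching degree: 1.0 for a match, 0.0 otherwise.
                totals[pos] += 1.0 if base == reference[pos] else 0.0
                counts[pos] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

For example, with reference `"ACGT"` and reads `(0, "AC")`, `(1, "CT")`, and `(1, "GG")`, position 1 is covered by three reads of which two match, giving a probability of 2/3.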
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the smoothing process includes a gaussian filtering process.
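One possible form of the Gaussian filtering is sketched below in plain Python. The kernel radius of roughly three standard deviations and the renormalization of the truncated kernel at the window edges are assumptions of this example; the disclosure does not specify the filter parameters.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if radius is None:
        radius = max(1, int(3 * sigma + 0.5))  # about three standard deviations
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(values, sigma=1.0):
    """Smooth the probability values of one window into a probability curve.

    Near the window edges only the in-range part of the kernel is used and
    its weights are renormalized, so a constant input stays constant.
    """
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        norm = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(values):
                acc += weight * values[j]
                norm += weight
        smoothed.append(acc / norm)
    return smoothed
```

Because each window is smoothed independently of the others, calls to `gaussian_smooth` for different windows can be dispatched in parallel.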
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the region identification module 530 may be configured to identify a continuous region having a probability value greater than a preset threshold in the probability curve corresponding to each of the windows, and configure the continuous region as an activation region in the window; the continuous region comprises probability values corresponding to at least a preset number of continuous data points.
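The thresholding rule above can be sketched as a single pass over the probability curve. The function name `find_activation_regions` and the half-open index ranges it returns are assumptions of this example.

```python
def find_activation_regions(curve, threshold, min_points):
    """Return half-open (start, end) index ranges of activation regions.

    A region is a run of consecutive points whose probability is strictly
    greater than threshold and whose length is at least min_points.
    """
    regions = []
    run_start = None
    for i, probability in enumerate(curve):
        if probability > threshold:
            if run_start is None:
                run_start = i  # a new run begins here
        elif run_start is not None:
            if i - run_start >= min_points:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the window.
    if run_start is not None and len(curve) - run_start >= min_points:
        regions.append((run_start, len(curve)))
    return regions
```

A window whose result is an empty list is judged to contain no activation region.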
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the parallel traversal of the data blocks according to the preset window length, to calculate the probability value that a data point in the preset genetic data is an activation point, is implemented by a field-programmable gate array (FPGA).
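Hardware parallelism of this kind cannot be reproduced in a short listing, but the block-level independence it relies on can be illustrated with a software stand-in. The sketch below uses a thread pool from the Python standard library in place of the gate array; it is not the disclosed hardware implementation, and the name `process_blocks_in_parallel` and the `worker` callable are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks_in_parallel(blocks, worker, max_workers=4):
    """Apply worker to every data block concurrently.

    Results are returned in the same order as the input blocks. The thread
    pool is only a software stand-in for the parallel hardware described in
    the text; worker is any function that processes one block independently.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, blocks))
```

The pattern works only because no block depends on the result of another block, which is the same property that allows the computation to be laid out in parallel on programmable hardware.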
For details that are not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the activation region identification method of the present disclosure described above.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among, and embodied by, a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above-described activation region identification method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to such an embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit 620 stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1: step S110: acquiring comparison data of data to be identified and preset genetic data, and partitioning the comparison data according to a preset rule to acquire partitioned data blocks; step S120: traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that a data point in the preset genetic data is an activation point; step S130: smoothing all the probability values in the windows respectively to obtain probability curves corresponding to the windows, and identifying the activation regions in the windows according to the probability curves.
As another example, the electronic device may implement the steps shown in fig. 2 to 4.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present disclosure is described. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. An activation region identification method, comprising:
acquiring comparison data of data to be identified and preset genetic data, and partitioning the comparison data according to a preset rule to acquire partitioned data blocks;
traversing each data block according to the length of a preset window in a parallel mode to calculate the probability value that a data point in the preset genetic data is an activation point;
and smoothing all the probability values in the windows respectively to obtain probability curves corresponding to the windows, and identifying the activation regions in the windows according to the probability curves.
2. The method of claim 1, wherein the preset rules include chromosome rules and preset partitioning values;
the blocking the comparison data according to a preset rule to obtain a blocked data block includes:
dividing the comparison data according to the chromosome where the preset genetic data is located to obtain chromosome data corresponding to each chromosome;
and partitioning each chromosome data according to a preset partitioning value to obtain at least one data block.
3. The method of claim 2, wherein the preset block value comprises a preset block length or a preset number of blocks.
4. The method of claim 1, wherein traversing each of the data blocks by a preset window length to calculate a probability value that a data point in the preset genetic data is an activation point comprises:
searching a first data point covered by data to be identified in preset genetic data corresponding to each data block through a preset tool;
and traversing each data block by a preset window length from the first data point corresponding to each data block to calculate the probability value corresponding to each data block.
5. The method of claim 4, wherein said calculating probability values corresponding to each of said data points comprises:
calculating the matching degree of all the data to be identified covered on each data point and preset genetic data;
and calculating the average value of the corresponding matching degree of each data point, and configuring the average value as the probability value of each data point as an activation point.
6. The method of claim 1, wherein smoothing all the probability values in each window to obtain a probability curve corresponding to each window comprises:
and respectively carrying out smoothing processing on all the probability values in the windows in parallel to obtain probability curves corresponding to the windows.
7. The method of claim 6, wherein the smoothing process comprises a Gaussian filtering process.
8. The method of claim 1, wherein determining whether an activation region exists in each of the windows according to the probability curve comprises:
identifying continuous regions with probability values larger than a preset threshold value in the probability curves corresponding to the windows, and configuring the continuous regions as activated regions in the windows; the continuous region comprises probability values corresponding to at least a preset number of continuous data points.
9. The method of claim 1, wherein traversing the data blocks in parallel according to a preset window length to calculate the probability value that a data point in the preset genetic data is an activation point is implemented in parallel by a field-programmable gate array.
10. An activation region identification apparatus, comprising:
the data blocking module is used for acquiring comparison data of the data to be identified and preset genetic data and blocking the comparison data according to a preset rule to acquire blocked data blocks;
the probability calculation module is used for traversing the data blocks in a parallel mode according to the length of a preset window so as to calculate the probability value that the data points in the preset genetic data are the activation points;
and the region identification module is used for respectively smoothing all the probability values in the windows to obtain probability curves corresponding to the windows and judging whether the activated regions exist in the windows according to the probability curves.
11. The apparatus of claim 10, wherein the region identification module comprises:
and the smoothing unit is used for respectively performing smoothing processing on all the probability values in the windows in parallel so as to obtain the probability curves corresponding to the windows.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out an activation region identification method according to any one of claims 1 to 9.
13. An electronic device, comprising:
a processor; and
memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement an activation region identification method as claimed in any one of claims 1 to 9.
CN201910989749.6A 2019-10-17 2019-10-17 Activation area identification method and device, storage medium and electronic equipment Active CN110797081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910989749.6A CN110797081B (en) 2019-10-17 2019-10-17 Activation area identification method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910989749.6A CN110797081B (en) 2019-10-17 2019-10-17 Activation area identification method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110797081A true CN110797081A (en) 2020-02-14
CN110797081B CN110797081B (en) 2020-11-10

Family

ID=69440387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910989749.6A Active CN110797081B (en) 2019-10-17 2019-10-17 Activation area identification method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110797081B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020009710A1 (en) * 1999-12-16 2002-01-24 Stuart Tugendreich Random domain mapping
CN105986008A (en) * 2015-01-27 2016-10-05 深圳华大基因科技有限公司 CNV detection method and CNV detection apparatus
CN104762402A (en) * 2015-04-21 2015-07-08 广州定康信息科技有限公司 Method for rapidly detecting human genome single base mutation and micro-insertion deletion
CN106603591A (en) * 2015-10-14 2017-04-26 北京聚道科技有限公司 Processing method and system facing transmission and preprocessing of genome detection data
CN109416928A (en) * 2016-06-07 2019-03-01 伊路米纳有限公司 For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment
CN108121897A (en) * 2016-11-29 2018-06-05 华为技术有限公司 A kind of genome mutation detection method and detection device
CN108573125A (en) * 2018-04-19 2018-09-25 上海亿康医学检验所有限公司 Method for detecting genome copy number variation and device comprising same

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114496077A (en) * 2022-04-15 2022-05-13 北京贝瑞和康生物技术有限公司 Methods, devices, and media for detecting single nucleotide variations and indels
CN114496077B (en) * 2022-04-15 2022-06-21 北京贝瑞和康生物技术有限公司 Methods, devices, and media for detecting single nucleotide variations and indels

Also Published As

Publication number Publication date
CN110797081B (en) 2020-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant