WO2016208827A1

WO2016208827A1 - Method and device for analyzing gene

Info

Publication number: WO2016208827A1
Application number: PCT/KR2015/012925
Authority: WO
Inventors: 박웅양; 김상철; 남재용
Original assignee: 사회복지법인 삼성생명공익재단
Priority date: 2015-06-24
Filing date: 2015-11-30
Publication date: 2016-12-29

Abstract

A method and a device for analyzing a gene, which: generate a reference data set by carrying out deep sequencing of reference genes; analyze the depth of test genes by carrying out deep sequencing of the test genes; and determine, by comparing the analyzed depth with the depth of the reference genes included in the reference data set, whether a copy number variation (CNV) gene exists among the test genes.

Description

Methods and apparatus for analyzing genes

The present invention relates to a method and apparatus for analyzing genes, and more particularly, to a method and apparatus for analyzing genes for copy number variation (CNV).

A genome is all the genetic information of a living organism. Various technologies, such as DNA chips, Next Generation Sequencing technology, and Next Next Generation Sequencing technology, have been developed for sequencing an individual's genome. Analysis of genetic information such as nucleic acid sequences and proteins is widely used to find genes associated with diseases such as diabetes and cancer, or to identify correlations between genetic diversity and the expression characteristics of individuals. In particular, genetic data collected from an individual is important for identifying the individual's genetic characteristics associated with different symptoms or disease progression. Therefore, genetic data such as an individual's nucleic acid sequences and proteins are essential for identifying current and future disease-related information, so as to prevent disease or to select an optimal treatment method at an early stage of disease. Techniques are being studied for accurately analyzing individual genetic data and diagnosing an individual's disease using genome detection equipment that detects single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) as genetic information of an organism.

An object of the present embodiments is to provide a method and apparatus for analyzing a gene. The technical problem to be solved by the present embodiments is not limited to the technical problem described above, and further technical problems can be inferred from the following embodiments.

According to one aspect, a method of analyzing a gene comprises: generating a reference data set relating to depths of reads aligned to each of the reference genes by performing deep sequencing on the reference genes; analyzing depths of reads aligned to each of the test genes by performing the deep sequencing on the test genes; and comparing the analyzed depths with the depths of the reference genes included in the reference data set to determine whether a copy number variation (CNV) gene is present among the test genes.

In addition, the analyzing step analyzes the depth of the reads aligned with exon sites of the test genes.

In the determining, the presence of the copy number variation (CNV) gene may be determined by comparing the depths between the reference genes and the test genes for the same exon region.

In the determining, the copy number variation (CNV) gene may be determined to be present when, among the exon sites of the test genes, there is an exon site for which the difference in depth between the corresponding exon sites of the reference genes and the test genes is not statistically significant.

The generating may further include: obtaining read-depths corresponding to the reference genes for each of a plurality of people through the deep sequencing of the people's genetic data; clustering the people into different groups according to the distribution of the obtained read-depths; and acquiring standard depths of each of the reference genes, representing each of the groups, by normalizing the read-depths acquired for each of the reference genes per group, wherein the reference data set includes, for each of the groups, data representing the standard depths of each of the reference genes.

The determining may further include: determining, among the groups, the group having the smallest statistical difference between the distribution of the analyzed depths and the distribution of the standard depths; and determining whether the copy number variation (CNV) gene is present by comparing the analyzed depths with the standard depths corresponding to the determined group.

The method further includes obtaining the genetic data of the people from public genomic data or public HapMap data.

In addition, the reference genes or the test genes may be obtained from biopsy tissue or from formalin-fixed, paraffin-embedded (FFPE) tissue.

In addition, when it is determined that the copy number variation (CNV) gene is present among the test genes, the method may further include performing an annotation for identifying a drug corresponding to the copy number variation (CNV) gene.

According to another aspect, there is provided a computer-readable recording medium having recorded thereon a program for executing the method on a computer.

According to another aspect, an apparatus for analyzing a gene may include: a reference data generator configured to generate a reference data set relating to depths of reads aligned to each of the reference genes by performing deep sequencing on the reference genes; an analysis unit configured to analyze depths of reads aligned to each of the test genes by performing the deep sequencing on the test genes; and a determination unit configured to determine whether a copy number variation (CNV) gene exists among the test genes by comparing the analyzed depths with the depths of the reference genes included in the reference data set.

In addition, the analysis unit analyzes the depth of the reads aligned with exon sites of the test genes.

In addition, the determination unit determines the existence of the copy number variation (CNV) gene by comparing the depths between the reference genes and the test genes for the same exon region.

In addition, the determination unit determines that the copy number variation (CNV) gene is present when, among the exon regions of the test genes, there is an exon region for which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant.

In addition, the reference data generator obtains read-depths corresponding to the reference genes for each of a plurality of people through the deep sequencing of the people's genetic data, clusters the people into different groups according to the distribution of the read-depths, and normalizes the read-depths obtained for each of the reference genes per group, thereby obtaining standard depths of each of the reference genes representing each of the groups. The reference data set includes, for each of the groups, data representing the standard depths of each of the reference genes.

The determination unit may determine, among the groups, the group having the smallest statistical difference between the distribution of the analyzed depths and the distribution of the standard depths, and may determine whether the copy number variation (CNV) gene is present by comparing the analyzed depths with the standard depths corresponding to the determined group.

In addition, the reference data generator obtains the genetic data of the people from public genomic data or public HapMap data.

In addition, when it is determined that the copy number variation (CNV) gene is present among the test genes, the determination unit performs an annotation for identifying a drug corresponding to the copy number variation (CNV) gene.

As described above, it is possible to analyze more accurately whether the copy number variation (CNV) gene is present from the test gene of the subject.

FIG. 1 is a view for explaining a gene analysis apparatus according to an embodiment.

FIG. 2 is a block diagram illustrating hardware configurations of a gene analysis apparatus according to an embodiment.

FIG. 3 is a flowchart of a method of generating a reference data set according to an embodiment.

FIG. 4 is a diagram for describing obtaining read-depths corresponding to reference genes for each of a plurality of people (e.g., normal people), according to an embodiment.

FIG. 5 is a diagram for describing deep sequencing of exon regions according to an embodiment.

FIG. 6 is a diagram illustrating clustering people into different groups according to a distribution of read-depths obtained from a normal population 400 according to an embodiment.

FIG. 7 is a diagram for describing standard depths of each of reference genes representing a group according to an embodiment.

FIG. 8 is a diagram for describing deep sequencing of test genes obtained from biological samples of a subject, according to an embodiment.

FIG. 9 is a flowchart of a method of determining whether a copy number variation (CNV) gene is present according to an embodiment.

FIG. 10 illustrates a method for determining whether a copy number variation (CNV) gene is present according to an embodiment.

FIG. 11 is a flowchart of a method of analyzing a gene, according to an embodiment.

FIG. 12 is a block diagram illustrating hardware configurations of a computing device according to an embodiment.

The terms used in the present embodiments are general terms that are currently in wide use, selected in consideration of their functions in the present embodiments, but they may vary depending on the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In addition, in certain cases, some terms have been arbitrarily selected, in which case their meaning will be described in detail in the description of the corresponding embodiment. Therefore, the terms used in the present embodiments should be defined based on their meanings and on the contents throughout the embodiments, rather than simply on the names of the terms.

In the descriptions of the embodiments, when a part is said to be connected to another part, this includes not only the case where the part is directly connected but also the case where it is electrically connected with another component in between. In addition, when a part is said to include a certain component, this means that it may further include other components, rather than excluding other components, unless specifically stated otherwise. In addition, the terms "... unit" and "... module" described in the embodiments mean a unit for processing at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Terms such as “consisting of” or “comprising” as used in the present embodiments should not be construed as necessarily including all of the various components or steps described in the specification. It is to be understood that some of the components or steps may not be included, or that additional components or steps may be further included.

The description of the following embodiments should not be construed as limiting the scope of rights, and matters that those skilled in the art can easily infer should be construed as belonging to the scope of the embodiments. Hereinafter, embodiments provided only as examples will be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the genetic analysis apparatus 10 uses genetic data 20 obtained from a normal population and genetic data 30 obtained from a subject to identify whether a copy number variation (CNV) gene is present among the test genes of the subject.

The genetic data 20 and the genetic data 30 received by the genetic analysis apparatus 10 may correspond to genetic data in the FASTQ file format obtained by next generation sequencing (NGS). The FASTQ format is a text-based format that stores biological sequences, such as nucleotide sequences, together with their corresponding quality scores. However, the genetic analysis apparatus 10 according to the present embodiment is not limited to the FASTQ format, and genetic data 20 and 30 in other formats can also be analyzed.
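As an illustration of the FASTQ format mentioned above, the following is a minimal sketch of reading four-line FASTQ records in Python. The function names `parse_fastq` and `phred_scores`, the example reads, and the Phred+33 quality offset are assumptions made for illustration; the embodiments do not specify a parser or a quality encoding.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming the Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


# Two made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(parse_fastq(example))
```

Each record keeps the nucleotide sequence next to its per-base quality string, which is the pairing the FASTQ description above refers to.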

The genetic data 20 of the normal population may be obtained from a database (DB) already known in the art, such as the National Center for Biotechnology Information (NCBI) or the Gene Expression Omnibus (GEO), or from biological samples of people recruited for analyzing the test genes of a subject. That is, the genetic data 20 may be obtained from public genomic data or public HapMap data. Meanwhile, the reference genes included in the genetic data 20 or the test genes included in the genetic data 30 may be obtained from biopsy tissue or from formalin-fixed, paraffin-embedded (FFPE) tissue.

Copy number variation (CNV) is known to mean a variation in which a gene appears to be repeated (amplified) or deleted over a relatively large region of a particular chromosome compared to a reference genome. That is, the genetic analysis apparatus 10 may determine whether there is an abnormally deleted or amplified gene in the genetic data 30 obtained from the subject compared to the genetic data 20 obtained from a normal population. Here, the gene analyzed by the genetic analysis apparatus 10 may refer to a nucleic acid such as DNA (deoxyribonucleic acid) or RNA (ribonucleic acid).

In the present embodiments, the normal population may refer to a population composed of ordinary people in whom no specific disease, such as cancer or a tumor, has been found, and the subject may refer to a patient in whom a specific disease, such as cancer or a tumor, has been found. Meanwhile, in the present embodiments, the normal population and the subject may also be animals other than humans.

The genetic analysis apparatus 10 may be implemented with at least one processor having a data processing function for executing various instructions and algorithms that analyze the genetic data 20 and 30 to identify a copy number variation (CNV) gene.

Referring to FIG. 2, the genetic analysis apparatus 10 may include a reference data generator 110, an analyzer 120, and a determiner 130. FIG. 2 shows only the components related to the present embodiment so as not to obscure its features. Therefore, the genetic analysis apparatus 10 may further include general-purpose components other than those shown in FIG. 2.

The reference data generator 110 receives the gene data 20 obtained from the normal population described above with reference to FIG. 1, and generates a reference data set using the received gene data 20.

In more detail, the reference data generator 110 performs deep sequencing of the reference genes included in the gene data 20, thereby generating a reference data set relating to the depths of reads aligned to each of the reference genes. Deep sequencing is a technique for sequencing nucleic acids, such as DNA fragments and RNA fragments, by repeatedly aligning reads to the nucleic acids. As a result of deep sequencing, data on depths, which correspond to the number of reads complementarily bound to the nucleic acids, can be obtained. In the present embodiments, the term “depth” is used interchangeably with the term “read-depth”.

The reference data generator 110 first acquires read-depths corresponding to the reference genes for each of a plurality of people (e.g., normal people) through deep sequencing of their genetic data (20 of FIG. 1). Then, the reference data generator 110 clusters the people into different groups according to the distribution of the obtained read-depths. The reference data generator 110 obtains standard depths of each of the reference genes, representing each of the groups, by normalizing the read-depths obtained for each of the reference genes within each group. As a result, the reference data set generated by the reference data generator 110 may include, for each of the groups, data representing the standard depths of each of the reference genes.

The analyzer 120 receives the gene data 30 obtained from the subject, described above with reference to FIG. 1, and performs deep sequencing on the test genes included in the gene data 30 to analyze the depths of the reads aligned to each of the test genes.

Meanwhile, the deep sequencing performed by the reference data generator 110 and the analyzer 120 may be performed on exon sites in the reference genes or the test genes. In other words, the reference data set generated by the reference data generator 110, or the depth data analyzed by the analyzer 120, as a result of deep sequencing may include only data on the depths of reads aligned to exon sites and may exclude data on the depths of reads aligned to intron sites. However, the embodiments are not limited thereto, and depth data for intron sites may also be included.

The determination unit 130 compares the depths analyzed by the analyzer 120 with the depths of the reference genes included in the reference data set generated by the reference data generator 110. Then, the determination unit 130 determines whether there is a copy number variation (CNV) gene among the test genes. In this case, the determination unit 130 may determine the presence of the copy number variation (CNV) gene by comparing the depths between the reference genes and the test genes for the same exon region.

As a criterion for the determination, the determination unit 130 may determine that a copy number variation (CNV) gene is present when, among the exon regions of the test genes, there is an exon region for which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant.

The determination unit 130 detects or identifies the gene corresponding to the exon region whose difference in depth is not statistically significant as the copy number variation (CNV) gene. Further, when it is determined that there is a copy number variation (CNV) gene among the test genes, the determination unit 130 may perform an annotation for identifying a drug (for example, an anticancer agent) corresponding to the detected copy number variation (CNV) gene.

FIG. 3 is a flowchart of a method of generating a reference data set according to an embodiment. Referring to FIG. 3, the generation of the reference data set includes steps processed in time series by the reference data generator 110 described above.

In operation 301, the reference data generator 110 acquires read-depths corresponding to reference genes for each of a plurality of people (eg, normal people).

In operation 302, the reference data generator 110 clusters people into different groups according to the obtained distribution of read-depths.

In step 303, the reference data generator 110 normalizes the read-depths acquired for each of the reference genes for each group.

In step 304, the reference data generator 110 obtains standard depths of each of the reference genes representing each of the groups.

FIG. 4 is a diagram for describing obtaining read-depths corresponding to reference genes for each of a plurality of people (e.g., normal people), according to an embodiment. The description of FIG. 4 may relate to the method performed in step 301 of FIG. 3.

Referring to FIG. 4, the reference data generator 110 may acquire read-depths by performing deep sequencing using the genetic data 401 obtained from a database (DB) 40.

Database (DB) 40 stores genetic data 401 of a plurality of people (eg, normal people) classified into normal population 400. Genetic data 401 may be obtained using various sequencing means, such as next generation sequencing (NGS), microarrays, and the like on biological samples taken from a plurality of people. On the other hand, the genetic data 401 may be data about a whole genome or data about a HapMap.

The database (DB) 40 may correspond to a database (DB) already known in the art, such as NCBI or GEO, or may be built to store the genetic data 401 of people recruited for analyzing the test genes of a subject.

The reference data generator 110 performs deep sequencing on the genes (i.e., reference genes) of individuals of the normal population 400 included in the gene data 401. For example, the reference data generator 110 may perform deep sequencing on the reference genes 411 of “person 1” 410 included in the normal population 400. As a result of deep sequencing of the reference genes 411, reads 415 are aligned to gene 1, ..., gene n (n is a natural number) included in the reference genes 411, and data on the depths (read-depths) of the reads 415 aligned to each of the reference genes 411 are obtained. Similarly, the reference data generator 110 performs deep sequencing on the reference genes 421 of another person 420 included in the normal population 400 and obtains data on the depths (read-depths) of the reads 425 aligned to each of the reference genes 421. In this way, the reference data generator 110 may acquire read-depth data by performing deep sequencing on the reference genes of each individual of the normal population 400 included in the gene data 401.

Referring to FIG. 5, deep sequencing of the reference genes, which correspond to genes of individuals in the normal population 400, may analyze the depths (read-depths) of the reads aligned to the exon sites, excluding the intron sites 505. For example, if an individual's reference genes (nucleic acid 500) comprise gene a, gene b, and gene c, the result of deep sequencing may include data on the depth of the reads 510 aligned to exon a1 and the depth of the reads aligned to exon a2 in gene a, the depth of the reads aligned to exon b1 and the depth of the reads aligned to exon b2 in gene b, and the depth of the reads aligned to exon c in gene c. However, the embodiments are not limited thereto, and the deep sequencing result may also include data on the depths of reads aligned to the intron sites 505.

Meanwhile, deep sequencing of the exon sites shown in FIG. 5 is applied not only to reference genes but also to test genes obtained from a subject. That is, the analysis unit 120 of FIG. 2 may analyze the depths of reads aligned with each of the exon sites in the test genes by performing deep sequencing on the exon sites in the test genes.
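The exon-level depth analysis described with reference to FIG. 5 can be sketched as follows. This is a minimal illustration that assumes reads have already been aligned and are given as half-open coordinate intervals on a single chromosome; the function name `exon_read_depths`, the coordinates, and the overlap rule are hypothetical and are not taken from the embodiments. In line with the description above, depth is taken as the number of reads aligned to an exon.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def exon_read_depths(exons: Dict[str, Interval],
                     reads: List[Interval]) -> Dict[str, int]:
    """Count, for each exon, the aligned reads that overlap it.

    Reads that fall entirely in intron sites overlap no exon and are
    therefore left out of the result.
    """
    depths: Dict[str, int] = {}
    for name, (exon_start, exon_end) in exons.items():
        depths[name] = sum(
            1
            for read_start, read_end in reads
            if read_start < exon_end and read_end > exon_start
        )
    return depths


# Made-up coordinates: two exons of "gene a" separated by an intron.
exons = {"exon a1": (100, 200), "exon a2": (400, 500)}
reads = [(90, 140), (120, 170), (180, 230), (250, 300), (450, 500)]
depths = exon_read_depths(exons, reads)
```

Here the read at (250, 300) lies in the intron between the two exons and contributes to neither depth.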

FIG. 6 is a diagram illustrating clustering people into different groups according to a distribution of read-depths obtained from a normal population 400 according to an embodiment. The description of FIG. 6 may relate to the method performed in step 302 of FIG. 3.

Since individuals in the normal population 400 have different genes, the depths corresponding to specific genes (or specific exons) analyzed by deep sequencing may differ from individual to individual. In addition, the distribution of depths across an individual's reference genes may differ because of chemical processing of the biological sample obtained from the individual (e.g., formalin-fixed, paraffin-embedded (FFPE) treatment), deep sequencing errors, and the like. Therefore, the reference data generator 110 clusters the individuals of the normal population 400 into different groups by grouping people who have a similar distribution of depths. Here, clustering may be performed by statistically analyzing the distribution of read-depths for each reference gene (exon) using a known trend analysis algorithm, clustering algorithm, or the like.

Referring to FIG. 6, as a result of deep sequencing of the reference genes of people belonging to group 1, those reference genes may have a similar distribution of gene and depth pairs. The same applies to the other groups. For example, the reference genes of people in group 1 may be obtained from biopsy samples of those people, and the reference genes of people in group M (M is a natural number) may be obtained from FFPE samples of those people.
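The clustering of people by their read-depth distributions can be illustrated with a small k-means sketch. The embodiments refer only to "a known trend analysis algorithm, clustering algorithm, or the like", so the choice of k-means, the farthest-point seeding, the function name `cluster_profiles`, and the depth values below are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles: List[Sequence[float]]) -> List[float]:
    count = len(profiles)
    return [sum(values) / count for values in zip(*profiles)]


def cluster_profiles(profiles: Dict[str, List[float]], k: int,
                     iterations: int = 20) -> Dict[str, int]:
    """Assign each person to one of k groups with plain k-means.

    Each profile lists read-depths for the same exons in the same order.
    Seeding is deterministic: start from the first person (by name) and
    repeatedly add the profile farthest from the chosen centroids.
    """
    names = sorted(profiles)
    centroids = [list(profiles[names[0]])]
    while len(centroids) < k:
        farthest = max(
            names,
            key=lambda n: min(squared_distance(profiles[n], c) for c in centroids),
        )
        centroids.append(list(profiles[farthest]))

    assignment: Dict[str, int] = {}
    for _ in range(iterations):
        new_assignment = {
            n: min(range(k), key=lambda i: squared_distance(profiles[n], centroids[i]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for i in range(k):
            members = [profiles[n] for n in names if assignment[n] == i]
            if members:
                centroids[i] = mean_profile(members)
    return assignment


# Made-up depths for three exons; persons 3 and 4 mimic lower-depth samples.
people = {
    "person 1": [1000, 980, 1020],
    "person 2": [990, 1010, 1000],
    "person 3": [400, 420, 390],
    "person 4": [410, 400, 405],
}
groups = cluster_profiles(people, k=2)
```

With these values, persons 1 and 2 fall into one group and persons 3 and 4 into the other, mirroring the biopsy versus FFPE example above.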

FIG. 7 is a diagram for describing standard depths of each of reference genes representing a group according to an embodiment. The description of FIG. 7 may relate to the methods performed in steps 303 and 304 of FIG. 3.

Referring to FIG. 7, when clustering is completed, the reference data generator 110 normalizes the read-depths acquired for each of the reference genes within each group and obtains the standard depths of each of the reference genes, which represent each of the groups.

For a reference gene (e.g., “Exon 1”), when the depths of the people in group x have various values, the reference data generator 110 may standardize the depth for “Exon 1” by calculating the average of those depths. Similarly, the reference data generator 110 calculates the average of the various depths for each of the other reference genes (e.g., “Exon 43”, “Exon 3543”, “Exon 5623”, etc.), so that a standard depth can be calculated for each gene (exon). As a result, the reference data generator 110 may acquire the standard depths of each of the reference genes, which represent each of the clustered groups. Meanwhile, in the present embodiment, for convenience of description, the average of the depths is taken as the representative value. However, the representative value of the depths may also be calculated using statistics other than the average.
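The averaging described above can be sketched in a few lines. This is a minimal illustration assuming the read-depths of one group are held as a mapping from person to exon to depth; the function name `standard_depths` and the depth values are hypothetical.

```python
from statistics import mean
from typing import Dict


def standard_depths(group_depths: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each exon's read-depth over all people in one group.

    `group_depths` maps person -> exon -> read-depth. An exon missing for
    some person is averaged over the people who do have it.
    """
    exon_names = sorted({exon for depths in group_depths.values() for exon in depths})
    return {
        exon: mean([depths[exon] for depths in group_depths.values() if exon in depths])
        for exon in exon_names
    }


# Made-up depths for three people in "group x".
group_x = {
    "person 1": {"Exon 1": 900, "Exon 43": 480},
    "person 2": {"Exon 1": 1000, "Exon 43": 500},
    "person 3": {"Exon 1": 1100, "Exon 43": 520},
}
standard = standard_depths(group_x)
```

Swapping `mean` for `statistics.median` would give a different representative value, in line with the remark that statistics other than the average may be used.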

Referring to FIG. 8, the analysis unit 120 of FIG. 2 performs deep sequencing of the test genes based on the gene data 30 of the subject 800 to analyze the depths of the reads aligned to each of the test genes.

The genetic data 30 of the subject 800 may be obtained through next generation sequencing (NGS) on a biopsy sample 810 or an FFPE sample 825 taken from some tissue of the subject 800. Here, the FFPE sample 825 is a sample by FFPE treatment 820 for some tissue of the subject 800.

The analysis unit 120 of FIG. 2 analyzes the depths of the reads aligned to the test genes of the subject 800 according to the deep sequencing methods described above with reference to FIGS. 4 and 5, whereby depth data 830 of the test genes may be obtained.

FIG. 9 is a flowchart of a method of determining whether a copy number variation (CNV) gene is present according to an embodiment. Referring to FIG. 9, the determination of the copy number variation (CNV) gene includes steps that are processed in time series by the determination unit 130 described above.

In operation 901, the determination unit 130 determines, among the groups clustered by the reference data generator 110, the group having the smallest statistical difference between the distribution of the depths analyzed from the test genes and the distribution of the standard depths. That is, the determination unit 130 determines at least one group among the clustered groups (e.g., the groups of FIG. 6) having a statistical tendency similar to the distribution of the depths analyzed from the test genes. In this case, the determination unit 130 may determine the group having the smallest standard deviation between the distribution of the depths analyzed from the test genes and the distribution of the standard depths. However, the present embodiment is not limited thereto, and statistics other than the standard deviation may be used to select a group having a tendency similar to the distribution of the depths analyzed from the test genes.
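One possible reading of operation 901 is sketched below. The text does not define exactly how the "standard deviation between" two distributions is computed, so this sketch assumes it is the population standard deviation of the per-exon differences between the test depths and a group's standard depths. The function name `closest_group`, the group labels, and the depth values are hypothetical.

```python
from statistics import pstdev
from typing import Dict


def closest_group(test_depths: Dict[str, float],
                  reference_set: Dict[str, Dict[str, float]]) -> str:
    """Return the group whose standard depths track the test depths best.

    Assumed criterion: the smallest population standard deviation of the
    per-exon differences, computed over the exons that the test genes and
    the group have in common (at least one shared exon is required).
    """
    def spread(standard: Dict[str, float]) -> float:
        shared = sorted(set(test_depths) & set(standard))
        return pstdev([test_depths[exon] - standard[exon] for exon in shared])

    return min(reference_set, key=lambda group: spread(reference_set[group]))


# Made-up reference data set with two clustered groups.
reference_set = {
    "group 1": {"Exon 1": 1000, "Exon 43": 500, "Exon 3543": 800},
    "group M": {"Exon 1": 400, "Exon 43": 450, "Exon 3543": 300},
}
test_depths = {"Exon 1": 1050, "Exon 43": 520, "Exon 3543": 790}
selected = closest_group(test_depths, reference_set)
```

Because this criterion looks at the spread of the differences, a uniform offset between a sample and a group does not count against the group; a different statistic, as the text allows, would change that behavior.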

In operation 902, the determination unit 130 compares the depths analyzed from the test genes with the standard depths corresponding to the determined group. More specifically, the determination unit 130 compares the depth of each of the test genes (exons) with the depth of the corresponding reference gene (corresponding exon). For example, assuming that “Exon 1” and “Exon 43” exist in both the test genes and the reference genes, the determination unit 130 compares the depth of “Exon 1” analyzed by the analysis unit 120 with the standard depth of “Exon 1”, and compares the depth of “Exon 43” analyzed by the analysis unit 120 with the standard depth of “Exon 43”. Here, “Exon 1” and “Exon 43” are arbitrary names used to indicate that they are different exons.

In operation 903, the determination unit 130 determines whether a copy number variation (CNV) gene is present as a result of the comparison. At this time, the determination unit 130 may determine that the copy number variation (CNV) gene is present if, among the exon regions of the test genes, there is an exon region for which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant.

More specifically, assuming that the threshold for determining that a difference in depth is not significant is set to 4 times the standard depth, the determination unit 130 may determine that the copy number variation (CNV) gene is present when the depth of any exon analyzed by the analysis unit 120 exceeds 4 times the standard depth. However, the threshold is not limited thereto and may be variously changed. For example, when the standard depth of “Exon 1” is 1000, the threshold for determining significance may be 4000. Therefore, when the depth of “Exon 1” of the subject analyzed by the analysis unit 120 is 5000, the determination unit 130 may determine that the gene of “Exon 1” is a copy number variation (CNV) gene.
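The worked example above (standard depth 1000, threshold 4000, analyzed depth 5000) can be expressed as a short sketch. The function names `is_cnv_exon` and `find_cnv_exons` and the depth value for “Exon 43” are hypothetical; the 4-fold threshold is the example value from the text and, as stated there, may be changed.

```python
from typing import Dict, List


def is_cnv_exon(test_depth: float, standard_depth: float,
                fold_threshold: float = 4.0) -> bool:
    """True when the analyzed depth exceeds fold_threshold x the standard depth."""
    return test_depth > fold_threshold * standard_depth


def find_cnv_exons(test_depths: Dict[str, float],
                   standard_depths: Dict[str, float],
                   fold_threshold: float = 4.0) -> List[str]:
    """List the exons, present in both inputs, whose depth exceeds the threshold."""
    return [
        exon
        for exon in sorted(test_depths)
        if exon in standard_depths
        and is_cnv_exon(test_depths[exon], standard_depths[exon], fold_threshold)
    ]


standard = {"Exon 1": 1000, "Exon 43": 500}
subject = {"Exon 1": 5000, "Exon 43": 520}
cnv_exons = find_cnv_exons(subject, standard)
```

With these values only “Exon 1” is flagged, since 5000 exceeds the threshold of 4000. The text describes only this upper threshold; detecting a deleted region would need a separate lower bound, which is not specified here.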

Referring to FIG. 10, the depths indicated by solid lines correspond to reference genes (exons), and the depths indicated by dashed lines correspond to test genes (exons).

The determination unit 130 compares the depths of the exons analyzed by the analysis unit 120 with the standard depths, as described above with reference to the preceding drawings. When, among the exon sites of the test genes, there is an exon region (“exon a”) for which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant, the determination unit 130 identifies the test gene of “exon a” as a copy number variation (CNV) gene and may thereby determine that the copy number variation (CNV) gene is present.

Meanwhile, when it is determined that the copy number variation (CNV) gene is present among the test genes, the determination unit 130 may perform an annotation for identifying a drug (e.g., an anticancer agent) corresponding to the copy number variation (CNV) gene.
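The annotation step can be pictured as a lookup from detected CNV genes to candidate drugs. The embodiments do not describe the annotation source or its format, so the table below is a purely hypothetical placeholder: the gene identifiers and drug names are invented and do not refer to real genes or drugs.

```python
from typing import Dict, Iterable, List

# Hypothetical placeholder table; identifiers are invented for illustration.
DRUG_ANNOTATIONS: Dict[str, List[str]] = {
    "GENE_A": ["drug_x"],
    "GENE_B": ["drug_y", "drug_z"],
}


def annotate_cnv_genes(cnv_genes: Iterable[str],
                       annotations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Attach the known drugs to each detected CNV gene.

    A gene without an entry in the table is kept with an empty list so
    that the absence of an annotation stays visible.
    """
    return {gene: list(annotations.get(gene, [])) for gene in cnv_genes}


annotated = annotate_cnv_genes(["GENE_B", "GENE_C"], DRUG_ANNOTATIONS)
```

In practice the table would come from a curated source of gene and drug associations, which the embodiments leave open.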

FIG. 11 is a flowchart of a method of analyzing a gene, according to an embodiment. Referring to FIG. 11, the gene analysis method includes steps that are processed in time series by the gene analysis apparatus 10 described with reference to the foregoing figures. Therefore, even where omitted below, the contents described above also apply to the gene analysis method of FIG. 11.

In step 1101, the reference data generator 110 performs deep sequencing on the reference genes to generate a reference data set about depths of reads aligned with each of the reference genes.

In operation 1102, the analyzer 120 analyzes the depths of the reads aligned with each of the test genes by performing deep sequencing on the test genes.

In operation 1103, the determination unit 130 compares the analyzed depths with the depths of the reference genes included in the reference data set to determine whether a copy number variation (CNV) gene exists among the test genes.
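Steps 1101 to 1103 can be tied together in a compact end-to-end sketch. For brevity it assumes a single reference group (the clustering step is omitted), uses the mean as the standard depth, and applies the example 4-fold threshold from the description; the function names and all depth values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def generate_reference(population: List[Dict[str, float]]) -> Dict[str, float]:
    """Step 1101: average read-depths per exon over the normal population."""
    shared = set.intersection(*(set(depths) for depths in population))
    return {exon: mean([depths[exon] for depths in population])
            for exon in sorted(shared)}


def analyze_test(test_depths: Dict[str, float],
                 reference: Dict[str, float]) -> Dict[str, float]:
    """Step 1102: keep the test depths for exons that have a reference depth."""
    return {exon: test_depths[exon] for exon in reference if exon in test_depths}


def determine_cnv(analyzed: Dict[str, float], reference: Dict[str, float],
                  fold_threshold: float = 4.0) -> List[str]:
    """Step 1103: report exons whose depth exceeds the fold threshold."""
    return [exon for exon in analyzed
            if analyzed[exon] > fold_threshold * reference[exon]]


# Made-up read-depths for two normal people and one subject.
population = [
    {"Exon 1": 900, "Exon 43": 480},
    {"Exon 1": 1100, "Exon 43": 520},
]
subject = {"Exon 1": 5000, "Exon 43": 510}

reference = generate_reference(population)
analyzed = analyze_test(subject, reference)
cnv_exons = determine_cnv(analyzed, reference)
```

Each function corresponds to one of the three units of FIG. 2, so a fuller version could insert the clustering and group selection between the first and last steps.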

Referring to FIG. 12, the computing device 1 includes a genetic analysis device (processor) 10, a data interface 11, and a memory 12. FIG. 12 shows only the components related to the present embodiment so as not to obscure its features. Therefore, the computing device 1 may further include general-purpose components other than those shown in FIG. 12.

The data interface 11 receives the genetic data 20 of the normal population and the genetic data 30 of the subject described above with reference to FIG. 1. That is, the data interface 11 may be implemented as wired/wireless network interface hardware through which the computing device 1 communicates with other external devices. The data interface 11 transmits the received genetic data 20 and 30 to the genetic analysis device (processor) 10.

The data interface 11 may receive the genetic data 20 of the normal population from the database DB (40 in FIG. 4). The data interface 11 may receive the genetic data 30 of the subject from an external next-generation sequencing apparatus, a microarray, or the like that sequences the test genes of the subject.

The memory 12 is hardware for storing data to be processed in the computing device 1 and the processed results, and may be implemented as memory chips such as random access memory (RAM) or read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and the like. That is, the memory 12 may store the genetic data 20 and 30 received by the data interface 11, and may also store the reference data set processed by the genetic analysis device (processor) 10, the deep sequencing data for the test genes, and data on the identified copy number variation (CNV) genes.

The genetic analysis device (processor) 10 is a module implemented with one or more processing units, and may be implemented as a combination of a microprocessor having an array of logic gates and a memory module storing a program executable on the microprocessor. The genetic analysis device (processor) 10 may also be implemented in the form of a module of an application program. The genetic analysis device (processor) 10 is a hardware device that performs the gene analysis described above with reference to the figures.

The information about the copy number variation (CNV) gene identified by the genetic analysis device (processor) 10 may be transmitted via the data interface 11 to another external device such as a display device or another computing device, or to a public database (DB) server on an external network such as the Internet.

According to the embodiments described above, even if normal blood of a subject (for example, a cancer patient) cannot be obtained, a copy number variation (CNV) gene can be detected using only a biopsy sample or an FFPE sample of the cancer tissue of the subject. Furthermore, although the genes of the cancer tissue (test genes) obtained from the subject may be slightly damaged chemically by the FFPE treatment, the presence of a copy number variation (CNV) gene is determined with reference to reference genes obtained under similar conditions (FFPE treatment), so the copy number variation (CNV) gene can be detected accurately.

The device according to the embodiments may include a processor, a memory for storing and executing program data, a persistent storage such as a disk drive, a communication port for communicating with an external device, and a user interface device such as a touch panel, a key, or a button. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on the processor. The computer-readable recording medium may be a magnetic storage medium (e.g., read-only memory (ROM), random-access memory (RAM), floppy disk, hard disk, etc.) or an optical reading medium (e.g., CD-ROM, DVD (Digital Versatile Disc)). The computer-readable recording medium can be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. The medium is readable by the computer, can be stored in the memory, and can be executed by the processor.

This embodiment can be represented by functional block configurations and various processing steps. Such functional blocks may be implemented by any number of hardware and/or software configurations that perform particular functions. For example, an embodiment may employ integrated circuit configurations such as memory, processing, logic, and look-up tables, which may execute various functions under the control of one or more microprocessors or other control devices. Just as its components may be implemented as software programming or software elements, the present embodiment may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. The functional aspects may be implemented as an algorithm running on one or more processors. In addition, the present embodiment may employ the prior art for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" are used broadly and are not limited to mechanical and physical configurations. These terms may include the meaning of a series of software routines operating in conjunction with a processor or the like.

Specific implementations described in this embodiment are examples, and do not limit the technical scope in any way. For brevity of description, descriptions of conventional electronic configurations, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connections or connection members of the lines between the components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device, they may be represented as various replaceable or additional functional connections, physical connections, or circuit connections.

In the present specification (particularly in the claims), the use of the term "the" and similar referring terms may correspond to both the singular and the plural. In addition, when a range is described, it includes the individual values belonging to that range (unless there is a description to the contrary), which is the same as stating each individual value constituting the range in the detailed description. Finally, unless an order is explicitly stated for the steps constituting the method, or there is a description to the contrary, the steps may be performed in any suitable order. The method is not necessarily limited to the order in which the steps are described.

So far, the present invention has been described with a focus on its preferred embodiments. Those skilled in the art will appreciate that the present invention can be implemented in a modified form without departing from its essential features. Therefore, the disclosed embodiments should be considered in a descriptive sense only and not for purposes of limitation. The scope of the present invention is shown in the claims rather than in the foregoing description, and all differences within that scope will be construed as being included in the present invention.

Claims

1. A method of analyzing a gene, the method comprising:

generating a reference data set relating to depths of reads aligned with each of reference genes by performing deep sequencing on the reference genes;

analyzing depths of reads aligned with each of test genes by performing the deep sequencing on the test genes; and

comparing the analyzed depths with the depths of the reference genes included in the reference data set to determine whether a copy number variation (CNV) gene is present among the test genes.

2. The method of claim 1, wherein the analyzing comprises analyzing the depths of the reads aligned with exon regions of the test genes.

3. The method of claim 2, wherein the determining comprises determining whether the copy number variation (CNV) gene is present by comparing the depths between the reference genes and the test genes for the same exon region.

4. The method of claim 1, wherein the determining comprises determining that the copy number variation (CNV) gene is present when, among the exon regions of the test genes, there is an exon region in which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant.

5. The method of claim 1, wherein the generating comprises:

acquiring, through the deep sequencing of genetic data of a plurality of people, read-depths corresponding to the reference genes for each of the people;

clustering the people into different groups according to the distribution of the acquired read-depths; and

normalizing the read-depths acquired for each of the reference genes per group, thereby obtaining standard depths of each of the reference genes representing each of the groups,

wherein the reference data set is data representing, for each of the groups, the standard depths of each of the reference genes.

6. The method of claim 5, wherein the determining comprises:

determining, among the groups, a group having the smallest statistical difference between the distribution of the analyzed depths and the distribution of the standard depths; and

determining whether the copy number variation (CNV) gene is present by comparing the analyzed depths with the standard depths corresponding to the determined group.

7. The method of claim 5, wherein the genetic data of the people is obtained from public genomic data or public map data.

8. The method of claim 1, wherein the reference genes or the test genes are obtained from biopsy tissue or formalin-fixed, paraffin-embedded (FFPE) tissue.

9. The method of claim 1, further comprising, when it is determined that the copy number variation (CNV) gene is present among the test genes, performing annotation for identifying a drug corresponding to the copy number variation (CNV) gene.

10. A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 9.

11. A device for analyzing a gene, the device comprising:

a reference data generator which generates a reference data set relating to depths of reads aligned with each of reference genes by performing deep sequencing on the reference genes;

an analysis unit which analyzes depths of reads aligned with each of test genes by performing the deep sequencing on the test genes; and

a determination unit which determines whether a copy number variation (CNV) gene is present among the test genes by comparing the analyzed depths with the depths of the reference genes included in the reference data set.

12. The device of claim 11, wherein the analysis unit analyzes the depths of the reads aligned with exon regions of the test genes.

13. The device of claim 12, wherein the determination unit determines whether the copy number variation (CNV) gene is present by comparing the depths between the reference genes and the test genes for the same exon region.

14. The device of claim 11, wherein the determination unit determines that the copy number variation (CNV) gene is present when, among the exon regions of the test genes, there is an exon region in which the difference in depth between the corresponding exon regions of the reference genes and the test genes is not statistically significant.

15. The device of claim 11, wherein the reference data generator:

acquires, through the deep sequencing of genetic data of a plurality of people, read-depths corresponding to the reference genes for each of the people;

clusters the people into different groups according to the distribution of the acquired read-depths; and

normalizes the read-depths acquired for each of the reference genes per group, thereby obtaining standard depths of each of the reference genes representing each of the groups,

wherein the reference data set is data representing, for each of the groups, the standard depths of each of the reference genes.

16. The device of claim 15, wherein the determination unit:

determines, among the groups, a group having the smallest statistical difference between the distribution of the analyzed depths and the distribution of the standard depths; and

determines whether the copy number variation (CNV) gene is present by comparing the analyzed depths with the standard depths corresponding to the determined group.

17. The device of claim 15, wherein the reference data generator obtains the genetic data of the people from public genomic data or public map data.

18. The device of claim 11, wherein the reference genes or the test genes are obtained from biopsy tissue or formalin-fixed, paraffin-embedded (FFPE) tissue.

19. The device of claim 11, wherein, when it is determined that the copy number variation (CNV) gene is present among the test genes, the determination unit performs annotation for identifying a drug corresponding to the copy number variation (CNV) gene.