WO2016208826A1

WO2016208826A1 - Method and device for analyzing gene

Info

Publication number: WO2016208826A1
Application number: PCT/KR2015/012922
Authority: WO
Inventors: 박웅양; 신현태; 김나영
Original assignee: 사회복지법인 삼성생명공익재단
Priority date: 2015-06-24
Filing date: 2015-11-30
Publication date: 2016-12-29

Abstract

A method and a device for analyzing a gene obtain data related to reads from next-generation sequencing (NGS) data of a sample to be tested, extract candidate gene pairs having a possibility of chromosomal translocation from the sample to be tested by using the reads, and identify translocation genes among the candidate gene pairs.

Description

Methods and apparatus for analyzing genes

A method and apparatus for analyzing genes, and more particularly, a method and apparatus for analyzing data regarding translocation genes.

A genome is all the genetic information of a living organism. For sequencing an individual's genome, various technologies such as DNA chips, next-generation sequencing, and next-next-generation sequencing have been developed. Analysis of genetic information such as nucleic acid sequences and proteins is widely used to find genes that cause diseases such as diabetes and cancer, or to identify correlations between genetic diversity and the expression characteristics of individuals. In particular, the genetic data collected from an individual is important in identifying the genetic characteristics of the individual associated with different symptoms or disease progression. Therefore, genetic data such as the nucleic acid sequences and proteins of an individual are essential for identifying current and future disease-related information, so as to prevent disease or to select an optimal treatment method at an early stage of disease. Recently, with the development of sequencing technology, many attempts have been made to discover various kinds of structural variations, but the generation of significant numbers of false positives or false negatives remains a major challenge in bioinformatics.

An object is to provide a method and apparatus for analyzing a gene. The technical problem to be solved by the present embodiment is not limited to the technical problem described above, and further technical problems can be inferred from the following embodiments.

According to one aspect, a method of analyzing a gene includes: obtaining data about split reads and discordantly aligned paired-end (PE) reads from next-generation sequencing (NGS) data of a test sample; extracting first candidate gene pairs that are likely to be involved in a translocation within a chromosome of the test sample by using the split reads and the PE reads; and identifying a translocation gene among the first candidate gene pairs based on break points indicated by the split reads and a fusion direction of the first candidate gene pairs.

The identifying may include extracting, from the extracted first candidate gene pairs, second candidate gene pairs including a gene to which a plurality of split reads having break points belonging to the same coverage are aligned, wherein the translocation gene is identified from the extracted second candidate gene pairs.

In addition, for a gene included in the extracted second candidate gene pairs, the number of split reads having break points belonging to the same coverage may be greater than or equal to a predetermined threshold.

In addition, the identifying may include extracting, from the extracted second candidate gene pairs, third candidate gene pairs in which the fusion direction between the different genes is from a 5' end to a 3' end or from a 3' end to a 5' end, wherein the translocation gene is identified from the extracted third candidate gene pairs.

In addition, the NGS data includes data in a Sequence Alignment/Map (SAM) format or a BAM format, the binary version of SAM.

In addition, the obtaining may include obtaining FLAG data and Compact Idiosyncratic Gapped Alignment Report (CIGAR) strings for the split reads and the PE reads from the data in the BAM format or the SAM format.

In addition, the NGS data is generated by targeted sequencing to identify base sequences of target genes in the test sample.

In addition, the test sample is a biopsy sample or formalin-fixed, paraffin-embedded (FFPE) sample.

According to another aspect, there is provided a computer-readable recording medium having recorded thereon a program for executing the method on a computer.

According to another aspect, an apparatus for analyzing a gene includes: a read analyzer that obtains data about split reads and discordantly aligned paired-end (PE) reads from next-generation sequencing (NGS) data of a test sample; and a translocation identifier that extracts first candidate gene pairs that are likely to be involved in a translocation within a chromosome of the test sample by using the split reads and the PE reads, and identifies a translocation gene among the first candidate gene pairs based on break points indicated by the split reads and a fusion direction of the first candidate gene pairs.

The translocation identifier extracts, from the extracted first candidate gene pairs, second candidate gene pairs including a gene to which a plurality of split reads having break points belonging to the same coverage are aligned, and the translocation gene is identified from the extracted second candidate gene pairs.

The translocation identifier may further extract, from the extracted second candidate gene pairs, third candidate gene pairs in which the fusion direction between the different genes is from a 5' end to a 3' end or from a 3' end to a 5' end, and the translocation gene is identified from the extracted third candidate gene pairs.

In addition, the read analyzer obtains FLAG data and Compact Idiosyncratic Gapped Alignment Report (CIGAR) strings for the split reads and the PE reads from the data in the BAM format or the SAM format.

As described above, it is possible to analyze more accurately whether a translocation gene exists among the test genes extracted from the test sample of the subject.

FIG. 1 is a view for explaining a gene analysis apparatus according to an embodiment.

FIG. 2 is a block diagram illustrating hardware configurations of a gene analysis apparatus according to an exemplary embodiment.

FIG. 3 is a diagram for describing PE reads according to an exemplary embodiment.

FIG. 4 is a diagram illustrating discordant PE reads according to an exemplary embodiment.

FIG. 5 is a diagram for describing split reads according to an exemplary embodiment.

FIG. 6 is a diagram for describing an IGV (Integrative Genomics Viewer) screenshot comparing reads obtained from a biopsy sample of a subject according to an embodiment with reference gene data.

FIG. 7 is a diagram illustrating an IGV screenshot comparing reads obtained from an FFPE sample of a subject according to an embodiment with reference gene data.

FIG. 8 is a flowchart of a method in which the translocation identifier identifies a translocation gene by extracting candidate gene pairs, according to an exemplary embodiment.

FIG. 9 is a diagram for describing extraction of second candidate gene pairs using break points of split reads, according to an exemplary embodiment.

FIG. 10 is a diagram for explaining extraction of third candidate gene pairs using the appropriateness of a fusion direction, according to one embodiment.

FIG. 11 is a diagram illustrating a result of identifying translocation genes of EML4 (echinoderm microtubule-associated protein-like 4) and ALK (anaplastic lymphoma kinase) according to one embodiment.

FIG. 12 is a flowchart of a method of analyzing a gene, according to an embodiment.

FIG. 13 is a block diagram illustrating hardware configurations of a computing device according to an embodiment.

The terms used in the present embodiments are general terms that are currently in wide use, selected in consideration of the functions of the present embodiments, but they may vary depending on the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In addition, in certain cases a term has been arbitrarily selected, in which case its meaning is described in detail in the description of the corresponding embodiment. Therefore, the terms used in the present embodiments should be defined based on the meanings of the terms and the contents throughout the embodiments, rather than simply on the names of the terms.

In the descriptions of the embodiments, when a part is said to be connected to another part, this includes not only the case where the part is directly connected, but also the case where it is electrically connected with other components in between. In addition, when a part is said to include a certain component, this means that it may further include other components, rather than excluding other components, unless specifically stated otherwise. In addition, the terms "... unit" and "... module" described in the embodiments mean a unit for processing at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Terms such as "consisting of" or "comprising" as used in the present embodiments should not be construed as necessarily including all of the various components or steps described in the specification; it is to be understood that some of the components or steps may not be included, or that additional components or steps may further be included.

The description of the following embodiments should not be construed as limiting the scope of rights, and it should be construed as belonging to the scope of the embodiments as can be easily inferred by those skilled in the art. Hereinafter, only exemplary embodiments will be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the genetic analysis apparatus 10 may identify translocation genes in the test genes of a test sample by using reference gene data 20 of a normal population and test gene data 30 obtained from a test biological sample of a subject.

The test gene data 30 received by the genetic analysis apparatus 10 may be NGS data obtained by next-generation sequencing (NGS), and the NGS data may include genetic data in a Sequence Alignment/Map (SAM) format or a BAM format, the binary version of SAM. The BAM format and the SAM format are commonly used to describe data relating to short reads. A file in the BAM format or the SAM format may include text data about the start point of a read, the direction of the read, the mapping quality, a FLAG indicating the alignment order, a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string, and the like. Here, the FLAG may be an identifier for identifying whether an alignment pair is a primary alignment-primary alignment pair, a primary alignment-secondary alignment pair, a secondary alignment-primary alignment pair, or a secondary alignment-secondary alignment pair. By forming these various alignment pairs, various supporting reads can be obtained.
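As a minimal sketch of how the fields named above might be read from a SAM-format text line, the following Python example parses the start position, mapping quality, FLAG, and CIGAR string. The FLAG bit values follow the SAM specification; the `SamRecord` class, the `parse_sam_line` helper, and the example line are hypothetical names and data introduced only for illustration and are not taken from the embodiments.

```python
from dataclasses import dataclass

# FLAG bit values as defined in the SAM specification.
FLAG_REVERSE = 0x10         # read aligned to the reverse strand
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position (start point of the read)
    mapq: int    # mapping quality
    cigar: str   # CIGAR string

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & FLAG_REVERSE)

    @property
    def is_primary(self) -> bool:
        return not (self.flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY))


def parse_sam_line(line: str) -> SamRecord:
    """Parse the first six mandatory, tab-separated fields of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5])


# Hypothetical alignment line, invented for illustration only.
example_line = "read_001\t97\tchr2\t1000\t60\t75M25S\tchr3\t20500\t0\t*\t*"
record = parse_sam_line(example_line)
print(record.rname, record.pos, record.mapq, record.cigar,
      record.is_reverse, record.is_primary)
```

A BAM file would first have to be decoded into this text form by a suitable tool; that step is outside the scope of this sketch.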

The reference gene data 20 may be obtained from a database already known in the art, such as the National Center for Biotechnology Information (NCBI) or the Gene Expression Omnibus (GEO), or may be obtained from biological samples of people recruited for the analysis of the subject's genes.

Meanwhile, the reference genes included in the reference gene data 20 or the test genes included in the test gene data 30 may be obtained from biopsy tissue, formalin-fixed tissue, or paraffin-embedded tissue.

Translocation refers to a phenomenon in which a break occurs in a portion of a chromosome and the broken fragment is joined to another portion of the same chromosome or to another chromosome; it is thus a structural variation of the chromosome.

The genetic analysis apparatus 10 may determine whether the translocation gene exists in the test gene data 30 obtained from the test sample of the subject compared with the reference gene data 20 obtained from the normal population. Here, the gene analyzed by the genetic analysis device 10 may refer to a nucleic acid such as DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and the like.

In the present embodiments, the normal population may refer to a population composed of ordinary people in whom a specific disease, such as cancer or a tumor, has not been found, and the subject may refer to a patient in whom a specific disease such as cancer or a tumor has been found. Meanwhile, in the present embodiments, the normal population and the subject may also be animals other than humans.

The genetic analysis apparatus 10 may be implemented with at least one processor having a data processing function for analyzing the various genetic data 20 and 30 to identify translocation genes and for performing various algorithms.

Referring to FIG. 2, the genetic analysis apparatus 10 may include a read analyzer 110 and a translocation identifier 120. FIG. 2 shows only the components of the gene analysis apparatus 10 that are related to the present embodiment, so as not to obscure the features of the present embodiment; the gene analysis apparatus 10 may further include other general-purpose components in addition to the components shown in FIG. 2.

The read analyzer 110 obtains data about split reads and discordantly aligned paired-end (PE) reads from the next-generation sequencing (NGS) data of the test sample, which is included in the test gene data 30 described above with reference to FIG. 1.

The NGS data included in the test gene data 30 is data in the BAM format or the SAM format, and the read analyzer 110 may obtain, for each of the split reads and the PE reads, text data regarding the start point of the read, the read direction, the mapping quality, the FLAG indicating the alignment order, and the Compact Idiosyncratic Gapped Alignment Report (CIGAR) string from the data in the BAM format or the SAM format.

Generally, as a sequencing technique for analyzing the sequence of a test gene from a test sample, NGS techniques such as whole genome sequencing (WGS), whole exome sequencing (WES), and the like are known. However, the NGS data according to the present embodiment may be generated by targeted sequencing for identifying nucleotide sequences of some target genes, but not the entire genome in a test sample.

Meanwhile, the test sample may be a biopsy sample obtained from the subject or a formalin-fixed, paraffin-embedded (FFPE) sample.

The translocation identifier 120 extracts the first candidate gene pairs that are likely to be involved in a translocation within the chromosome of the test sample by using the split reads and the discordantly aligned (mismatched) PE reads. A gene sequenced as split reads, or a gene sequenced as discordant PE reads, may be considered a candidate with a high probability that part of its nucleotide sequence differs from the reference gene (the gene of a normal person).

The translocation identifier 120 identifies the translocation gene among the first candidate gene pairs based on the break points indicated by the split reads and the fusion direction of the first candidate gene pairs.

In more detail, the translocation identifier 120 may extract, from the first candidate gene pairs, second candidate gene pairs including genes to which a plurality of split reads having break points belonging to the same coverage are aligned. That is, the translocation gene can be identified from second candidate gene pairs that have been narrowed down from the first candidate gene pairs. Here, for the genes included in the second candidate gene pairs, the number of split reads having break points belonging to the same coverage may be greater than or equal to a predetermined threshold. Coverage refers to an error range within which break points can be considered the same break point in consideration of sequencing errors. For example, when the predetermined threshold is three, genes having three or more split reads with break points within the same coverage may be included in the second candidate gene pairs. However, the predetermined threshold may be changed in various ways.
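The threshold test described above can be sketched as follows, assuming that "coverage" is modeled as a maximum distance in base pairs between break points that are treated as the same break point. The function names, the default coverage of 5 bp, and the example data are hypothetical; only the example threshold of three comes from the description.

```python
def has_supported_break_point(break_points, coverage, threshold=3):
    """Return True if at least `threshold` break points lie within `coverage` bp of each other."""
    pts = sorted(break_points)
    left = 0
    for right in range(len(pts)):
        # Shrink the window until it spans at most `coverage` bp.
        while pts[right] - pts[left] > coverage:
            left += 1
        if right - left + 1 >= threshold:
            return True
    return False


def second_candidate_pairs(first_pairs, break_points_by_gene, coverage=5, threshold=3):
    """Keep gene pairs in which at least one gene has a sufficiently supported break point."""
    return [
        pair for pair in first_pairs
        if any(
            has_supported_break_point(break_points_by_gene.get(gene, []), coverage, threshold)
            for gene in pair
        )
    ]


# Hypothetical first candidate gene pairs and split-read break points.
first_pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]
break_points_by_gene = {
    "GENE_A": [1074, 1074, 1076, 2000],  # three break points within 5 bp
    "GENE_C": [100, 500, 900],           # scattered break points
}
second_pairs = second_candidate_pairs(first_pairs, break_points_by_gene)
print(second_pairs)
```

A sliding window over the sorted break points is only one possible way to group them; the description does not prescribe a particular grouping procedure.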

Further, the translocation identifier 120 may extract, from the second candidate gene pairs, third candidate gene pairs whose fusion direction is from a 5' end to a 3' end or from a 3' end to a 5' end. That is, the translocation gene can be identified from third candidate gene pairs that have been narrowed down further from the second candidate gene pairs. For example, a fusion gene in which the 3' end of gene A and the 3' end of gene B are joined may be a meaningless fusion gene that has no biological expression function. Therefore, the translocation identifier 120 extracts the third candidate gene pairs by filtering out, from the second candidate gene pairs, gene pairs having inappropriate fusion directions.
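The fusion-direction filter can be illustrated with a short sketch, assuming that each candidate already records which end of each gene lies at the junction. How those ends are derived from the reads is not covered here, and the tuple layout and function names are hypothetical.

```python
def is_proper_fusion(end_a: str, end_b: str) -> bool:
    """A junction is treated as appropriate when a 3' end is joined to a 5' end."""
    return {end_a, end_b} == {"5'", "3'"}


def third_candidate_pairs(second_pairs):
    """Filter (gene_a, end_a, gene_b, end_b) tuples by fusion direction."""
    return [
        (gene_a, gene_b)
        for gene_a, end_a, gene_b, end_b in second_pairs
        if is_proper_fusion(end_a, end_b)
    ]


# Hypothetical second candidate gene pairs with the end of each gene at the junction.
second_pairs = [
    ("GENE_A", "3'", "GENE_B", "5'"),  # 3' end joined to 5' end: kept
    ("GENE_C", "3'", "GENE_D", "3'"),  # 3' end joined to 3' end: filtered out
]
print(third_candidate_pairs(second_pairs))
```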

The translocation identification unit 120 may finally determine that the gene pair included in the third candidate gene pairs is a translocation gene.

FIG. 3 is a diagram for describing PE reads according to an exemplary embodiment.

Referring to FIG. 3, paired-end (PE) sequencing means sequencing a test gene of a test sample from both ends. Since targeted sequencing is performed in the present embodiment as described above, it may be assumed that the nucleic acid (DNA, RNA, etc.) 300 of the test sample to be sequenced is 500 bp (base pairs) in size. If the read size is set to 100 bp, PE reads 310 and 320 may be generated by sequencing from both ends of the nucleic acid 300. Since the read size is smaller than the size of the nucleic acid 300, separate reads may not be generated for the remaining portion of the nucleic acid 300. Meanwhile, the PE sequencing according to the present embodiment may sequence not only the exon 305 but also introns to obtain the PE reads 310 and 320. The reason for using PE sequencing is described in more detail with reference to FIG. 4.
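The arithmetic of the 500 bp example can be made concrete with a small sketch. The function name and the 0-based, half-open interval convention are assumptions made for illustration.

```python
def paired_end_layout(fragment_len: int, read_len: int):
    """Return the intervals covered by the two PE reads and the length left unsequenced.

    Intervals are 0-based and half-open, measured along the fragment.
    """
    if read_len > fragment_len:
        raise ValueError("read length cannot exceed fragment length")
    read1 = (0, read_len)                            # read from one end
    read2 = (fragment_len - read_len, fragment_len)  # read from the other end
    unsequenced = max(0, fragment_len - 2 * read_len)
    return read1, read2, unsequenced


read1, read2, unsequenced = paired_end_layout(fragment_len=500, read_len=100)
print(read1, read2, unsequenced)  # (0, 100) (400, 500) 300
```

For the 500 bp nucleic acid and 100 bp reads in the description, the middle 300 bp is not covered by either read.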

The translocation gene may be a combination of different genes within the same chromosome or of genes on different chromosomes. As a result, the gene sequence at the 5' end and the gene sequence at the 3' end of the translocation gene are derived from different genes. Thus, the nucleotide sequences of the PE reads 410 and 420 will differ from the nucleotide sequence of the reference gene of a normal person at the corresponding sequencing positions.

Referring to FIG. 4, when the PE read 410 is mapped to a gene 401 present on chromosome 2 and the PE read 420 is mapped to a gene 402 present on chromosome 3, based on the nucleotide sequence of the reference gene of a normal person, it can be inferred that the PE reads 410 and 420 originated from a fusion gene (translocation gene) present in the test sample. Such reads may be defined as discordant (mismatched) PE reads 410 and 420. The gene pair 401 and 402 to which such discordant PE reads 410 and 420 are mapped may be included in the first candidate gene pairs described above.
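A minimal sketch of how a read pair might be classified as discordant is shown below. The case of mates on different chromosomes follows the FIG. 4 example; the additional distance check on the same chromosome and its `max_insert` value are assumptions added for illustration, as are the `Mate` type and the function name.

```python
from typing import NamedTuple


class Mate(NamedTuple):
    chrom: str  # chromosome the mate is mapped to
    pos: int    # 1-based mapping position


def is_discordant(mate1: Mate, mate2: Mate, max_insert: int = 1000) -> bool:
    """Classify a read pair as discordant.

    Mates on different chromosomes are discordant (the FIG. 4 case).
    Mates on the same chromosome are treated as discordant only when they
    lie farther apart than `max_insert`, which is an assumed cut-off.
    """
    if mate1.chrom != mate2.chrom:
        return True
    return abs(mate1.pos - mate2.pos) > max_insert


print(is_discordant(Mate("chr2", 1000), Mate("chr3", 20500)))  # True
print(is_discordant(Mate("chr2", 1000), Mate("chr2", 1400)))   # False
```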

Referring to FIG. 5, the split read 510 refers to a read in which only part of the nucleotide sequence matches the base sequence of the reference gene and the remaining part does not. When a read corresponding to the test gene of the test sample matches the base sequence of the corresponding reference gene, it can be considered that there is no structural variation in the base sequence of the test gene. However, if, as in the split read 510, only part of the read matches the base sequence of the reference gene and the rest does not, it can be inferred that the base sequence of the test gene differs from the base sequence of the reference gene. Accordingly, the genes 501 and 502 mapped to the split read 510 may be included in the first candidate gene pairs described above.

For example, if the CIGAR string of the split read 511 is 75M25S, the split read 511 is a read in which only 75 bases match gene A 501 and the remaining 25 bases do not match gene A 501. Likewise, when the CIGAR string of the split read 512 is 80M20S, the split read 512 is a read in which only 80 bases match gene A 501 and the remaining 20 bases do not. The read analyzer 110 of FIG. 2 may obtain data on such split reads from data in the BAM format or the SAM format.
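The CIGAR strings in this example can be decoded with a short sketch. The operation letters follow the SAM specification, in which `S` denotes soft-clipped bases. Taking the break point as the last aligned reference base before a trailing clip, or the first aligned base after a leading clip, is an assumed convention, and the positions used are hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def split_read_break_point(pos: int, cigar: str):
    """Return the reference coordinate of the break point implied by a soft clip.

    `pos` is the 1-based leftmost mapping position. Returns None when the
    read has no soft-clipped end.
    """
    ops = parse_cigar(cigar)
    if ops and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in _CONSUMES_REFERENCE)
        return pos + ref_len - 1   # last aligned base before the clipped tail
    if ops and ops[0][1] == "S":
        return pos                 # first aligned base after the clipped head
    return None


print(parse_cigar("75M25S"))                  # [(75, 'M'), (25, 'S')]
print(split_read_break_point(1000, "75M25S")) # 1074
print(split_read_break_point(995, "80M20S"))  # 1074
```

With these hypothetical start positions, the 75M25S and 80M20S reads point to the same break point even though they start at different coordinates.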

Referring to FIG. 6, in the IGV screenshot 600, the reads 610 that match the nucleotide sequence of a reference gene (e.g., the anaplastic lymphoma kinase (ALK) gene) are displayed in gray. However, the reads 620 that do not match the base sequence of the reference gene are displayed in various colors other than gray. That is, the reads 620 shown in various colors are likely to correspond to, for example, discordant PE reads or split reads. Thus, the read analyzer 110 of FIG. 2 obtains data on the reads 620 having a nucleotide sequence different from that of the reference gene.

The chromosomal location of the reference gene (eg, ALK gene) in the IGV screenshot 700 shown in FIG. 7 is similar to the chromosomal location in the IGV screenshot 600 shown in FIG. 6.

However, unlike FIG. 6, the IGV screenshot 700 is more colorful than the IGV screenshot 600 of FIG. 6. This means that there are more reads (e.g., discordant PE reads and split reads) that do not match the nucleotide sequence of the reference gene than in the case of FIG. 6. The reason is that, in the case of FIG. 7, the reads are obtained from an FFPE sample. Because a biopsy sample has a short lifespan, FFPE treatment is necessary to preserve the biochemical properties of the biopsy sample for a long time. Unlike the biopsy sample, the FFPE sample undergoes chemical and structural changes due to the FFPE treatment, so that there are many more mismatched reads than in the case of FIG. 6. As a result, many false positive or false negative determinations may be included in the genetic analysis result. However, according to the present embodiments, false positive or false negative determinations can be removed or reduced regardless of whether the reference gene or the test gene of the test sample is obtained from a biopsy sample or an FFPE sample. This is described in more detail below.

In operation 801, the translocation identifier 120 extracts the first candidate gene pairs that are likely to be involved in a translocation by using the data regarding the split reads and the discordant PE reads obtained by the read analyzer 110. For example, using the data regarding reads that do not match the nucleotide sequence of the reference gene in the IGV screenshot 600 or 700 described above with reference to FIG. 6 or 7, the translocation identifier 120 may extract the first candidate gene pairs from various combinations of genes mapped to split reads and genes mapped to discordant PE reads.
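One way to picture operation 801 is sketched below, assuming that each split read or discordant PE read has already been reduced to a pair of mapped loci and that the target genes are described by simple coordinate intervals. The gene table, coordinates, and function names are all hypothetical.

```python
from collections import Counter
from typing import Optional

# Hypothetical target gene intervals: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "GENE_A": ("chr2", 1_000, 5_000),
    "GENE_B": ("chr3", 20_000, 30_000),
    "GENE_C": ("chr2", 50_000, 60_000),
}


def gene_at(chrom: str, pos: int) -> Optional[str]:
    """Return the name of the target gene containing the locus, if any."""
    for name, (gene_chrom, start, end) in GENES.items():
        if gene_chrom == chrom and start <= pos <= end:
            return name
    return None


def first_candidate_pairs(locus_pairs) -> Counter:
    """Count supporting reads for each pair of distinct genes.

    `locus_pairs` holds ((chrom, pos), (chrom, pos)) tuples, one per split
    read or discordant PE read.
    """
    support = Counter()
    for (chrom1, pos1), (chrom2, pos2) in locus_pairs:
        gene1, gene2 = gene_at(chrom1, pos1), gene_at(chrom2, pos2)
        if gene1 and gene2 and gene1 != gene2:
            support[tuple(sorted((gene1, gene2)))] += 1
    return support


evidence = [
    (("chr2", 1_200), ("chr3", 25_000)),  # GENE_A and GENE_B
    (("chr3", 21_000), ("chr2", 4_000)),  # GENE_B and GENE_A
    (("chr2", 1_200), ("chr2", 1_300)),   # both loci in GENE_A: not a pair
    (("chr2", 100), ("chr3", 25_000)),    # first locus outside any target gene
]
support = first_candidate_pairs(evidence)
print(dict(support))
```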

In operation 802, the translocation identifier 120 extracts second candidate gene pairs including genes to which a plurality of split reads having break points belonging to the same coverage are aligned. As described above, if the number of split reads with break points belonging to the same coverage is greater than or equal to a predetermined threshold, the gene mapped to those split reads can be considered more likely to contain an actual break point of the translocation gene. Accordingly, the translocation identifier 120 selects, from the first candidate gene pairs, the genes to which a plurality of split reads having break points belonging to the same coverage are aligned as the second candidate gene pairs. That is, genes included in the second candidate gene pairs may be more likely to be translocation genes than genes included in the first candidate gene pairs.

In operation 803, the translocation identifier 120 extracts third candidate gene pairs in which the fusion direction between the different genes is from the 5' end to the 3' end or from the 3' end to the 5' end. Even a gene pair in the second candidate gene pairs that is expected to be a fusion gene (translocation gene) is not a translocation gene if its fusion direction is inappropriate. Accordingly, the translocation identifier 120 determines whether the fusion direction of each gene pair included in the second candidate gene pairs is appropriate, that is, whether the genes are joined in the 5'-to-3' or 3'-to-5' direction, and keeps the appropriate gene pairs as the third candidate gene pairs. That is, genes included in the third candidate gene pairs may be more likely to be translocation genes than genes included in the second candidate gene pairs.

In operation 804, when the third candidate gene pairs are extracted, the translocation identifier 120 identifies that the gene pairs included in the third candidate gene pairs correspond to the translocation gene.
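Operations 802 to 804 can be summarized in one compact sketch that starts from first candidate gene pairs already produced by operation 801. The `Candidate` record, the default coverage of 5 bp, and the example data are hypothetical; modeling coverage as a maximum distance between break points is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    gene_a: str
    gene_b: str
    break_points: List[int]  # break points of the split reads supporting the pair
    end_a: str               # end of gene_a at the junction ("5'" or "3'")
    end_b: str               # end of gene_b at the junction ("5'" or "3'")


def max_cluster_size(points, coverage):
    """Largest number of break points lying within `coverage` bp of each other."""
    pts = sorted(points)
    best = left = 0
    for right, point in enumerate(pts):
        while point - pts[left] > coverage:
            left += 1
        best = max(best, right - left + 1)
    return best


def identify_translocations(first, coverage=5, threshold=3):
    # Operation 802: keep pairs with enough split reads sharing a break point.
    second = [c for c in first if max_cluster_size(c.break_points, coverage) >= threshold]
    # Operation 803: keep pairs whose junction joins a 3' end to a 5' end.
    third = [c for c in second if {c.end_a, c.end_b} == {"5'", "3'"}]
    # Operation 804: the remaining pairs are reported as translocation genes.
    return [(c.gene_a, c.gene_b) for c in third]


first = [
    Candidate("GENE_A", "GENE_B", [1074, 1075, 1074], "3'", "5'"),
    Candidate("GENE_C", "GENE_D", [10, 500, 900], "3'", "5'"),  # removed in 802
    Candidate("GENE_E", "GENE_F", [50, 51, 52], "3'", "3'"),    # removed in 803
]
print(identify_translocations(first))
```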

Referring to FIG. 9, a plurality of split reads 910 may be mapped to a gene (gene X) 900 of a test sample. The gene X 900 to which the split reads 910 are mapped may be included in the first candidate gene pairs. Data on break points 920 may be associated with the respective split reads 910. Since the split reads 910 may arise from various causes such as sequencing errors, gene insertions, and gene deletions, it cannot be determined that gene X 900 is part of a translocation gene merely because multiple split reads 910 are mapped to gene X 900.

However, when a plurality of split reads 930 having break points 940 belonging to the same coverage are present among the split reads 910 mapped to gene X 900, it can be considered highly likely that a break point of the translocation gene exists in gene X 900. Thus, if the number of split reads 930 having break points 940 belonging to the same coverage is greater than or equal to a predetermined threshold, gene X 900 is identified as likely to correspond to a portion of the translocation gene. That is, gene X 900 may be extracted as a gene included in the second candidate gene pairs.

Meanwhile, the break points indicated by the split reads 930 that share an actual break point may not be exactly identical, for various reasons such as sequencing errors. Therefore, it may be desirable for the translocation identifier 120 to determine whether the break points fall within a predetermined range (i.e., the coverage), rather than whether they have exactly the same value.

Referring to FIG. 10, when gene X 1010 on chromosome 2 1001 and gene Y 1020 on chromosome 3 1002 are included in the second candidate gene pairs, the translocation identifier 120 may determine the fusion direction of gene X 1010 and gene Y 1020.

In the fusion genes 1030 and 1040 on chromosome 2 1001 and chromosome 3 1002, the 3' end of gene X 1010 is joined to the 5' end of gene Y 1020, and thus the translocation identifier 120 may determine that the fusion direction of the fusion gene 1030 is appropriate.

Among the second candidate gene pairs, the fusion genes 1030 and 1040 having the proper fusion direction are the third candidate gene pairs, and thus the translocation identifier 120 determines that the gene pairs included in the third candidate gene pairs are translocation genes.

In the case of the FFPE sample (FIG. 7), even if a plurality of false positive reads are extracted, the actual translocation gene can be identified by eliminating or reducing false positive determinations through the determination of the break point and the fusion direction.

FIG. 11 illustrates a result of identifying translocation genes of EML4 (echinoderm microtubule-associated protein-like 4) and ALK, according to an exemplary embodiment.

Referring to FIG. 11, data is shown for the EML4-ALK translocation gene identified through the translocation gene analysis described above. The IGV screenshot 1101 on the right shows the reads mapped to EML4, and the IGV screenshot 1102 on the left shows the reads mapped to ALK. The reads mapped to EML4 are split at a break point coverage of 42536701 to 42559688, and the reads mapped to ALK are split at a break point coverage of 29415639 to 29446500. In addition, 39 supporting reads were used to identify the EML4-ALK translocation gene. FIG. 11 is only a simulation result verifying the identification of the translocation gene by applying the gene analysis method described in the present embodiments to a test sample of an actual patient, and thus the present embodiments are not limited by FIG. 11.

FIG. 12 is a flowchart of a method of analyzing a gene, according to an embodiment. Referring to FIG. 12, the gene analysis method includes steps that are processed in time series in the gene analysis apparatus 10 described in the foregoing figures. Therefore, even if omitted below, the contents described above may be applied to the gene analysis method of FIG. 12.

In operation 1201, the read analyzer 110 obtains data regarding split reads and discordant PE reads from next-generation sequencing (NGS) data of a test sample.

In operation 1202, the translocation identifier 120 extracts the first candidate gene pairs that are likely to be involved in a translocation within the chromosome of the test sample by using the split reads and the discordant PE reads.

In operation 1203, the translocation identifier 120 identifies the translocation gene among the first candidate gene pairs based on the break points indicated by the split reads and the fusion direction of the first candidate gene pairs.

Referring to FIG. 13, the computing device 1 includes a genetic analysis device (processor) 10, a data interface 11, and a memory 12. FIG. 13 shows only the components of the computing device 1 that are related to the present embodiment, so as not to obscure the features of the present embodiment. Therefore, the computing device 1 may further include other general-purpose components in addition to those shown in FIG. 13.

The data interface 11 receives the reference gene data 20 of the normal population and the test gene data 30 of the subject described in FIG. 1. That is, the data interface 11 may be implemented as hardware of a wired/wireless network interface through which the computing device 1 communicates with other external devices. The data interface 11 transmits the received genetic data 20 and 30 to the genetic analysis device (processor) 10.

The data interface 11 may receive the test gene data 30 of the test subject from an external next-generation sequencing device, a microarray, or the like for sequencing the test gene of the test subject.

The memory 12 is hardware for storing data to be processed in the computing device 1 and the processed results, and may be implemented with memory chips such as random access memory (RAM) or read-only memory (ROM), or with a hard disk drive (HDD), a solid state drive (SSD), or the like. That is, the memory 12 may store the genetic data 20 and 30 received by the data interface 11, and may store data on the first to third candidate gene pairs processed by the genetic analysis device (processor) 10, data on the identified translocation genes, and the like.

The genetic analysis device (processor) 10 is a module implemented in one or more processing units, and may be implemented as a combination of a microprocessor having an array of multiple logic gates and a memory module storing a program that can be executed on the microprocessor. The genetic analysis device (processor) 10 may be implemented in the form of a module of an application program. The genetic analysis device (processor) 10 is a hardware device for processing the gene analysis described above with reference to FIGS. 1 to 12.

Information about the translocation gene identified by the genetic analysis device (processor) 10 may be transmitted via the data interface 11 to another external device, such as a display device or another computing device, or over an external network, for example, the Internet, to a public database (DB) server or the like.

According to the embodiments described above, the translocation gene can be detected from cancer tissue of a subject (e.g., a cancer patient). Furthermore, even if the genes (test genes) of the cancer tissue (test sample) obtained from the subject are slightly damaged chemically by FFPE treatment, the translocation gene can be accurately identified.

The device according to the embodiments may include a processor, a memory for storing and executing program data, persistent storage such as a disk drive, a communication port for communicating with an external device, and a user interface device such as a touch panel, a key, or a button. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on the processor. Examples of the computer-readable recording medium include magnetic storage media (e.g., read-only memory (ROM), random-access memory (RAM), floppy disks, hard disks, etc.) and optical reading media (e.g., CD-ROM and DVD (Digital Versatile Disc)). The computer-readable recording medium can also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. The medium is readable by the computer, can be stored in the memory, and can be executed by the processor.

The present embodiment may be represented by functional block configurations and various processing steps. Such functional blocks may be implemented by any number of hardware and/or software configurations that perform particular functions. For example, an embodiment may employ integrated circuit configurations, such as memory, processing, logic, and look-up tables, that can execute various functions under the control of one or more microprocessors or other control devices. Just as its components may be implemented as software programming or software elements, the present embodiment may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. The functional aspects may be implemented as an algorithm running on one or more processors. In addition, the present embodiment may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" are used broadly and are not limited to mechanical and physical configurations. These terms may include the meaning of a series of software routines operating in conjunction with a processor or the like.

The specific implementations described in this embodiment are examples and do not limit the technical scope in any way. For brevity, descriptions of conventional electronic configurations, control systems, software, and other functional aspects of the systems may be omitted. In addition, the line connections or connection members between the components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device, they may be represented as various replaceable or additional functional connections, physical connections, or circuit connections.

In the present specification (particularly in the claims), the use of the term “the” and similar referring terms may correspond to both the singular and the plural. In addition, when a range is described, it includes the individual values belonging to the range (unless stated otherwise), which is the same as setting forth each individual value constituting the range in the detailed description. Finally, unless an order of the steps constituting the method is explicitly stated, or unless stated to the contrary, the steps may be performed in any suitable order. The method is not necessarily limited to the order in which the steps are described.

The present invention has been described above with a focus on its preferred embodiments. Those skilled in the art will appreciate that the present invention can be implemented in modified forms without departing from its essential features. Therefore, the disclosed embodiments should be considered in a descriptive sense only and not for purposes of limitation. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope will be construed as being included in the present invention.

Claims

1. A method of analyzing a gene, the method comprising:

obtaining, from next-generation sequencing (NGS) data of a test sample, data relating to split reads and discordantly aligned paired-end (PE) reads;

extracting, using the split reads and the PE reads, first candidate gene pairs having a possibility of translocation within a chromosome of the test sample; and

identifying a translocation gene among the first candidate gene pairs based on breakpoints indicated by the split reads and a fusion direction of the first candidate gene pairs.

2. The method of claim 1, wherein the identifying comprises extracting, from the extracted first candidate gene pairs, second candidate gene pairs comprising a gene to which a plurality of split reads having breakpoints belonging to the same coverage are aligned, and wherein the translocation gene is identified from the extracted second candidate gene pairs.

3. The method of claim 2, wherein, for the gene included in the extracted second candidate gene pairs, the number of split reads having breakpoints belonging to the same coverage is greater than or equal to a predetermined threshold.

4. The method of claim 2, wherein the identifying comprises extracting, from among the extracted second candidate gene pairs, third candidate gene pairs in which the fusion direction between different genes is from the 5' end to the 3' end or from the 3' end to the 5' end, and wherein the translocation gene is identified from the extracted second candidate gene pairs.

5. The method of claim 1, wherein the NGS data comprises data in a Sequence Alignment/Map (SAM) format or in a BAM format, which is a binary version of SAM.

6. The method of claim 5, wherein the obtaining comprises obtaining data of FLAG and Compact Idiosyncratic Gapped Alignment Report (CIGAR) strings for each of the split reads and the PE reads from the data in the BAM format or the SAM format.

7. The method of claim 1, wherein the NGS data is generated by targeted sequencing for identifying base sequences of target genes in the test sample.

8. The method of claim 1, wherein the test sample is a biopsy sample or a formalin-fixed, paraffin-embedded (FFPE) sample.

9. A non-transitory computer-readable recording medium having recorded thereon a program for executing the method of claim 1.

10. A device for analyzing a gene, the device comprising:

a read analysis unit configured to obtain data relating to split reads and discordantly aligned paired-end (PE) reads from next-generation sequencing (NGS) data of a test sample; and

a translocation identification unit configured to extract, using the split reads and the PE reads, first candidate gene pairs having a possibility of translocation within a chromosome of the test sample, and to identify a translocation gene among the first candidate gene pairs based on breakpoints indicated by the split reads and a fusion direction of the first candidate gene pairs.

11. The device of claim 10, wherein the translocation identification unit extracts, from among the extracted first candidate gene pairs, second candidate gene pairs comprising a gene to which a plurality of split reads having breakpoints belonging to the same coverage are aligned, and identifies the translocation gene from the extracted second candidate gene pairs.

12. The device of claim 11, wherein, for the gene included in the extracted second candidate gene pairs, the number of split reads having breakpoints belonging to the same coverage is greater than or equal to a predetermined threshold.

13. The device of claim 11, wherein the translocation identification unit extracts, from among the extracted second candidate gene pairs, third candidate gene pairs in which the fusion direction between different genes is from the 5' end to the 3' end or from the 3' end to the 5' end, and identifies the translocation gene from the extracted second candidate gene pairs.

14. The device of claim 10, wherein the NGS data comprises data in a Sequence Alignment/Map (SAM) format or in a BAM format, which is a binary version of SAM.

15. The device of claim 14, wherein the read analysis unit obtains data of FLAG and Compact Idiosyncratic Gapped Alignment Report (CIGAR) strings for each of the split reads and the PE reads from the data in the BAM format or the SAM format.

16. The device of claim 10, wherein the NGS data is generated by targeted sequencing for identifying base sequences of target genes in the test sample.

17. The device of claim 10, wherein the test sample is a biopsy sample or a formalin-fixed, paraffin-embedded (FFPE) sample.