WO2017204414A1 - Method and apparatus for analyzing degree of cross-contamination of sample

Info

Publication number: WO2017204414A1
Application number: PCT/KR2016/009451
Authority: WO
Inventors: 박동현; 손대순; 박웅양
Original assignee: 삼성전자 주식회사; 사회복지법인 삼성생명공익재단
Priority date: 2016-05-25
Filing date: 2016-08-25
Publication date: 2017-11-30
Also published as: KR20170133079A; KR101882866B1

Abstract

Provided are a method and an apparatus for analyzing a degree of cross-contamination of a sample with regard to a target sample, the method comprising the steps of: acquiring first sequence information of a nucleic acid fragment from each of a target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample of the target sample and the additional sample; calculating an allele frequency from each of the acquired first sequence information and second sequence information; and comparing the calculated allele frequencies with regard to a specific chromosomal locus. By measuring the degree of cross-contamination between samples at a specific chromosomal locus, the method and the apparatus can ensure the reliability of variant detection results.

Description

Method and apparatus for analyzing the degree of cross contamination of a sample

The present disclosure relates to a method for analyzing the degree of contamination between samples, a computer-readable recording medium having recorded thereon a program for executing the method, and an apparatus for analyzing the degree of contamination between samples.

A genome is the entire genetic information of an organism. Techniques for sequencing an individual's genome, such as DNA chips, next-generation sequencing, and next-next-generation sequencing, have been developed. Next-generation sequencing is used interchangeably with massively parallel sequencing or second-generation sequencing.

Analysis of genetic information such as nucleotide sequences and proteins is widely used to find genes associated with diseases such as diabetes and cancer, or to identify correlations between genetic diversity and the expressed characteristics of individuals. In particular, genetic data collected from an individual is important for identifying the genetic characteristics of that individual associated with different symptoms or disease progression. Thus, genetic data such as an individual's nucleotide sequences and proteins are essential for identifying current and future disease-related information in order to prevent disease or to select the optimal treatment at an early stage of disease. Techniques for accurately analyzing variations such as Single Nucleotide Variants (SNVs), Copy Number Variations (CNVs), Insertions and Deletions (InDels), and translocations, and for diagnosing diseases using them, are being studied.

Conventionally, in variant detection, the effect of contamination between samples has been regarded as insignificant and has mostly been either ignored or estimated using general population frequencies provided by known databases. However, in order to detect variations with low allele frequencies, a technique for measuring or correcting the effect of interference between samples is needed.

According to one aspect, there is provided a method for analyzing the degree of cross-contamination of a sample with respect to a target sample, the method comprising: obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample of the target sample and the additional sample; calculating an allele frequency from each of the obtained first sequence information and second sequence information; and comparing the calculated allele frequencies with respect to a specific site of a chromosome.

According to another aspect, there is provided an apparatus for analyzing the degree of cross-contamination of a sample with respect to a target sample, the apparatus comprising: a sequence information obtaining unit for obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample in which the target sample and the additional sample are mixed; an allele frequency calculating unit for calculating an allele frequency from each of the obtained first sequence information and second sequence information; and a calculation unit for comparing the calculated allele frequencies with respect to a specific site of a chromosome.

According to another aspect, there is provided a computer-readable recording medium having recorded thereon a program for executing the method.

The sample may be a biological sample from a subject or a synthetic sample. The subject may be, for example, a human, a non-human primate, a cow, a horse, a pig, a sheep, a goat, a dog, a cat, or a rodent. The biological sample may be obtained from blood, plasma, serum, urine, saliva, mucosal secretions, sputum, feces, tears, or a combination thereof. The biological sample of the subject may be a sample of eukaryotic cells, prokaryotic cells, viruses, bacteriophages, and the like derived from various species. In addition, the sample may include a nucleic acid of the subject or a synthetic nucleic acid. The term nucleic acid is used interchangeably with polynucleotide or oligonucleotide of any length. The nucleic acid may be cell-free DNA (cfDNA) or isolated DNA.

The nucleic acid may be separated from the sample by a method known to those skilled in the art. The length of the nucleic acid fragment may be about 10 bp (base pairs) to about 2000 bp, about 15 bp to about 1500 bp, about 20 bp to about 1000 bp, about 20 bp to about 500 bp, or about 20 bp to about 200 bp.

Obtaining sequence information of the nucleic acid fragment may include obtaining sequence information by performing next-generation sequencing (NGS) on the separated nucleic acid. "Next-generation sequencing" is used interchangeably with "massively parallel sequencing" or "second-generation sequencing". Next-generation sequencing refers to a technique of fragmenting a full-length genome in chip-based and PCR-based paired-end formats and sequencing the fragments at high speed based on hybridization. Next-generation sequencing sequences a large number of nucleic acid fragments in parallel, and targeted sequencing or panel sequencing may be performed on the basis of next-generation sequencing. Next-generation sequencing includes, for example, the 454 platform (Roche), GS FLX Titanium, Illumina MiSeq, Illumina HiSeq, Illumina Genome Analyzer, the Solexa platform, the SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, Helicos Biosciences Heliscope, Single Molecule Real-Time (SMRT™) technology from Pacific Biosciences, or a combination thereof.

The method may further comprise preparing a nucleic acid library to perform next generation sequencing. The nucleic acid library can be prepared according to the next generation sequencing scheme. Nucleic acid libraries can be constructed according to the manufacturer's instructions to provide next generation sequencing.

The sequence information of the obtained nucleic acid fragments may be called a read.

Sequence information of the nucleic acid fragments may be stored in the system, and N masking may be performed. N masking means treating individual bases that were read with excessively low quality as missing. In addition, a low-quality read filter may be applied. The low-quality read filter excludes sequence information of nucleic acid fragments that were read with excessively low quality.
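The two preprocessing steps above can be sketched as follows. This is a minimal illustration only: the function names, the use of Phred-scaled per-base qualities, and the cutoff value of 20 are assumptions, since the description does not specify quality thresholds.

```python
def n_mask(seq, quals, min_base_q=20):
    """Replace each base whose quality is below min_base_q with 'N'.

    min_base_q is a hypothetical cutoff; the description gives no value.
    """
    return "".join(b if q >= min_base_q else "N" for b, q in zip(seq, quals))


def passes_read_filter(quals, min_mean_q=20.0):
    """Keep a read only if its mean base quality reaches min_mean_q (assumed)."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean_q


# Hypothetical reads: (sequence, per-base quality scores)
reads = [
    ("ACGT", [30, 10, 35, 32]),  # one poor base, good read overall
    ("TTGA", [5, 8, 12, 9]),     # poor read overall
]

# Drop low-quality reads first, then mask the remaining low-quality bases.
kept = [n_mask(seq, quals) for seq, quals in reads if passes_read_filter(quals)]
print(kept)
```

Filtering whole reads before masking avoids spending effort on reads that will be discarded anyway; the order is a design choice of this sketch, not something the description prescribes.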

The method may include assigning sequence information of the nucleic acid fragment to a chromosome by mapping the obtained sequence information to a human reference genome. The human reference genome may be hg18 or hg19. Sequence information that maps to only one genomic position in the human reference genome may be designated as unique sequence information. The sequence information of the nucleic acid fragments can be assigned to a chromosomal position based on the designated unique sequence information. The chromosomal locus may be a continuous range on a chromosome having a length of at least about 5 kb, about 10 kb, about 20 kb, about 50 kb, about 100 kb, about 1000 kb, or about 2000 kb. The chromosomal locus may also be a single chromosome.

In the step of mapping the obtained sequence information to the human reference genome, a global alignment or a local alignment may be performed in parallel. Global alignment refers to a method of placing the entire sequence information of a nucleic acid fragment at the most similar portion of the reference genome, and local alignment refers to a method of placing a part of the sequence information of a nucleic acid fragment at the most similar portion of the reference genome sequence.

The method may include identifying a variation in the DNA of the sample. The variation may be identified using a known variant detection program, for example, GATK, SAMtools, MoDIL, SeqSeq, PeMer, VariationHunter, Pindel, BreakDancer, or MuTect, but is not limited thereto.

The first sequence information may be sequence information of a nucleic acid fragment obtained from each of a plurality of samples including a target sample and an additional sample. The first sequence information may be a result of sequencing the target sample alone. In addition, the first sequence information may be a result of sequencing each sample individually for one or more, two or more, or five or more additional samples.

The second sequence information may be sequence information of a nucleic acid fragment obtained from a mixed sample in which a target sample and an additional sample are mixed. The sample loaded onto the sequencer that performs the sequencing may be a mixed sample in which a plurality of samples are mixed. Using a mixed sample in which a plurality of samples are mixed has the advantages of reducing the cost of increasing the target concentration and of providing high throughput in a short time. In this case, the plurality of samples can be distinguished from each other by tagging the library of each sample with a unique label.
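Separating the reads of a mixed sample by their labels can be sketched as follows. This is an illustration under assumptions: the function name, the label strings, and the representation of a read as a (label, sequence) pair are all hypothetical and are not taken from the description.

```python
def demultiplex(reads, label_to_sample):
    """Group reads from a mixed sample by the sample their label points to.

    Reads carrying a label that is not in label_to_sample are dropped.
    """
    by_sample = {sample: [] for sample in label_to_sample.values()}
    for label, seq in reads:
        sample = label_to_sample.get(label)
        if sample is not None:
            by_sample[sample].append(seq)
    return by_sample


# Hypothetical labels and reads from one mixed run
label_to_sample = {"ATCACG": "S1", "CGATGT": "S2"}
reads = [
    ("ATCACG", "ACGT"),
    ("CGATGT", "TTGA"),
    ("ATCACG", "GGCC"),
    ("NNNNNN", "AAAA"),  # unrecognized label
]

pooled = demultiplex(reads, label_to_sample)
print(pooled)
```

A read whose label was swapped with that of another sample would be assigned to the wrong list here without any error, which is why the comparison against individually sequenced samples is needed.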

The method may include calculating an allele frequency from each of the obtained first and second sequence information. In the target region where sequencing was performed, the allele frequency of each allele can be calculated. The allele frequency may refer to a numerical value representing a composition ratio between different alleles constituting the same gene in one sample. The allele frequency may be expressed as one or more of A, G, C, and T, or the frequency of sequence information of all of A, G, C, and T.
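As a minimal sketch of this calculation, the allele frequency at one site can be taken as the share of each base among the reads covering that site. The function name and the input format (a string of the bases observed at the site) are assumptions made for illustration.

```python
from collections import Counter

BASES = "AGCT"


def allele_frequencies(observed_bases):
    """Return the frequency of A, G, C, and T among the bases seen at one site.

    Characters other than A, G, C, and T (for example masked 'N' bases)
    are ignored when computing the depth.
    """
    counts = Counter(b for b in observed_bases if b in BASES)
    depth = sum(counts.values())
    if depth == 0:
        return {b: 0.0 for b in BASES}
    return {b: counts[b] / depth for b in BASES}


# Hypothetical site covered by 9 reads with T, 1 read with C, and 1 masked base
freqs = allele_frequencies("T" * 9 + "C" + "N")
print(freqs)
```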

The method may comprise comparing the calculated allele frequency for a particular site of the chromosome.

The specific site of the chromosome may be the same or a corresponding exon site or intron site across the plurality of samples, and may be the same base position on the same-numbered chromosome. The specific site of the chromosome may be a part or all of the region, including the mutation prediction sites and their surrounding sites, that is subjected to sequencing in sequencing or targeted sequencing.

For the same target sample and the same specific site of the chromosome, the allele frequency obtained from the first sequence information and the allele frequency obtained from the second sequence information can be compared. For example, for the same target sample and specific site, the allele frequency of A from the first sequence information and the allele frequency of A from the second sequence information can be compared. Similarly, the allele frequencies of G, C, and T from the first sequence information can each be compared with the allele frequencies of G, C, and T from the second sequence information. If the comparison shows a significant difference in the allele frequency of any one of A, G, C, or T, the target sample may be determined to be contaminated by the additional sample. The greater the difference, the more likely it is that the specific site of the target sample is contaminated by the additional sample. The comparison may be made on the number of alleles having a given allele frequency, or on the ratio of the number of alleles having that allele frequency to the total number of alleles.
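The per-base comparison can be sketched as follows. The description does not define what counts as a "significant difference", so the fixed threshold of 0.01 used here is purely an assumption, as are the function names and the example frequencies; a statistical test could be substituted.

```python
BASES = "AGCT"


def allele_frequency_shift(first, second):
    """Per-base difference (second minus first) at one site of one target sample."""
    return {b: second[b] - first[b] for b in BASES}


def looks_contaminated(first, second, threshold=0.01):
    """Flag the site if any base shifts by more than threshold (assumed cutoff)."""
    shift = allele_frequency_shift(first, second)
    return any(abs(d) > threshold for d in shift.values())


# Hypothetical frequencies for the same target sample and site
alone = {"A": 0.0, "G": 0.001, "C": 0.0, "T": 0.999}  # first sequence information
mixed = {"A": 0.0, "G": 0.030, "C": 0.0, "T": 0.970}  # second sequence information

print(allele_frequency_shift(alone, mixed))
print(looks_contaminated(alone, mixed))
```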

"Cross-contamination" of a sample means that sequence information of a nucleic acid fragment is tagged with an incorrect label, either because the label of one sample is tagged onto the sequence information of a nucleic acid fragment of another sample, or because labels are exchanged between the sequence information of nucleic acid fragments of different samples. Because of cross-contamination, a significant difference may be observed between the allele frequency analyzed from the first sequence information and the allele frequency analyzed from the second sequence information for a specific chromosomal site of the sample.

The method may include: selecting, in the obtained first sequence information, a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selecting the sites excluding the mutation prediction site set as a control site set; calculating allele frequencies of genotype alleles and background alleles from each of the obtained first sequence information and second sequence information for the mutation prediction site set or the control site set; and comparing the calculated allele frequencies with respect to the mutation prediction site set or the control site set.

The variation may mean a characteristic that differs among the plurality of samples at a specific site of the chromosome. The characteristic may be a nucleic acid sequence or a nucleotide sequence. For a specific site of the chromosome, the genotype allele of one sample obtained from the first sequence information may have a nucleic acid sequence or nucleotide sequence different from the genotype allele of another sample obtained from the first sequence information. The variation may be a Single Nucleotide Polymorphism (SNP). An SNP refers to a single-nucleotide difference that appears between individuals of one species, and is a genetic change or variation showing a difference in one nucleotide (A, G, C, T) at a specific position in the nucleic acid sequence. In particular, SNPs are genetic factors associated with disease, and different SNPs are associated with differences in disease resistance, susceptibility, and severity among subjects. The plurality of samples may have SNP sites that are different from or identical to one another.

The variation may be a variation with respect to the reference genome. Specifically, the variation may include a variation of the nucleic acid sequence or the nucleotide sequence with respect to the reference genome. The variation of the nucleic acid sequence or nucleotide sequence may comprise a substitution, insertion, deletion, or translocation of one or more nucleotides relative to the reference genome. A substitution of one nucleotide may be, for example, a Single Nucleotide Variation (SNV). An SNV refers to a single-nucleotide difference that appears in a small number of individuals within one species, and may be, for example, a difference from the nucleotide sequence of the reference genome that appears in sequencing data. The plurality of samples may have SNV sites that are different from or identical to one another. Allele frequencies of a variation can be calculated by counting the number of alleles in next-generation sequencing data using an existing program such as samtools.

The method includes selecting a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selecting the positions other than the mutation prediction site set as the control site sets.

The "mutation prediction site" may mean a specific site of the chromosome having the above-described variation. It may mean a site at which the genotype allele of one sample obtained from the first sequence information differs from the genotype allele of another sample obtained from the first sequence information. Such a site may be a mutation prediction site of the sample. For example, when the genotype allele of the target sample has an SNP, the SNP site may be included in the mutation prediction sites of the target sample. The plurality of samples may have mutation prediction sites that are different from or identical to one another. Referring to FIG. 3, among positions 1, 2, 3, 4, and 5, the mutation prediction sites of Sample 1 (S1) may be positions 2 to 4, and the mutation prediction sites of Sample 2 (S2) may be positions 2 to 5.

The "mutation prediction site set" (union variant set) is a collection that combines the mutation prediction sites of each of the plurality of samples, that is, the target sample and the additional sample, and may be the union of the mutation prediction sites of the plurality of samples. Referring to FIG. 3, among positions 1 to 5, the mutation prediction site set of Sample 1 and Sample 2 may be positions 2 to 5.

The sites excluding the mutation prediction site set may be selected as the control site set. The "control site set" means the set of sites at which no variation is detected in any of the plurality of samples, because the genotype alleles and background alleles of the plurality of samples obtained from the first sequence information are the same at those specific sites of the chromosome.
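The two site sets can be sketched with plain set operations. The function name and data layout are assumptions; the site numbers follow the example given in the text for Sample 1 (positions 2 to 4) and Sample 2 (positions 2 to 5) among positions 1 to 5.

```python
def site_sets(all_sites, variant_sites_by_sample):
    """Return (mutation prediction site set, control site set).

    The mutation prediction site set is the union of every sample's
    mutation prediction sites; the control site set is everything else.
    """
    union = set().union(*variant_sites_by_sample.values())
    control = set(all_sites) - union
    return union, control


variant_sites_by_sample = {
    "S1": {2, 3, 4},
    "S2": {2, 3, 4, 5},
}
union_set, control_set = site_sets(range(1, 6), variant_sites_by_sample)
print(sorted(union_set), sorted(control_set))
```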

The method may calculate the allele frequencies of the alleles, that is, the genotype alleles and/or background alleles, from each of the obtained first sequence information and second sequence information for the mutation prediction site set or the control site set. Among the allele frequencies calculated from the first sequence information and the second sequence information as described above, the allele frequencies of the genotype alleles and/or background alleles for the mutation prediction site set or the control site set can be selected or derived. The allele frequency of the target sample may be represented by the frequency of sequence information of one or more of A, G, C, and T, or of all of A, G, C, and T.

The method may determine an allele obtained from the first sequence information to be a background allele if it has an allele frequency of less than 10%, and to be a genotype allele if it has an allele frequency of 10% or more. The criterion for distinguishing the alleles may be any criterion used for genotyping.

The "background allele" may mean an allele having an allele frequency, obtained from sequence information, of less than 10%, 5% or less, 1% or less, 0.5% or less, or 0.1% or less. The background allele can be understood in the sense in which the term is used in the art. The "genotype allele" may refer to an allele having an allele frequency of 10% or more obtained from sequence information. The allele frequency of the genotype allele may be at least 10%, at least 30%, at least 50%, at least 90%, or 100%. The genotype allele can be understood in the sense in which the term is used in the art. At a given chromosomal locus, the alleles can typically be A, G, C, and T; of these, bases having an allele frequency of at least 10% may be determined to be genotype alleles, and bases having an allele frequency of 1% or less may be determined to be background alleles. Referring to FIG. 3, the genotype allele at position 1 of Sample 1 is T, and the background alleles are A, G, and C. In addition, the genotype alleles at position 5 of Sample 1 are T and C, and the background alleles are A and G.
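The 10% rule can be sketched as a simple partition of the four bases. The function name and the example frequencies are hypothetical; only the 10% cutoff and the resulting split (genotype alleles T and C, background alleles A and G, as described for position 5 of Sample 1) come from the text.

```python
def classify_alleles(freqs, genotype_cutoff=0.10):
    """Split bases into (genotype alleles, background alleles).

    A base is a genotype allele if its frequency is at least
    genotype_cutoff (10% in the description), otherwise a background allele.
    """
    genotype = {b for b, f in freqs.items() if f >= genotype_cutoff}
    background = set(freqs) - genotype
    return genotype, background


# Hypothetical frequencies for a heterozygous T/C site
site_freqs = {"A": 0.001, "G": 0.0, "C": 0.48, "T": 0.519}
genotype, background = classify_alleles(site_freqs)
print(sorted(genotype), sorted(background))
```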

The method may include comparing the calculated allele frequencies with respect to the mutant prediction site set or the control site set.

For the same target sample and the set of mutation prediction sites, allele frequencies can be compared in the first sequence information and the second sequence information. For example, for the same target sample and set of mutation prediction sites, the allele frequency of A in the second sequence information and the allele frequency of A in the first sequence information can be compared. Similarly, for the same target sample and the set of mutation prediction sites, allele frequencies of G, C and T in the second sequence information and allele frequencies of G, C and T in the first sequence information can be compared. As a result of the comparison, if there is a significant difference in the allele frequency of any one of A, G, C or T, it may be determined that the target sample is contaminated by another sample.

For the same target sample and the control site set, allele frequencies can be compared between the first sequence information and the second sequence information. For example, for the same target sample and control site set, the allele frequency of A in the second sequence information and the allele frequency of A in the first sequence information can be compared. Similarly, the allele frequencies of G, C, and T in the second sequence information can be compared with the allele frequencies of G, C, and T in the first sequence information. In this case, since the background alleles and genotype alleles of the plurality of samples obtained from the first sequence information are the same at the control site set, it may be determined that there is no cross-contamination of the samples. Referring to FIG. 3, at position 1 all samples have the same genotype allele, T, and the same background alleles, A, G, and C, and no variation is detected. Position 1 therefore belongs to the control site set. At this position, the background allele of one sample may be determined not to be interfered with by the genotype allele of another sample.

The method may include: selecting as a test group, for the mutation prediction site set, the alleles that in the first sequence information are background alleles of the target sample and genotype alleles of the additional sample; and selecting as a control group, for the mutation prediction site set and the control site set, the alleles that in the first sequence information are background alleles of the target sample and background alleles of the additional sample.

The “control group” means an allele that is a background allele of a target sample in the first sequence information and a background allele of a further sample with respect to the mutation predicting site set and the control site set.

The method may include comparing allele frequencies of the control group obtained from the target sample in the first sequence information, and allele frequencies of the control group obtained from the target sample in the second sequence information.

Referring to FIG. 3, the alleles that are background alleles at position 1 of Sample 1 (S1) and, at the same time, background alleles of the additional samples, Sample 2 (S2), Sample 3 (S3), and Sample 4 (S4), are A, G, and C. The allele frequencies of A, G, and C, the control group of Sample 1, in the obtained first sequence information may each be compared with the allele frequencies of A, G, and C, the control group of Sample 1, in the second sequence information. In addition, the alleles that are background alleles at position 2 of Sample 1 and, at the same time, background alleles of the additional samples, Sample 2, Sample 3, and Sample 4, are G and C. The allele frequencies of G and C, the control group of Sample 1, in the obtained first sequence information may each be compared with the allele frequencies of G and C in the second sequence information. In addition, the allele that is a background allele at position 3 of Sample 1 and a background allele of the additional samples, Sample 2, Sample 3, and Sample 4, is A. The allele frequency of A, the control group of Sample 1, in the obtained first sequence information may be compared with the allele frequency of A in the second sequence information. For the control group of the target sample, there may be little or no difference between the allele frequency in the first sequence information and the allele frequency in the second sequence information. For the control group, it may be determined that there is no possibility of cross-contamination of the sample.

The "test group" refers to the alleles that, for the mutation prediction site set, are background alleles of the target sample in the first sequence information and genotype alleles of the additional sample. Because cross-contamination of a sample is considered possible for the test group at the corresponding specific chromosomal sites of the plurality of samples, the test group may be the object of the contamination analysis.
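Selecting the test group and the control group for one target sample can be sketched as follows. The function name and the genotype table are illustrative assumptions (two samples and two sites only), chosen so that the result agrees with what the text states for positions 1 and 2 of Sample 1; the table is not a reproduction of FIG. 3.

```python
BASES = "AGCT"


def select_groups(target, genotypes):
    """Return (test group, control group) as lists of (site, allele) pairs.

    genotypes maps sample -> {site: set of genotype alleles}, taken from the
    first sequence information. Every base that is not a genotype allele of
    a sample at a site is treated as a background allele of that sample.
    """
    others = [s for s in genotypes if s != target]
    test, control = [], []
    for site in sorted(genotypes[target]):
        target_background = set(BASES) - genotypes[target][site]
        for allele in sorted(target_background):
            # Test group: background in the target, genotype in another sample.
            if any(allele in genotypes[s][site] for s in others):
                test.append((site, allele))
            else:
                control.append((site, allele))
    return test, control


# Hypothetical genotype alleles per sample and site
genotypes = {
    "S1": {1: {"T"}, 2: {"A"}},
    "S2": {1: {"T"}, 2: {"A", "T"}},
}
test_group, control_group = select_groups("S1", genotypes)
print(test_group)
print(control_group)
```

At a control site every sample has the same genotype alleles, so no allele can land in the test group there; the restriction of the test group to the mutation prediction site set therefore falls out of the definition without a separate check.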

The method may compare the allele frequency of the test group obtained from the target sample in the first sequence information, and the allele frequency of the test group obtained from the target sample in the second sequence information.

The method of analyzing the degree of contamination and the method of selecting a test group may vary depending on how the samples are mixed and which samples are mixed. The contamination that occurs may differ by sample and by specific chromosomal site. If cross-contamination between samples occurs for a target sample, the allele frequency of a background allele in the mutation prediction site set of the target sample may be affected by the genotype alleles of other samples. The comparing step may analyze the number of alleles having a given allele frequency in the test group and/or the control group. The number of alleles at each allele frequency may be compared, or the ratio of the number of alleles having that allele frequency to the total number of alleles in each group may be compared.
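The count and ratio comparison between groups can be sketched as follows. The function name, the cutoff of 0.005, and the example frequencies are assumptions for illustration; the description does not fix how allele frequencies are binned.

```python
def count_and_ratio(freqs, cutoff):
    """Return (number of alleles with frequency >= cutoff, their share of the group)."""
    n = sum(1 for f in freqs if f >= cutoff)
    ratio = n / len(freqs) if freqs else 0.0
    return n, ratio


# Hypothetical allele frequencies of one target sample in the second
# (mixed-sample) sequence information, one value per (site, allele) pair
test_group_freqs = [0.0, 0.004, 0.012, 0.020]
control_group_freqs = [0.0, 0.0, 0.001, 0.0]

test_result = count_and_ratio(test_group_freqs, cutoff=0.005)
control_result = count_and_ratio(control_group_freqs, cutoff=0.005)
print(test_result, control_result)
```

Using the ratio rather than the raw count makes groups of different sizes comparable, which matters because the control group usually contains many more alleles than the test group.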

Referring to FIG. 3, the allele that is a background allele at position 4 of Sample 1 and a genotype allele of the additional samples, Sample 2, Sample 3, and Sample 4, is T. The allele frequency of T, the test group of Sample 1, in the obtained first sequence information can be compared with the allele frequency of T, the test group of Sample 1, in the second sequence information. In addition, the allele that is a background allele at position 4 of Sample 2 and a genotype allele of Sample 1, Sample 3, and Sample 4, which are its additional samples, is G. The allele frequency of G, the test group of Sample 2, in the obtained first sequence information can be compared with the allele frequency of G, the test group of Sample 2, in the second sequence information. Because the allele frequency of the background allele G at position 4 of Sample 2 may be affected by the genotype allele G of Sample 1, Sample 3, and Sample 4, the allele frequency of the background allele G of Sample 2 may change. In addition, the allele that is a background allele at position 2 of Sample 1 and a genotype allele of the additional samples, Sample 2, Sample 3, and Sample 4, is T. The allele frequency of T, the test group of Sample 1, in the obtained first sequence information can be compared with the allele frequency of T, the test group of Sample 1, in the second sequence information.

Another aspect provides an apparatus 100 for analyzing the degree of cross-contamination of a sample with respect to a target sample, the apparatus comprising: a sequence information obtaining unit for obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample of the target sample and the additional sample; an allele frequency calculating unit for calculating an allele frequency from each of the obtained first sequence information and second sequence information; and a calculation unit for comparing the calculated allele frequencies with respect to a specific site of a chromosome.

The apparatus may include a "... unit" or "... module" that carries out, in time sequence, the method of analyzing the degree of cross-contamination of the sample. Therefore, even where omitted below, the above description of the method for analyzing the degree of cross-contamination of a sample also applies to the apparatus for analyzing the degree of cross-contamination of the sample. The components may correspond to a processor. Such a processor may be implemented as an array of multiple logic gates, or as a combination of a general-purpose microprocessor and a memory storing a program executable on the microprocessor. In addition, it will be understood by those skilled in the art that the components may be implemented in other types of hardware.

The sequence information obtaining unit 110 obtains sequence information from a sequencing device. The allele frequency calculating unit 120 calculates an allele frequency from each of the obtained first sequence information and second sequence information. The calculation unit 130 compares the allele frequencies calculated from the first sequence information and the second sequence information with respect to a specific site of the chromosome. The calculation unit 130 may compare the number of alleles at each allele frequency, or compare the ratio of the number of alleles having that allele frequency to the total number of alleles.

The apparatus may include: a site selector that selects, in the obtained first sequence information, a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selects the sites other than the mutation prediction site set as a control site set; an allele frequency calculating unit that calculates allele frequencies of genotype alleles and background alleles from each of the obtained first sequence information and second sequence information for the mutation prediction site set or the control site set; and a calculation unit that compares the calculated allele frequencies with respect to the mutation prediction site set or the control site set.

The site selector 140 selects the mutation prediction site set by combining the mutation prediction sites of each of the plurality of samples, and selects the control site set by combining the sites at which no variation is detected in any of the plurality of samples.

The device may include a group selector for selecting a test group and a control group based on the mutation prediction site set and the control site set. The group selector 150 selects a test group and a control group.

The apparatus may include an allele frequency calculating unit for calculating allele frequencies of genotype alleles and background alleles from each of the obtained first sequence information and second sequence information with respect to the mutation prediction site set or the control site set. The allele frequency calculating unit may calculate the allele frequency of an allele, including a genotype allele and/or a background allele.

For the mutation prediction site set, the group selector may select as the test group the alleles that in the first sequence information are background alleles of the target sample and genotype alleles of the additional sample. For the mutation prediction site set and the control site set, the group selector may select as the control group the alleles that in the first sequence information are background alleles of the target sample and background alleles of the additional sample. If necessary, the test group and the control group may be selected simultaneously or sequentially.

The calculation unit may compare the allele frequency of the test group obtained from the target sample in the first sequence information with the allele frequency of the test group obtained from the target sample in the second sequence information, and may compare the allele frequency of the control group obtained from the target sample in the first sequence information with the allele frequency of the control group obtained from the target sample in the second sequence information. The calculation unit may analyze the number of alleles having a given allele frequency in the test group and/or the control group. For each allele frequency, the number of alleles having that allele frequency may be compared, or the ratio of that number to the total number of alleles in the group may be compared.
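The ratio-based comparison can be sketched as a normalized histogram of allele frequencies, computed once for the single-sample run and once for the mixed run. The binning rule is an assumption (rounding to the nearest 0.001), and the function name and input values are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_distribution(
    afs: Iterable[float], bin_width: float = 0.001
) -> Dict[float, float]:
    """Map each allele-frequency bin to the fraction of alleles in it.

    Assumption: frequencies are binned by rounding to the nearest
    multiple of bin_width.
    """
    values = list(afs)
    counts = Counter(round(round(af / bin_width) * bin_width, 6) for af in values)
    total = len(values)
    return {bin_value: n / total for bin_value, n in sorted(counts.items())}


# Illustrative allele frequencies of the same group in the two runs.
single_run = [0.0, 0.0, 0.0, 0.001]
mixed_run = [0.0, 0.0, 0.001, 0.002]
dist_single = frequency_distribution(single_run)
dist_mixed = frequency_distribution(mixed_run)

# Per-bin difference in the fraction of alleles (mixed minus single).
shift = {
    b: dist_mixed.get(b, 0.0) - dist_single.get(b, 0.0)
    for b in sorted(set(dist_single) | set(dist_mixed))
}
```

A shift of mass from the zero bin toward higher bins in the test group, without a matching shift in the control group, is the pattern the comparison is intended to reveal.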

The apparatus may include an allele determining unit 160 that determines an allele obtained from the first sequence information to be a background allele when the allele has an allele frequency of less than 10%, and to be a genotype allele when the allele has an allele frequency of 10% or more.
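The determination rule in this paragraph is a single threshold test. A minimal sketch follows, with the 10% cut-off taken from the paragraph above and the function and constant names chosen here for illustration:

```python
GENOTYPE_THRESHOLD = 0.10  # 10% cut-off stated in the paragraph above


def classify_allele(allele_frequency: float,
                    threshold: float = GENOTYPE_THRESHOLD) -> str:
    """Label an allele from its frequency in the first sequence information."""
    return "genotype" if allele_frequency >= threshold else "background"


labels = [classify_allele(af) for af in (0.0, 0.007, 0.10, 0.48)]
```

The threshold is kept as a parameter so that a different cut-off can be supplied without changing the function.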

Another aspect provides a computer readable recording medium having recorded thereon a program for executing a method of analyzing a degree of cross contamination of a sample with respect to the target sample.

The method may be implemented in software form readable by various computer means and recorded on a computer readable recording medium. Here, the recording medium may include a program command, a data file, a data structure, etc. alone or in combination. The program instructions recorded on the recording medium may be those specially designed and constructed for the method according to the above, or may be known and available to those skilled in the computer software arts.

For example, the recording medium may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as Compact Disk Read Only Memory (CD-ROM) and digital video disks (DVD); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, random access memory (RAM), and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter. Such a hardware device may be configured to operate as one or more software modules to perform the operations of the method described above, and vice versa.

Although the specification and drawings describe exemplary device configurations, the functional operations and subject matter described herein may be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed herein and their structural equivalents, or in a combination of one or more of them. Implementations of the subject matter described herein may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program storage medium for execution by, or for controlling the operation of, an apparatus according to the method. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more thereof.

A computer program (also known as a program, software, software application, script, or code) that is installed on a device according to the method and executes the method may be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a single file dedicated to the program in question, in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code), or in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document). The computer program may be deployed to run on a single computer or on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.

In the process of obtaining sequence information from a mixture of a plurality of individual biological samples and extracting mutations, the contamination rate at a given chromosomal site can be accurately measured when the samples are contaminated. Whereas the effects of cross-contamination between samples have conventionally been ignored or estimated by comparison with known database values, the degree of contamination between samples can here be measured using results obtained on the experimental platform itself. Reliability can therefore be given to the mutation extraction results of individual samples. Furthermore, when analyzing similar samples, the degree of cross-contamination that may be introduced by the protocol used for the analysis can be standardized.

FIG. 1 is a diagram for describing a method of selecting a mutation prediction site set.

FIG. 2 is a graph showing the ratio of the number of background alleles with allele frequencies of 0 to 0.01 in the test and control groups.

FIG. 3 is a diagram for describing a method of selecting a control group and a test group among a plurality of samples.

FIG. 4 is a block diagram showing the configuration of an apparatus for analyzing the degree of cross-contamination of a sample.

Hereinafter, the present invention will be described in more detail with reference to Examples. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present invention.

Example 1. Mutation Extraction from HapMap Cell Lines

Eight normal HapMap cell lines were purchased from the Coriell Institute (http://ccr.coriell.org/). The DNA concentration and purity of the cell lines were measured by Picogreen fluorescence analysis using a Nanodrop 8000 UV-Vis spectrometer (Thermo Scientific) and a Qubit 2.0 fluorescence spectrometer (Life Technologies). Fragment size distributions, which indicate the degree of DNA degradation, were measured using a 2200 TapeStation instrument (Agilent Technologies) and real-time PCR on an Mx3005p (Agilent Technologies) according to the manufacturer's instructions.

The gDNA of the cell lines was sheared by sonication using a Covaris S2 (7 min, 0.5% duty, intensity = 0.1, 50 cycles/burst; Covaris Inc.) into fragments of about 150 to about 200 bp. The fragments were then purified using AMPure XP beads (Beckman Coulter) at 1.8 times the volume of the fragmented gDNA sample. After fragmentation and prior to target enrichment, end repair, A-tailing, adapter ligation, and PCR were performed using the KAPA Hyper kit (Kapa Biosystem Inc.). Ligation was performed overnight at 4 °C using Pentabase indexed adapters.

Agilent SureDesign was used to design unique RNA baits targeting approximately 0.5 Mb of the human genome. The target region contains exons from 83 cancer-related genes and introns from five genes that are frequently rearranged in solid tumors. After pre-amplification of the libraries of the cell line samples, the double-stranded DNA concentration was measured using a Qubit fluorometer (Life Technologies). The fragment size distribution was measured using a 2200 TapeStation instrument (Agilent Technologies). The library was adjusted to a total of 750 ng of DNA for each hybridization selection reaction. SureSelect blocking oligonucleotides were used for hybridization selection.

Prior to capture hybridization, the libraries were labeled so that each of the plurality of samples could be distinguished. Based on DNA concentration and average fragment size, each library was normalized to a concentration of 2 nM, and the libraries were pooled in equal volumes. After denaturation with 0.2 N NaOH, the library was diluted to 20 pM. Cluster amplification of the denatured template was performed, the flowcell was sequenced using the HiSeq 2500 v3 Sequencing-by-Synthesis kit (2 × 100 bp reads), and base calling was then performed using RTA v1.12.4.2.

The reads obtained were aligned to the hg19 human reference using BWA v0.7.5a to obtain BAM files. SAMtools v0.1.18, GATK v2.2-25, and Picard v1.93 were used to sort the SAM/BAM files, perform local realignment, and mark duplicates, and off-target reads, improper pairs, and duplicates were removed. Thereafter, mutations were detected using MuTect 1.1.4.

Example 2. Confirmation of the Degree of Cross-Contamination of Samples

Sequence information was obtained, and test and control groups were selected from the calculated allele frequencies as follows. The allele frequencies of the background alleles in each group were then examined. Within the allele frequency interval of 0 to 0.01, the number of background alleles having each allele frequency was counted, and the ratio of that number to the total number of background alleles in the group was calculated. Here, an allele with an allele frequency of 1% or less was determined to be a background allele, and an allele with an allele frequency of 10% or more was determined to be a genotype allele.

Table 1

Ratio: average ratio of alleles with the corresponding allele frequency in each group. Count: number of alleles with the corresponding allele frequency in each group.

Allele frequency	Ratio, single, test group	Ratio, 8-plex mixed, test group	Ratio, single, control group	Ratio, 8-plex mixed, control group	Count, single, test group	Count, 8-plex mixed, test group	Count, single, control group	Count, 8-plex mixed, control group
0	0.797848869	0.669169306	0.919425121	0.925068092	373155	312971	1291528	1299454
0.001	0.088772281	0.151291505	0.055996532	0.053292686	41519	70759	78659	74861
0.002	0.049208593	0.076875978	0.016010577	0.014389426	23015	35955	22490	20213
0.003	0.024514934	0.045659392	0.005270831	0.004536428	11466	21355	7404	6372
0.004	0.019319821	0.025120766	0.001901991	0.001561708	9036	11749	2672	2194
0.005	0.010829032	0.012480406	0.000716695	0.000587665	5065	5837	1007	826
0.006	0.002457644	0.006990689	0.000303265	0.000244267	1149	3270	426	343
0.007	0.001761265	0.004264514	0.000143802	0.000120398	824	1995	202	169
0.008	0.001060489	0.003364711	7.51044E-05	6.83414E-05	496	1574	106	96
0.009	0.000359712	0.001290674	4.62728E-05	3.94209E-05	168	604	65	55
0.01	0.000700776	0.000709409	2.86536E-05	2.45602E-05	328	332	40	35
0.011	0.001408678	0.00036075	1.44158E-05	1.41488E-05	659	169	20	20
0.012	0	0.000180375	1.47717E-05	1.03224E-05	0	84	21	15
0.013	0	0	1.06783E-05	6.40701E-06	0	0	15	9
0.014	0	0.000180375	5.69512E-06	5.7841E-06	0	84	8	8

Allele frequency values greater than 0.014 are omitted.

(1) Confirmation of allele frequency distribution in the test group

Sequence information was obtained for one of the HapMap cell line samples, and 467,701 alleles were confirmed to be included in the test group. From the result of sequencing that HapMap cell line sample alone, the number of alleles having each allele frequency within the allele frequency range of 0 to 0.01 was analyzed for the test group, and the ratios are shown in the graph (see Table 1 and FIG. 2, single, test group). Likewise, from the result of sequencing a mixed sample of eight HapMap cell lines that includes that sample, the number of alleles having each allele frequency within the allele frequency range of 0 to 0.01 was analyzed for the test group, and the ratios are shown in the graph (see Table 1 and FIG. 2, 8-plex, test group).

Referring to FIG. 2 and Table 1, the number of background alleles having a given allele frequency differs between the two analyses. For example, in the test group, alleles with an allele frequency of 0.007 accounted for about 0.176% of the group when the single sample was analyzed and about 0.427% when the mixed sample of eight cell lines was analyzed. Thus, even for the same HapMap cell line sample, the allele frequency of the background alleles changed depending on whether the sample was analyzed alone or in the mixture of eight cell lines.

The average allele frequency was obtained by multiplying each allele frequency by the number of alleles having that allele frequency, summing the products, and dividing the sum by the total number of alleles belonging to the test group. Referring to FIG. 2, the average allele frequency of the HapMap cell line sample was about 0.052% when the sample was analyzed alone. When it was analyzed in the mixed sample containing eight HapMap cell line samples, the average allele frequency of the HapMap cell line sample was about 0.077%. Therefore, the test group of that HapMap cell line sample has an average degree of contamination of about 0.025% (0.077% − 0.052%) by the other HapMap cell line samples.
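The average described above is a count-weighted mean. The sketch below applies it to the test-group counts listed in Table 1; the function and variable names are illustrative. Because Table 1 lists allele frequencies only up to 0.014, the values computed from it are expected to fall somewhat below the reported 0.052% and 0.077% and are shown only to illustrate the arithmetic.

```python
from typing import Dict


def average_allele_frequency(counts_by_af: Dict[float, int], group_total: int) -> float:
    """Sum of (allele frequency x number of alleles) over the group total."""
    return sum(af * n for af, n in counts_by_af.items()) / group_total


TEST_GROUP_TOTAL = 467_701  # alleles in the test group, as stated in the text

# Test-group counts from Table 1 (bins above 0.014 are not listed there).
single_counts = {
    0.0: 373155, 0.001: 41519, 0.002: 23015, 0.003: 11466, 0.004: 9036,
    0.005: 5065, 0.006: 1149, 0.007: 824, 0.008: 496, 0.009: 168,
    0.010: 328, 0.011: 659, 0.012: 0, 0.013: 0, 0.014: 0,
}
mixed_counts = {
    0.0: 312971, 0.001: 70759, 0.002: 35955, 0.003: 21355, 0.004: 11749,
    0.005: 5837, 0.006: 3270, 0.007: 1995, 0.008: 1574, 0.009: 604,
    0.010: 332, 0.011: 169, 0.012: 84, 0.013: 0, 0.014: 84,
}

mean_single = average_allele_frequency(single_counts, TEST_GROUP_TOTAL)
mean_mixed = average_allele_frequency(mixed_counts, TEST_GROUP_TOTAL)
# Estimated average contamination: increase in mean background frequency.
contamination = mean_mixed - mean_single
print(f"single: {mean_single:.4%}  8-plex: {mean_mixed:.4%}  "
      f"difference: {contamination:.4%}")
```

The same function can be applied to the control-group counts with the control-group total to check that the difference there is close to zero.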

(2) Confirmation of allele frequency distribution in the control group

Sequence information was obtained for one of the HapMap cell line samples, and 1,404,712 alleles were confirmed to be included in the control group. From the result of sequencing that HapMap cell line sample alone, the number of alleles having each allele frequency within the allele frequency range of 0 to 0.01 was analyzed for the control group, and the ratios are shown in the graph (see Table 1 and FIG. 2, single, control group). Likewise, from the result of sequencing a mixed sample of eight HapMap cell lines that includes that sample, the number of alleles having each allele frequency within the allele frequency range of 0 to 0.01 was analyzed for the control group, and the ratios are shown in the graph (see Table 1 and FIG. 2, 8-plex, control group). Referring to FIG. 2 and Table 1, there is almost no difference in the number of alleles having a given allele frequency. For example, in the control group, alleles with an allele frequency of 0.007 accounted for about 0.014% of the group when the single sample was analyzed and about 0.012% when the mixed sample of eight cell lines was analyzed. Thus, for the same HapMap cell line sample, the allele frequency of the background alleles differed little between analysis alone and analysis in the mixture of eight cell lines.

The average allele frequency was obtained by multiplying each allele frequency by the number of alleles having that allele frequency, summing the products, and dividing the sum by the total number of alleles belonging to the control group. Referring to FIG. 2, the average allele frequency of the HapMap cell line sample was about 0.012% when the sample was analyzed alone. When it was analyzed in the mixed sample containing eight HapMap cell line samples, the average allele frequency of the HapMap cell line sample was about 0.011%. Therefore, it was confirmed that the control group of that HapMap cell line sample was affected little or not at all by contamination from the other HapMap cell line samples.

The present invention has been described above with reference to its preferred embodiments. Those skilled in the art will understand that the present invention may be implemented in a modified form without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered in a descriptive sense only and not for purposes of limitation. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope will be construed as being included in the present invention.

Claims

A method for analyzing the degree of cross-contamination of a sample with respect to a target sample, the method comprising:

obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample in which the target sample and the additional sample are mixed;

calculating allele frequencies from the obtained first sequence information and second sequence information, respectively; and

comparing the calculated allele frequencies with respect to a specific site of a chromosome.
The method according to claim 1, wherein the comparing comprises:

selecting, in the obtained first sequence information, a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selecting the sites other than the mutation prediction site set as a control site set;

calculating allele frequencies of genotype alleles and background alleles from the obtained first sequence information and second sequence information, respectively, for the mutation prediction site set or the control site set; and

comparing the calculated allele frequencies with respect to the mutation prediction site set or the control site set.
The method according to claim 2, wherein the selecting comprises:

selecting, as a test group, alleles that are background alleles of the target sample and genotype alleles of the additional sample in the first sequence information, for the mutation prediction site set; and

selecting, as a control group, alleles that are background alleles of the target sample and background alleles of the additional sample in the first sequence information, for the mutation prediction site set and the control site set.
The method according to claim 2,

wherein the comparing comprises comparing the allele frequency of the test group obtained from the target sample in the first sequence information with the allele frequency of the test group obtained from the target sample in the second sequence information.
The method according to claim 2,

wherein the comparing comprises comparing the allele frequency of the control group obtained from the target sample in the first sequence information with the allele frequency of the control group obtained from the target sample in the second sequence information.
The method according to claim 2,

wherein an allele obtained from the first sequence information is determined to be a background allele when the allele has an allele frequency of less than 10%, and

is determined to be a genotype allele when the allele has an allele frequency of 10% or more.
The method of claim 1, wherein the mutation is SNP or SNV.
A device for analyzing the degree of cross-contamination of a sample with respect to a target sample, the device comprising:

a sequence information obtaining unit for obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample of the target sample and the additional sample;

an allele frequency calculation unit for calculating allele frequencies from the obtained first sequence information and second sequence information, respectively; and

a calculation unit for comparing the calculated allele frequencies with respect to a specific site of a chromosome.
The device according to claim 8, comprising:

a site selection unit that, in the obtained first sequence information, selects a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selects the sites other than the mutation prediction site set as a control site set;

an allele frequency calculation unit that calculates allele frequencies of genotype alleles and background alleles from the obtained first sequence information and second sequence information, respectively, with respect to the mutation prediction site set or the control site set; and

a calculation unit that compares the calculated allele frequencies with respect to the mutation prediction site set or the control site set.
The device according to claim 9,

further comprising a group selector that selects, as a test group, alleles that are background alleles of the target sample and genotype alleles of the additional sample in the first sequence information, for the mutation prediction site set, and

selects, as a control group, alleles that are background alleles of the target sample and background alleles of the additional sample in the first sequence information, for the mutation prediction site set and the control site set.
The device according to claim 9,

wherein the calculation unit compares the allele frequency of the test group obtained from the target sample in the first sequence information with the allele frequency of the test group obtained from the target sample in the second sequence information.
The device according to claim 9,

wherein the calculation unit compares the allele frequency of the control group obtained from the target sample in the first sequence information with the allele frequency of the control group obtained from the target sample in the second sequence information.
The device according to claim 9,

further comprising an allele determining unit that determines an allele obtained from the first sequence information to be a background allele when the allele has an allele frequency of less than 10%, and

determines the allele to be a genotype allele when the allele has an allele frequency of 10% or more.
The device of claim 8, wherein the mutation is SNP or SNV.
A computer-readable recording medium having recorded thereon a program for executing the method according to any one of claims 1 to 7.