WO2018088635A1 - Detection of cancer-specific diagnostic markers in the genome - Google Patents

Detection of cancer-specific diagnostic markers in the genome

Publication number
WO2018088635A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
cancer
base
samples
sample
Application number
PCT/KR2017/001581
Other languages
English (en)
French (fr)
Inventor
조동호
한규범
서혜인
정병창
Original Assignee
한국과학기술원
Priority claimed from KR1020170019559A (KR101928094B1)
Application filed by 한국과학기술원
Priority to US16/323,948 (US20190252040A1)
Publication of WO2018088635A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The present invention relates to detecting cancer-specific diagnostic markers based on analysis of cancer genome base sequences.
  • International Patent Publication No. 2014-052909 discloses a method for diagnosing a disease using a database of disease, clinical, and genetic information that takes into account an individual's phenotypic information and genetic variation.
  • That publication provides a system that links sequencing of gene ranges with patients' clinical information in order to diagnose diseases and to ascertain the correlation between disease and genetic information at high resolution.
  • Patent Document 1: International Publication WO2014-052909 (published 2015.07.30.)
  • The present invention analyzes cancer-specific genome changes to identify the relationship between cancer and genetic variation, and provides a method for detecting cancer-specific diagnostic markers with high accuracy.
  • The present invention provides a method for detecting cancer diagnosis markers, in the form of a program executed by arithmetic processing means including a computer, the method comprising: inputting whole genome sequencing information of cancer samples and normal samples; obtaining and analyzing information by comparing and/or contrasting the whole genome sequencing information with reference genome sequence information; deriving a disease classification degree from the analyzed information and sample information; constructing, using the disease classification degree, a library of cancer-specific base sequence information from the whole genome sequencing information of the cancer samples and normal samples; and deriving the classification accuracy using the disease classification degree and the number of base mutations in the library as variables.
  • The cancer diagnosis marker detection method of the present invention uses genome base sequence information obtained from actual cancer patients and normal subjects, and analyzes the base sequence variation information and base sequence position information that appear in those genomes relative to reference genome information. By determining cancer-specific composite genome information in this way, it can detect cancer-specific diagnostic markers.
  • Cancer-specific diagnostic markers can be detected by analyzing known cancer genomes.
  • The library makes it easy to analyze complex mutations and thus to detect cancer-specific diagnostic markers with high accuracy.
  • Cancer diagnosis markers detected in accordance with the present invention can readily be applied across the medical and pharmaceutical fields, for example in biochips, precision diagnosis systems, kits, and medical devices.
  • FIG. 1 is an exemplary view showing the types of reference genome information and whole genome sequencing information of a sample used in the cancer diagnosis marker detection method according to the present invention.
  • FIG. 2 is an exemplary view showing the result of comparing and/or contrasting the reference genome information with the whole genome sequencing information of a sample in the method for detecting cancer diagnosis markers according to the present invention.
  • FIG. 3 shows the target range extraction step in the method for detecting cancer diagnosis markers according to the present invention.
  • FIG. 4 is an exemplary view of building a library in the method for detecting cancer diagnosis markers according to the present invention.
  • FIG. 5 shows an embodiment of the method for detecting cancer diagnosis markers according to the present invention.
  • FIG. 6 is an exemplary diagram of diagnosing cancer in an arbitrary sample using a marker detected by the cancer diagnosis marker detection method according to the present invention.
  • The present invention relates to a method for detecting cancer-specific diagnostic markers based on the analysis of genome information.
  • The present invention compares and analyzes genome information related to general life phenomena and to disease, based on whole genome base sequence data, to help understand the function of the genome and, further, to detect precise cancer diagnosis markers.
  • The method for detecting cancer diagnosis markers of the present invention is generally performed as follows. First, whole genome base sequence information is obtained for cancer samples and normal samples. Then, with the reference genome as the basis, analysis information is obtained that includes the base mutations and position information of the cancer and normal samples, from which the base mutations and position information attributable to cancer-specific genome changes are predicted.
  • The present invention provides a method for detecting cancer diagnosis markers, in the form of a program executed by arithmetic processing means including a computer, the method comprising: inputting whole genome sequencing information of cancer samples and normal samples; obtaining and analyzing information by comparing and/or contrasting the whole genome sequencing information with reference genome sequence information; deriving a disease classification degree from the analyzed information and sample information; constructing, using the disease classification degree, a library of cancer-specific sequencing information from the whole genome sequencing information of the cancer samples and normal samples; and deriving the classification accuracy using the disease classification degree and the number of base mutations in the constructed library as variables.
  • The whole genome sequencing information of cancer and normal samples can be obtained from genetic information databases.
  • A sequencing company can be used to obtain the whole genome sequencing information of a sample or, in some cases, whole exome sequencing information, which covers the set of exons that directly take part in protein synthesis.
  • The whole genome sequencing information of the samples may vary somewhat depending on the genetic information database, the sequencing equipment, and the sequencing method.
  • The whole genome sequencing information of cancer samples and normal samples is the basis for detecting cancer diagnosis markers according to the present invention.
  • The following steps proceed based on the differences in genome characteristics among the samples contained in the whole genome sequencing information.
  • Position information of the base sequence, mutation information of the base sequence, and reliability information can be used as important information for cancer diagnosis marker detection.
  • Information may be added or omitted as needed.
  • Specific information contained in the genomes of the samples can be obtained.
  • The genome base sequence variations common to cancer samples, and combinations of them, can be obtained.
  • Reference genome sequence information can be obtained from the human genome map produced by the Human Genome Project, which basically includes chromosome information and the position and base sequence information within each chromosome.
  • Analyzing the whole genome sequencing information against the reference genome sequence information yields, for the cancer and normal sample genomes, the chromosome information of each base sequence, its position within the chromosome, the base sequence information of the reference genome, the base sequence information of the sample genome, and the reliability of each base sequence. These can be used as important information for detecting cancer diagnostic markers.
  • The genome can be analyzed in its own form. Accordingly, the analysis of the whole genome sequencing information and the reference genome sequence information can be performed using a genome analysis program.
  • For example, open source programs such as SAMtools (Sequence Alignment/Map tools) and BCFtools can be used.
  • The results of processing and analyzing the data may vary with the program used.
  • The analyzed information can be stored and managed by converting it into a uniform format on a given platform.
  • Chromosome information (chromosome, #CHROM), position information of the base within the chromosome (position, POS), base sequence information of the reference genome (reference, REF), base sequence information of the sample genome (alternation, ALT), and reliability (quality, QUAL) are important for cancer diagnosis marker detection. This information records where the cancer or normal samples differ from the reference genome.
  • The information on the varied base sequence portions is particularly important for detecting cancer diagnosis markers.
  • Base sequence position information and base sequence variation information can be obtained for each sample and used as needed.
  • For the regions where base variations occur, the chromosome information (#CHROM), the position information of the base within the chromosome (POS), the base sequence information of the reference genome (REF), and the base sequence information of the sample genome (ALT) are explained as follows.
  • The chromosome information (#CHROM) for a varied portion identifies the chromosome in which a base sequence variation appears when the whole genome sequencing information of the cancer or normal sample is compared and/or contrasted with the reference genome information.
  • The position information (POS) is the position of the varied base within the chromosome identified by #CHROM. The base sequence information of the reference genome (REF) is the base of the reference genome at the position given by POS, and the base sequence information of the sample genome (ALT) is the base present in the sample at that same position.
  • The first line of the data shown in FIG. 2 lists the fields: chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL).
  • The second line of the data in FIG. 2 shows the values of those fields: chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL).
  • At position 109 (POS) of chromosome 1 (#CHROM), the base of the reference genome is 'A' (REF),
  • and the base of the cancer samples and/or normal samples at that position is the one listed under ALT.
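
As a rough illustration of the fields described above, the sketch below parses one data line into its #CHROM, POS, REF, ALT and QUAL values. It assumes the analysis information is stored as tab-separated lines in VCF column order (CHROM, POS, ID, REF, ALT, QUAL), which is the format BCFtools writes. `VariantRecord` and `parse_variant_line` are hypothetical names, and the ALT base 'G' and the QUAL value in the sample line are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    chrom: str   # #CHROM: chromosome in which the variation occurs
    pos: int     # POS: position of the base within the chromosome
    ref: str     # REF: base of the reference genome at POS
    alt: str     # ALT: base of the sample genome at POS
    qual: float  # QUAL: reliability of the call


def parse_variant_line(line: str) -> VariantRecord:
    """Parse one tab-separated data line laid out in VCF column order."""
    fields = line.rstrip("\n").split("\t")
    # VCF fixed columns start with: CHROM, POS, ID, REF, ALT, QUAL
    chrom, pos, _id, ref, alt, qual = fields[:6]
    return VariantRecord(chrom, int(pos), ref, alt, float(qual))


# Hypothetical line: chromosome 1, position 109, reference base 'A'.
# The ALT base and QUAL value are made up for illustration.
record = parse_variant_line("1\t109\t.\tA\tG\t50.0")
print(record)
```

Keeping each record immutable makes it safe to use the same records later as entries in a library.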
  • A disease classification degree (classification ratio, CR) is calculated from the analyzed information and the sample information.
  • Deriving the disease classification degree makes it possible to build a library of specific base sequences.
  • Here, the analyzed information is at least one of the following, obtained by comparing and/or contrasting the whole genome sequencing information of the cancer and normal samples with the reference genome information: chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL).
  • The sample information includes at least one of: the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples with a base mutation, the number of cancer samples without a base mutation, the number of normal samples with a base mutation, and the number of normal samples without a base mutation.
  • The disease classification degree can be derived according to [Equation I] or [Equation II].
  • The disease classification degree is used to build libraries of cancer-specific base sequence information from the cancer samples and normal samples.
  • The function used for the disease classification degree can vary.
  • The function for deriving the disease classification degree can be chosen freely by a person practicing the present invention according to the analyzed information and the sample information, and is not limited to [Equation I] or [Equation II].
  • A new disease classification degree can also be derived and used by combining a previously derived disease classification degree with the analysis information and sample information.
  • For example, suppose the base mutation is located at position 109 of chromosome 1 and the base of the reference genome information is 'A'. If the ratio of cancer samples carrying the variant base is 35/50 (35 of the 50 cancer samples have the mutation) and the ratio of normal samples is 20/50 (20 of the 50 normal samples have the mutation), the disease classification degree at position 109 of chromosome 1 is 0.28 according to [Equation I].
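
The sample counts in this example can be turned into the per-group ratios that feed the disease classification function. The sketch below only computes those inputs; [Equation I] itself is not reproduced in this text, so no attempt is made to implement it here. `mutation_ratio` is a hypothetical helper name.

```python
from fractions import Fraction


def mutation_ratio(mutated: int, total: int) -> Fraction:
    """Fraction of samples in a group that carry the base mutation."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= mutated <= total:
        raise ValueError("mutated must lie between 0 and total")
    return Fraction(mutated, total)


# Counts from the example at position 109 of chromosome 1.
cancer_with = mutation_ratio(35, 50)   # cancer samples with the mutation
normal_with = mutation_ratio(20, 50)   # normal samples with the mutation
normal_without = 1 - normal_with       # normal samples without the mutation

print(float(cancer_with), float(normal_with), float(normal_without))
```

Exact fractions avoid rounding noise when many positions are later sorted by their scores.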
  • If the genome base sequence of normal samples, compared with the reference genome information, shows the same base sequence variation as the genome sequence of cancer samples, that variation is unlikely to be a cancer-specific change. Thus, the number of cancer samples with the base mutation and the number of normal samples without it can be particularly important parameters of the disease classification degree.
  • Using base sequence variations that are common in cancer samples, a library can be built from the cancer-specific sequencing information that is the target of cancer diagnosis markers. The information in the library can also be used to derive the highest probability of cancer discrimination when a given number of base sequence variations occur in each library.
  • Libraries of cancer-specific sequencing information can be constructed based on the disease classification degree.
  • Once the disease classification degrees are derived, a library is the set of analysis information (chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)) whose disease classification degree is equal to or greater than a specific value.
  • That is, a library of cancer-specific sequencing information corresponds to the set of analysis information selected from the entire analysis information on the basis of a specific disease classification degree value.
  • FIG. 4 shows an example of constructing a library. After the disease classification degree is derived from the analysis information and the sample information, a library (right) can be built from the entries whose disease classification degree is 0.7 or more.
  • In this way, after the disease classification degree is derived, a specific disease classification degree value is chosen, and the set of analysis information (chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)) that meets or exceeds that value is assembled into a library of cancer-specific base sequence information.
  • Such a library can be viewed as the set of analysis information that satisfies a specific disease classification degree value, and its contents vary with the value chosen.
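
The library construction described above amounts to filtering the analysis information by a threshold on the disease classification degree. A minimal sketch follows, assuming each entry already carries its degree; `ScoredVariant` and `build_library` are hypothetical names and the three entries are invented data, with 0.7 used as the threshold as in the FIG. 4 example.

```python
from typing import List, NamedTuple


class ScoredVariant(NamedTuple):
    chrom: str     # #CHROM
    pos: int       # POS
    ref: str       # REF
    alt: str       # ALT
    qual: float    # QUAL
    degree: float  # disease classification degree for this position


def build_library(variants: List[ScoredVariant], threshold: float) -> List[ScoredVariant]:
    """Keep entries whose degree is at least `threshold`, highest degree first."""
    kept = [v for v in variants if v.degree >= threshold]
    return sorted(kept, key=lambda v: v.degree, reverse=True)


# Invented entries for illustration only.
analysis = [
    ScoredVariant("1", 109, "A", "G", 50.0, 0.28),
    ScoredVariant("2", 2045, "C", "T", 60.0, 0.75),
    ScoredVariant("7", 5512, "G", "A", 45.0, 0.90),
]
library = build_library(analysis, threshold=0.7)
print([(v.chrom, v.pos, v.degree) for v in library])
```

Raising or lowering the threshold yields a different library, which is why the threshold is treated as a variable in the later accuracy search.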
  • The disease classification degree is derived for each base position and base variation in the analysis information.
  • The set of analysis information and base mutation information with the highest probability can then be obtained and used as markers.
  • The classification accuracy for cancer samples and normal samples differs with the predetermined number of base mutations. Therefore, by calculating the classification accuracy of the samples with the disease classification degree and the predetermined number of base mutations as variables, the base mutation information best suited as a cancer diagnosis marker can be obtained from the whole genome sequencing information.
  • The classification accuracy can be obtained by using the Rand measure (Rand index) as the objective function, and a numerical analysis program such as MATLAB (Matrix Laboratory) can be used to derive the maximum classification accuracy of the library as a function of the disease classification degree and the predetermined number of base mutations.
  • T is the predetermined number of base mutations
  • TP is the number of cases where cancer samples are classified as cancer
  • TN is the number of cases where normal samples are classified as normal
  • FP is the number of cases where normal samples are classified as cancer
  • FN is the number of cases where cancer samples are classified as normal.
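
The accuracy equation is not reproduced in this text. For a two-class problem the Rand index reduces to (TP + TN) / (TP + TN + FP + FN), so the sketch below assumes that standard form. `rand_accuracy` is a hypothetical name and the counts in the example are invented.

```python
def rand_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Rand index for binary classification: share of correctly classified samples.

    tp: cancer samples classified as cancer
    tn: normal samples classified as normal
    fp: normal samples classified as cancer
    fn: cancer samples classified as normal
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


# Invented counts: 50 cancer samples (40 caught) and 50 normal samples (45 cleared).
accuracy = rand_accuracy(tp=40, tn=45, fp=5, fn=10)
print(accuracy)
```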
  • The disease classification degree (I) and the predetermined number of base mutations (T) that give the highest classification accuracy in the library can be obtained according to [Equation IV].
  • T is the predetermined number of base mutations; because it is also a variable, it is written T*, and its maximum value is the total number of base variants included in the analysis information sorted according to I.
  • TP is the number of cases where cancer samples are classified as cancer.
  • TN is the number of cases where a normal sample is classified as normal.
  • FP is the number of cases where a normal sample is classified as cancer.
  • FN is the number of cases where a cancer sample is classified as normal.
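
The joint search over I and T can be pictured as a small grid search. The sketch below is one possible reading, not the patent's stated procedure: it assumes a sample is called cancer when it carries at least T of the variants in the library selected by threshold I, and it scores each pair with the binary Rand index. The names `is_cancer` and `search_best`, the variant labels, and all data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def is_cancer(sample: Set[str], library: Set[str], t: int) -> bool:
    # Assumed decision rule (not stated in the source): a sample is called
    # cancer when it carries at least t of the library's variants.
    return len(sample & library) >= t


def search_best(
    degrees: Dict[str, float],
    cancer: List[Set[str]],
    normal: List[Set[str]],
    candidates: List[float],
) -> Tuple[float, Optional[float], Optional[int]]:
    """Return (best accuracy, threshold I, mutation count T) over the grid."""
    if not cancer and not normal:
        raise ValueError("at least one sample is required")
    best: Tuple[float, Optional[float], Optional[int]] = (-1.0, None, None)
    for i in candidates:
        library = {v for v, d in degrees.items() if d >= i}
        # T ranges from 1 up to the number of variants selected by I.
        for t in range(1, len(library) + 1):
            tp = sum(is_cancer(s, library, t) for s in cancer)
            fp = sum(is_cancer(s, library, t) for s in normal)
            fn = len(cancer) - tp
            tn = len(normal) - fp
            accuracy = (tp + tn) / (tp + tn + fp + fn)
            if accuracy > best[0]:
                best = (accuracy, i, t)
    return best


# Invented data: variant labels mapped to disease classification degrees.
degrees = {"a": 0.8, "b": 0.7, "c": 0.3}
cancer_samples = [{"a", "b"}, {"a", "b", "c"}, {"b"}]
normal_samples = [{"c"}, {"c"}, set()]
best = search_best(degrees, cancer_samples, normal_samples, [0.3, 0.7])
print(best)
```

An exhaustive grid is affordable only for small libraries; the strict comparison keeps the first pair found when several pairs tie.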
  • The base information that satisfies this condition can be used as cancer diagnosis markers.
  • A sample's genome information can then be used to diagnose cancer.
  • If the candidate set is reduced in stages, by checking and first considering only the most likely cases, the total number of cases to be investigated for the markers is reduced to N(N+1)/2.
  • The performance of the markers can be verified by applying the cancer diagnosis markers to cancer samples or normal samples that were not used for marker detection and calculating the classification accuracy.
  • The accuracy of the cancer diagnosis markers can be improved in this way. It is therefore desirable to use the genome base information of the cancer or normal samples used in validation, together with the results determined by the markers, as feedback information to improve accuracy.
  • The method for detecting cancer diagnosis markers may further include a step of extracting a target range for a specific cancer so that detection proceeds more quickly and accurately.
  • The target range extraction step is preferably performed after the whole genome sequencing information of the cancer sample and the normal sample has been analyzed against the reference genome information.
  • The reference genome information, the whole genome sequencing information of the cancer sample, and the whole genome sequencing information of the normal sample can each be divided into predetermined ranges, as shown in the figure.
  • The divided whole genome sequencing information of the samples can be compared to determine the genome ranges in which variation occurs.
  • The target genome range for a specific cancer can be extracted by setting those genome ranges as the target. It is desirable, though not required, to evaluate the divided whole genome sequencing information of the normal sample relative to the reference genome information.
  • The whole genome base information includes not only genome changes caused by a particular cancer but also inherent base sequence variation that is not caused by the cancer, so it is desirable to extract the target genome range.
  • The reference genome information shown in the upper part of FIG. 1 is generally stored in advance and has a length of about 3 Gbp.
  • The numbers in the top row represent position information in the reference genome.
  • The base sequence shown in black below them is the base sequence of the reference genome.
  • The genome sequencing information of the sample consists of base sequence fragments tens to hundreds of bases long, each placed at the position where it most probably belongs when compared with the reference genome information. On average there are 30 to 40 candidate fragments per position. The whole genome sequencing data is therefore typically 30 to 40 times the size of the reference genome information, around 100 Gbytes, though this depends on the sequencing method.
  • The size of the sample genome sequencing information is 100 Gbytes.
  • The base sequence change rate can be defined as the number of bases in a divided genome portion that differ from the reference genome information, divided by the length of that portion. The sequencing variation reliability (QUAL) shown in the illustrated sequencing information can also be used: it estimates the degree of binding in the chemical reaction of the sequence fragments, and the change rate can be defined with this taken into account.
  • Alternatively, the change rate can be defined by calculating the correlation between the reference genome and the cancer sample and normal sample genomes over the divided genome range.
  • To calculate the correlation, the sequence is cut into words of a fixed length. One option is to examine the frequency of the words, or the frequency of the intervals at which words of that length appear, and use the correlation of the resulting probability density functions (PDFs). Another is to calculate the transition probabilities between words of that length and use the correlation between the states of the transition diagram.
  • It is desirable to define as the target genome range those divided genome portions in which the base sequence change rate of the cancer sample genome sequencing information is large compared with that of the normal sample genome sequencing information.
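
The window-based selection described above can be sketched as follows. The code assumes the change rate is the number of variant positions in a window divided by the window length, and that a window becomes a target when the cancer rate exceeds the normal rate by more than a chosen margin. `change_rate`, `target_windows`, the `min_excess` parameter, and the positions are all hypothetical.

```python
from typing import List, Tuple


def change_rate(variant_positions: List[int], start: int, end: int) -> float:
    """Variants per base inside the half-open window [start, end)."""
    length = end - start
    if length <= 0:
        raise ValueError("window must have positive length")
    count = sum(1 for p in variant_positions if start <= p < end)
    return count / length


def target_windows(
    cancer_positions: List[int],
    normal_positions: List[int],
    genome_length: int,
    window: int,
    min_excess: float,
) -> List[Tuple[int, int]]:
    """Windows where the cancer change rate exceeds the normal rate by more than min_excess."""
    targets = []
    for start in range(0, genome_length, window):
        end = min(start + window, genome_length)
        excess = change_rate(cancer_positions, start, end) - change_rate(
            normal_positions, start, end
        )
        if excess > min_excess:
            targets.append((start, end))
    return targets


# Invented variant positions relative to the reference genome.
cancer_positions = [5, 12, 15, 18, 33]
normal_positions = [5, 33]
targets = target_windows(cancer_positions, normal_positions, 40, 10, 0.1)
print(targets)
```

Variants shared by cancer and normal samples cancel out in the difference, which matches the idea that shared variation is unlikely to be cancer-specific.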
  • Target range extraction selects the meaningful portion of the whole genome as the target genome range for a specific cancer.
  • The genome can be divided into gene and non-gene portions based on the position information of the genes.
  • The whole genome is composed of 23 chromosomes, and each chromosome is composed of gene portions and non-gene portions.
  • Target range extraction divides the reference genome information, the genome sequencing information of cancer samples, and the genome sequencing information of normal samples based on gene positions.
  • Genes are numbered in order along the chromosome. The non-gene portion before gene 1 is defined as pre-1, the non-gene portion between gene 1 and gene 2 as pre-2, and so on, and the non-gene portion after the last gene is defined as last. In this way every part of the genome can be divided.
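
The gene-based division can be sketched as a single pass over the gene coordinates of one chromosome. The labels pre-N and last follow the naming in the text; the `gene-N` label, the function name `split_by_genes`, and the coordinates are assumptions made for this example, and genes are taken to be sorted, non-overlapping, half-open intervals.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), end exclusive


def split_by_genes(chrom_length: int, genes: List[Tuple[int, int]]) -> List[Segment]:
    """Split one chromosome into gene and non-gene segments.

    `genes` must be sorted, non-overlapping (start, end) pairs, end exclusive.
    """
    segments: List[Segment] = []
    cursor = 0
    for n, (start, end) in enumerate(genes, start=1):
        if start > cursor:
            # Non-gene portion in front of gene n.
            segments.append((f"pre-{n}", cursor, start))
        segments.append((f"gene-{n}", start, end))
        cursor = end
    if cursor < chrom_length:
        # Non-gene portion after the final gene.
        segments.append(("last", cursor, chrom_length))
    return segments


# Invented coordinates: a chromosome of length 100 with two genes.
segments = split_by_genes(100, [(10, 30), (50, 70)])
for label, start, end in segments:
    print(label, start, end)
```

Because the segments tile the chromosome without gaps, variants in non-gene regions stay available for marker detection.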
  • Because the cancer diagnosis marker detection method of the present invention is based on analysis of genome information, it can use base mutation information from non-gene portions as well as genes, and can therefore detect cancer diagnosis markers that differ from those found by conventional methods.
  • Gene boundaries, lengths, and so on may be based on previously known genetic analysis information.
  • Base sequence variation information can be determined by comparing the divided reference genome information with the base sequence information of the divided cancer sample or normal sample, in order to extract the target genome range of the cancer.
  • The divided genome sequencing information of the cancer sample or the normal sample is compared with the divided reference genome information, and only the portions where variation occurs are extracted.
  • The base sequence change rates are then compared to check how much more the cancer sample changes than the normal sample.
  • The set of genome portions in which the change in the cancer sample exceeds that in the normal sample by more than a predetermined rate is extracted and defined as the target genome range for the specific cancer.
  • The change rate can be defined by calculating the correlation between each divided portion of the reference genome and the corresponding portion of the cancer sample and normal sample genomes.
  • The correlation can be defined by cutting the base sequences into words of constant length and then either examining the frequency of the words, or the frequency of the intervals at which they appear, and using the correlation of the PDFs, or calculating the transition probabilities between words and using the state correlation of the transition diagram.
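
The word-frequency variant of the correlation can be sketched as below. The text does not say whether words overlap or which correlation measure is meant, so this example assumes overlapping words (a sliding window of step 1) and the Pearson correlation between the two frequency distributions. `word_pdf` and `pdf_correlation` are hypothetical names and the sequences are invented.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def word_pdf(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each length-k word (overlapping, step 1)."""
    if k <= 0 or len(seq) < k:
        raise ValueError("sequence must contain at least one word of length k")
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def pdf_correlation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Pearson correlation of two word distributions over the union of their words."""
    keys = sorted(set(p) | set(q))
    x = [p.get(w, 0.0) for w in keys]
    y = [q.get(w, 0.0) for w in keys]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant distribution")
    return cov / sqrt(var_x * var_y)


# Invented sequences: a reference segment and a very different sample segment.
reference = word_pdf("ACGTACGA", 2)
sample = word_pdf("AAAAAAAA", 2)
print(pdf_correlation(reference, reference), pdf_correlation(reference, sample))
```

A low correlation between a sample segment and the reference segment would indicate a high change rate under this reading.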
  • Cancer-specific genome changes can be extracted by comparing and analyzing the position information and variation information of the base sequences within the target genome range defined for the specific cancer.
  • The cancer diagnosis marker detection method described above is explained again below using the flow chart of Fig. 4 and a specific example. The description is one example intended to aid understanding of the present invention; some data processing and sample information may be omitted, and arbitrary values are used.
  • The method for detecting cancer-specific diagnosis markers in the genome includes an information input step (S100), a target range extraction step (S200), a comparison analysis step (S300), a library construction step (S400), and a marker detection step (S500), and may take the form of a program executed by arithmetic processing means including a computer. When the whole genome sequencing information of the cancer sample and the normal sample is collected and input directly, it is recommended that the target range extraction step (S200) and the comparison analysis step (S300) be performed in reverse order.
  • Whole genome sequencing information of cancer samples and of normal samples is input. For example, the whole genome sequencing information of cancer samples (such as blood cancer, stomach cancer, or liver cancer) and normal samples certified by the National Institutes of Health (NIH) can be received and input; the number of samples can be selected, and the sequencing equipment and sequencing method can be checked.
  • BAM (binary alignment map)
  • The target range extraction step (S200) uses the reference genome stored in advance to extract the target genome range for the specific cancer.
  • For example, the more than 2,000 genes known to have a high rate of change among cancer genes can be extracted as the target range, and the non-gene regions around those genes can also be extracted as target ranges for analysis.
  • The comparison analysis step (S300) compares and/or contrasts the whole genome sequencing information of the cancer sample or of the normal sample, within the target genome range for the specific cancer extracted in the target range extraction step (S200), to obtain analysis information (chromosome information (#CHROM), position information of the base within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)), from which the disease classification degree is derived.
  • The step analyzes the chromosome information, position information of the base within the chromosome, base sequence information, reliability, and disease classification degree information for the whole genome sequencing information of the cancer sample or normal sample that shows base sequence variation,
  • as well as the chromosome information, position information of the base within the chromosome, base sequence information, reliability, and disease classification degree information for the variations that appear in common across the whole genome sequencing information of the cancer samples.
  • A library construction step (S400) is then performed on the whole genome sequencing information of the cancer sample and the normal sample.
  • An arbitrary function is defined, as in [Equation I] or [Equation II] above, the disease classification degree (I) is derived for each base position and base variation from the analysis information and sample information, and the result is organized together with the analysis information as shown in [Table 6].
  • The library can then be built based on the disease classification degree value.
  • Determining disease is difficult when a library is constructed from analysis information corresponding to a single base mutation.
  • Analysis information selected at or above a specific disease classification degree value includes one or more base mutations, so a library combining multiple base mutations can be constructed to determine the presence of disease more accurately.
  • the analysis information is sorted according to the disease classification degree (I) after the library construction step (S400), and the predetermined number of base mutations is determined by specifying a predetermined number of base mutations (T) in the sorted analysis information.
  • the classification accuracy is obtained according to [Equation m].
  • TP is the number of cases where a cancer sample is classified as cancer
  • TN is the number of cases where a normal sample is classified as normal
  • FP is the number of cases where a normal sample is classified as cancer
  • FN is the number of cases where a cancer sample is classified as normal.
  • the analysis information having a disease classification ratio (I) of 0.56 in the library is shown in [Table 8].
  • the highest classification accuracy across the entire library can be obtained according to [Formula IV] to detect base information that can be used as a cancer diagnostic marker in the whole genome sequencing information.
  • T is the predetermined number of base variations; because it is also variable, it is written T*.
  • the maximum value of T is the total number of base variations included in the analysis information sorted according to I.
  • TP is the number of cases where a cancer sample is classified as cancer.
  • TN is the number of cases where a normal sample is classified as normal
  • FP is the number of cases where a normal sample is classified as cancer
  • FN is the number of cases where a cancer sample is classified as normal.
  • a cancer-specific diagnostic marker can be detected from whole genome sequencing information on cancer samples and normal samples, and the detected marker can be applied to a cancer diagnosis chip, cancer diagnosis kit, cancer diagnosis terminal device, or cancer diagnosis system. For example, the genome information of a sample can be acquired by a simple method such as blood collection and the cancer diagnostic marker then detected; if this can be applied to small medical products such as biochips, kits, terminal devices, and systems, it could have a large ripple effect in the molecular diagnostics healthcare industry.
  • the cancer-specific diagnostic marker detection method of the present invention can use genome sequence data obtained from actual cancer patients and normal subjects to compare and analyze the base sequence variation information and base sequence position information of cancer genomes and normal genomes.
  • the analysis information thus obtained can be used to evaluate cancer-specific composite genome information and derive cancer-specific diagnostic markers.
  • genome information can additionally be acquired over time to identify individual-specific genomic changes.
  • for example, as a patient's disease progresses or is treated, genome information can be acquired at intervals and analyzed to map changes in the disease to changes in the genome.
  • sample information can be collected from a diseased region and a non-diseased region of one patient, and the genome information of the two samples analyzed, so that the specific genomic variation seen in the diseased sample is easily obtained.
  • the present invention relates to a method for detecting cancer-specific diagnostic markers in the genome and, more specifically, can detect cancer-specific genomic changes by identifying the relationship between cancer and genomic variation.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

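The present invention relates to a method for detecting a cancer-specific diagnostic marker in the genome and, more specifically, to a method that can detect a highly accurate cancer-specific biomarker by identifying the relationship between cancer and genomic variation and detecting cancer-specific genomic changes.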

Description

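Specification
Title of Invention: Detection of cancer-specific diagnostic markers in genome
Technical Field
[1] The present invention is a technique for deriving cancer-specific diagnostic markers from base analysis information of cancer genomes.
Background Art
[2] It is becoming clear that the genome shows changes specific to particular diseases. Until now, however, genome analysis research has centered on the protein-coding genes, which account for only about 1.2% of the whole genome.
[3] Gene-centered research has produced many results through bioinformatic analysis, but it is clear that these results are limited in explaining the many diseases. A comprehensive, structural analysis of the part of the genome outside the genes is therefore needed to complement them.
[4] Most genome association analyses carried out by research groups to select disease diagnostic markers target exome sequencing, which analyzes gene expression (about 1% of the genome), or single nucleotide polymorphisms within a genome population (about 0.06% of the genome). Technical trends in disease diagnosis to date show research proceeding by using human shared polymorphisms of specific genes (single nucleotide polymorphisms, copy number polymorphisms) or expression information across gene populations to find genes associated with a specific disease and to study their functions.
[5] In particular, many diagnostic technologies are being developed that use an individual's genetic characteristics, gene expression, and nucleotide polymorphisms.
[6] However, most diagnostic technologies to date target a very limited number of genes and therefore apply only to certain specific diseases. Moreover, because the derivation of disease diagnostic markers is based on genes and their protein products, it is inaccurate.
[7] International Publication No. 2014-052909 discloses a method of diagnosing disease by jointly considering an individual's phenotype information and genetic variation, using a database that contains disease, clinical, and genetic information. It provides a system that diagnoses disease by linking gene-range sequence variations with a patient's clinical information, and can be regarded as identifying the association between disease and genetic information at high resolution.
[8] However, although International Publication No. 2014-052909 determines disease from variations occurring in the genome together with clinical information, it is confined to part of the genetic information and is therefore limited in analyzing whole genome information. In addition, its disease classification algorithm uses a simple structure that assigns a truth value to each sequence variation to check its importance, which makes precise diagnosis through combinations of multiple sequence variations difficult.
[9] [Prior Art Documents]
[10] (Patent Document 1) International Publication WO2014-052909 (published 2015.07.30.)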
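Detailed Description of the Invention
Technical Problem
[11] The present invention provides a highly accurate method for detecting cancer-specific diagnostic markers by analyzing cancer-specific genomic changes and identifying the relationship between cancer and genomic variation.
Means for Solving the Problem
[12] The present invention provides a cancer diagnostic marker detection method implemented as a program executed by a processing means including a computer, the method comprising: inputting whole genome sequencing information of cancer samples and normal samples; obtaining analysis information by comparing and/or contrasting the whole genome sequencing information with reference genome sequence information; deriving a disease classification ratio from the analysis information and sample information; building, using the disease classification ratio, a library of cancer-specific base sequence information from the whole genome sequencing information of the cancer samples and normal samples; and deriving a classification accuracy from the built library with the disease classification ratio and the number of mutated bases as variables.
Effects of the Invention
[13] The cancer diagnostic marker detection method of the present invention uses genome sequence information obtained from actual cancer patients and normal subjects to analyze the base sequence variation information and base sequence position information that appear in cancer genomes and normal genomes relative to the reference genome information, and can detect cancer-specific diagnostic markers by evaluating cancer-specific composite genome information.
[14] In addition to genome sequence data obtained from actual cancer patients and normal subjects, previously known cancer genomes can also be analyzed to detect cancer-specific diagnostic markers by evaluating cancer-specific composite genome information.
[15] Furthermore, a library built from the variation information and position information of genome base sequences makes it easy to analyze combined variations, so that highly accurate cancer-specific diagnostic markers can be detected.
[16] Moreover, cancer diagnostic markers detected according to the present invention can readily be applied across medical and pharmaceutical technologies such as biochips, precision diagnostic systems, kits, and medical devices.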
[17]
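Brief Description of the Drawings
[18] Fig. 1 is an exemplary diagram showing the form of the reference genome information and the whole genome sequencing information of a sample used in the cancer diagnostic marker detection method according to the present invention.
[19] Fig. 2 is an exemplary diagram showing the result information obtained by comparing and/or contrasting the reference genome information with the whole genome sequencing information of a sample in the cancer diagnostic marker detection method according to the present invention.
[20] Fig. 3 is an exemplary diagram showing the genome segmentation process of the target range extraction step in the cancer diagnostic marker detection method according to the present invention.
[21] Fig. 4 is an exemplary diagram of building a library in the cancer diagnostic marker detection method according to the present invention.
[22] Fig. 5 is a flowchart showing one embodiment of the cancer diagnostic marker detection method according to the present invention.
[23] Fig. 6 is an exemplary diagram of diagnosing whether an arbitrary sample has cancer using a marker detected by the cancer diagnostic marker detection method according to the present invention.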
[24]
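Best Mode for Carrying Out the Invention
[25] The present invention is described in detail below.
[26] Unless otherwise defined, the terms used in this specification should be interpreted as generally understood by a person of ordinary skill in the art.
[27] The drawings and embodiments of this specification are intended to help a person skilled in the art readily understand and practice the present invention, and the invention is not limited to them. Content in the drawings and embodiments that might obscure the gist of the invention may be omitted or exaggerated.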
[28]
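[29] The present invention relates to a method for deriving or detecting cancer-specific diagnostic markers based on genome information analysis.
[30] Based on whole genome sequencing data, the present invention compares, analyzes, and discriminates genome information related to general life phenomena and to disease, thereby aiding the understanding of genome function and, further, enabling the detection of precise cancer diagnostic markers.
[31] To derive cancer-specific diagnostic markers, the present invention applies information and communication technologies such as big data processing to the vast amount of genome information in order to store, interpret, analyze, and discriminate it.
[32] Overall, the cancer diagnostic marker detection method of the present invention proceeds as follows. First, whole genome (total genome) sequence information is obtained for cancer and normal samples (specimens), and analysis information including the base variations and positions of the cancer and normal samples relative to a reference genome is obtained. From this analysis information, a library is built that contains the base variation and position information expected to reflect cancer-specific genomic changes. Cancer-specific diagnostic markers are then derived by analyzing the library.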
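[33] More specifically, the cancer diagnostic marker detection method of the present invention is as follows.
[34] The present invention provides a cancer diagnostic marker detection method implemented as a program executed by a processing means including a computer, the method comprising: inputting whole genome sequencing information of cancer samples and normal samples; obtaining analysis information by comparing and/or contrasting the whole genome sequencing information with reference genome sequence information; deriving a disease classification ratio from the analysis information and sample information; building, using the disease classification ratio, a library of cancer-specific base sequence information from the whole genome sequencing information of the cancer samples and normal samples; and deriving a classification accuracy from the built library with the disease classification ratio and the number of mutated bases as variables.
[35] Each step is described in detail below.
[36] The step of inputting whole genome sequencing information of cancer samples and normal samples is described in detail.
[37] In this step, information on the entire genome of the cancer samples and normal samples can be obtained.
[38] Whole genome sequencing information for cancer and normal samples can be obtained from a genetic information database, for example from the whole genome sequence information that The Cancer Genome Atlas (TCGA) of the National Institutes of Health (NIH) certifies and provides for each disease. It can also be obtained by commissioning a sequencing company to sequence samples from actual patients, collected at a hospital or directly. Alternatively, in some cases, sequencing information for the exome, the set of exons that play a direct role in protein synthesis within genes (whole exome sequence), may be obtained and used.
[39] The whole genome sequencing information of the samples may vary somewhat depending on the genetic information database, the sequencing instrument, the sequencing method, and so on.
[40] When obtaining whole genome sequencing information, it is preferable to use the human genome map established by the Human Genome Project as the reference.
[41] The whole genome sequencing information of the cancer and normal samples is the foundation of the cancer diagnostic marker detection method according to the present invention; the subsequent steps proceed on the basis of the differences in genomic characteristics of the samples contained in it.
[42] Among the information contained in the whole genome sequencing information, the chromosome information, the position information of the base sequence within the chromosome, the base sequence variation information, and the reliability information in particular can serve as important information for detecting cancer diagnostic markers.
[43] Depending on the program used for the analysis, items may be added to or removed from the analysis of the whole genome sequencing information.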
[44]
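[45] The step of obtaining analysis information by comparing and/or contrasting the whole genome sequencing information with the reference genome sequence information is described in detail.
[46] In this step, information specific to the genomes of the samples can be obtained. Examples include information on genome sequence variations, and combinations of them, that appear in common in cancer samples; information on genome sequence variations, and combinations of them, that appear in common in normal samples; information on genome sequence variations, and combinations of them, that appear in common in both cancer and normal samples; and information on genome sequences that are common to the cancer samples, the normal samples, and the reference genome.
[47] The reference genome sequence information can be obtained from the human genome map produced by the Human Genome Project and basically includes the chromosome, the position of the base sequence within the chromosome, and the base sequence information.
[48] By analyzing the whole genome sequencing information against the reference genome sequence information, one can obtain, for the genomes of the cancer and normal samples, the chromosome in which a sequence variation occurred, the position of the base sequence within the chromosome, the base sequence of the reference genome, the base sequence of the sample genome, and the reliability of each piece of sequence information. These can be used as important information for detecting cancer diagnostic markers.
[49] Whole genome sequencing information takes the form of sequence fragments aligned against the reference genome (see Figs. 1 and 2), and the genome cannot be analyzed in this form as it stands. The analysis of the whole genome sequencing information and the reference genome sequence information can therefore be performed with a genome analysis program, for example open-source programs such as SAMtools (Sequence Alignment/Map tools) and BCFtools. Because data processing and analysis results may differ by program, the terms "base sequence" and "base" may be used interchangeably in this specification.
[50] The analysis information can be converted into a uniform platform, that is, a common format, and stored and managed in that form.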
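[51] Among the analysis information, the chromosome information (Chromosome, #CHROM), the position information of the base sequence (base) within the chromosome (position, POS), the base sequence (base) information of the reference genome (reference, REF), the base sequence (base) information of the sample genome (alternation, ALT), and the reliability (quality, QUAL) are used as important information for detecting cancer diagnostic markers. Of these, information on the parts of a cancer or normal sample whose base sequence (base) differs from the reference genome, that is, the parts where a base variation has occurred, is especially important for detecting cancer diagnostic markers. Per-sample sequence position information and sequence variation information can also be obtained and used as needed.
[52] The information on a part where a base variation has occurred is described in terms of the chromosome information (#CHROM), the position information of the base sequence (base) within the chromosome (POS), the base sequence (base) information of the reference genome (REF), and the base sequence (base) information of the sample genome (ALT) as follows. The chromosome information (#CHROM) is the chromosome in which a base sequence (base) variation occurred when the whole genome sequencing information of the cancer or normal sample was compared and/or contrasted with the reference genome information. The position information (POS) is the position of the mutated base sequence (base) within the chromosome given by #CHROM. The reference genome base sequence (base) information (REF) is the base sequence (base) of the reference genome at the same position as POS. The sample genome base sequence (base) information (ALT) is the base sequence (base) present in the sample at the position given by POS.
[53] Referring to Fig. 2, the first line of the Fig. 2 data lists the chromosome information (#CHROM), the position information of the base sequence within the chromosome (POS), the base sequence information of the reference genome (REF), the base sequence information of the sample genome (ALT), and the reliability (QUAL). The second line gives the values of #CHROM, POS, REF, ALT, and QUAL. Specifically, at position 109 (POS) of chromosome 1 (#CHROM), the base in the reference genome sequence information is 'A' (REF), whereas the base in the cancer sample and/or normal sample is 'T' (ALT), so a base variation can be judged to have occurred; the reliability of this base variation is 58% (QUAL).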
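The record layout described for Fig. 2 can be illustrated with a minimal Python sketch. It assumes a simplified, tab-separated line holding only the five fields named in the text (#CHROM, POS, REF, ALT, QUAL); the names `VariantRecord`, `parse_record`, and `is_variant` are hypothetical, and real variant-call files produced by tools such as BCFtools carry additional columns that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    chrom: str   # chromosome (#CHROM)
    pos: int     # position within the chromosome (POS)
    ref: str     # base in the reference genome (REF)
    alt: str     # base observed in the sample (ALT)
    qual: float  # reliability of the call (QUAL)


def parse_record(line: str) -> VariantRecord:
    """Parse one simplified tab-separated line: #CHROM, POS, REF, ALT, QUAL."""
    chrom, pos, ref, alt, qual = line.rstrip("\n").split("\t")
    return VariantRecord(chrom, int(pos), ref, alt, float(qual))


def is_variant(record: VariantRecord) -> bool:
    """A base variation is present when the sample base differs from the reference."""
    return record.ref != record.alt


# The example from the text: chromosome 1, position 109, A -> T, QUAL 58.
record = parse_record("1\t109\tA\tT\t58")
print(record, is_variant(record))
```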
[54]
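[55] The step of deriving the disease classification ratio (Classification Ratio, CR) from the analysis information and the sample information is described in detail.
[56] In this step, the disease classification ratio needed to build a cancer-specific base sequence library can be derived.
[57]
[58] The analysis information is at least one of the chromosome information (#CHROM), the position information of the base sequence within the chromosome (POS), the base sequence information of the reference genome (REF), the base sequence information of the sample genome (ALT), and the reliability (QUAL) information obtained by comparing and/or contrasting the whole genome sequencing information of the cancer and normal samples with the reference genome information.
[59] The sample information is at least one of the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples in which the base variation occurred, the number of cancer samples in which it did not occur, the number of normal samples in which it occurred, and the number of normal samples in which it did not occur.
[60] The disease classification ratio is obtained by identifying the base sequence variations (or base variations) of the cancer and/or normal samples from the analysis information and, for each base sequence variation (or base variation), evaluating an arbitrary function whose parameters are the sample information.
[61] When deriving the disease classification ratio, it is preferable that a sufficient number of cancer and normal samples has been secured, and to assume that the numbers of the two kinds of sample do not differ greatly.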
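[62] As one example, the disease classification ratio can be derived according to [Formula I] or [Formula II] below.
[63] [Formula I]
[64]
$$\mathrm{CR} = \frac{\text{number of cancer samples with the base variation}}{\text{total number of cancer samples}} \times \frac{\text{number of normal samples without the base variation}}{\text{total number of normal samples}}$$
[65] [Formula II]
[66]
$$\mathrm{CR} = \frac{\text{number of cancer samples with the base variation} + \text{number of normal samples without the base variation}}{\text{total number of samples}}$$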
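[67] However, since the disease classification ratio is used to build a library of cancer-specific base sequence information from the cancer and normal samples, the function used to obtain its value may vary widely with the details, form, size, and so on of the library to be built.
[68] That is, the function for deriving the disease classification ratio may be chosen freely by the practitioner of the invention according to the analysis information and the sample information, and is not limited to [Formula I] or [Formula II] above.
[69] A new disease classification ratio may also be derived and used from the derived disease classification ratio, the analysis information, and the sample information.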
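[70] Taking the analyzed information of Fig. 2 described above, the disease classification ratio is obtained according to [Formula I], whose parameters are the number of cancer samples in which the base variation occurred, the number of normal samples in which it did not occur, and the sample totals, as follows. Suppose the base variation is at position 109 of chromosome 1, the base in the reference genome information is 'A', and the base in the sample is 'T'. If the proportion of cancer samples with this variation is 35/50 (35 of 50 cancer samples) and the proportion of normal samples with it is 20/50 (20 of 50 normal samples), then the proportion of normal samples without the variation is 30/50, and by [Formula I] the disease classification ratio at position 109 of chromosome 1 is (35/50) × (30/50) = 0.42.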
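As a worked check of the two example formulas, the following Python sketch evaluates [Formula I] and [Formula II] for the counts in this paragraph (35 of 50 cancer samples and 20 of 50 normal samples carry the variation). The function names are hypothetical, and the sketch simply follows the formulas as written, which use the number of normal samples without the variation.

```python
def classification_ratio_product(cancer_mut, cancer_total, normal_mut, normal_total):
    """[Formula I]: mutated cancer fraction times unmutated normal fraction."""
    normal_unmut = normal_total - normal_mut
    return (cancer_mut / cancer_total) * (normal_unmut / normal_total)


def classification_ratio_pooled(cancer_mut, cancer_total, normal_mut, normal_total):
    """[Formula II]: (mutated cancer + unmutated normal) over all samples."""
    normal_unmut = normal_total - normal_mut
    return (cancer_mut + normal_unmut) / (cancer_total + normal_total)


# Counts from the example at chromosome 1, position 109.
cr_formula_1 = classification_ratio_product(35, 50, 20, 50)  # 0.7 * 0.6
cr_formula_2 = classification_ratio_pooled(35, 50, 20, 50)   # 65 / 100
# Note: multiplying by the mutated normal fraction (20/50) instead of the
# unmutated one (30/50) would give 0.7 * 0.4 = 0.28.
print(round(cr_formula_1, 2), round(cr_formula_2, 2))
```

Both functions reward variations that are frequent in cancer samples and rare in normal samples, which is the property the text relies on when ranking base variations.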
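[71] In detecting cancer-specific diagnostic markers in the genome, if a sequence variation found in the genome sequencing information of cancer samples, relative to the reference genome information, also occurs identically in the genome sequencing information of normal samples, it is unlikely to be a cancer-specific change. Therefore, the number of cancer samples in which the base variation occurred and the number of normal samples in which it did not occur can serve as particularly important parameters of the disease classification ratio.
[72] It is also preferable to identify genome positions at which a sequence variation appears in common in the cancer samples but does not appear in the normal samples, and to extract the position information and variation information of those base sequences.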
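[74] The step of building, using the disease classification ratio, a library of cancer-specific base sequence information from the whole genome sequencing information of the cancer and normal samples is described in detail.
[75] In this step, a library can be built that contains the cancer-specific base sequence variation information targeted by markers for cancer diagnosis. Furthermore, using the information in each library, one can derive the specific number of base sequence variations at which the probability of identifying cancer is highest.
[76] The library of cancer-specific base sequence information can be built on the basis of the disease classification ratio. Preferably, the disease classification ratio is derived, and the set of analysis information (chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)) whose ratio is at or above a specific value is defined as the library for that disease classification ratio.
[77] That is, a library of cancer-specific base sequence information can be the set of analysis information sorted out of the whole analysis information by a specific disease classification ratio.
[78] Fig. 4 shows an example of building libraries: after the disease classification ratio is derived from the analysis information and sample information, a library for ratio values of 0.7 or higher (left) and a library for ratio values of 0.6 or higher (right) can be built. In this way, after deriving the disease classification ratio, a specific ratio value is chosen and the set of analysis information (chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)) at or above that value is defined as the library for that value, giving a library of cancer-specific base sequence information. A library built in this way can be viewed as the set of analysis information at or above a specific disease classification ratio value, and the analysis information differs for each such value.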
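The thresholding described for Fig. 4 can be sketched in a few lines of Python. The records and their ratio values below are made-up illustrations, and `build_library` is a hypothetical helper, not a function of any tool named in the text; each record is a tuple of (#CHROM, POS, REF, ALT, QUAL, disease classification ratio).

```python
def build_library(records, min_ratio):
    """Keep records whose disease classification ratio is at or above min_ratio,
    sorted from the highest ratio to the lowest."""
    selected = [record for record in records if record[5] >= min_ratio]
    return sorted(selected, key=lambda record: record[5], reverse=True)


# Hypothetical analysis information: (#CHROM, POS, REF, ALT, QUAL, ratio).
records = [
    ("1", 109, "A", "T", 58.0, 0.72),
    ("1", 350, "C", "G", 61.0, 0.65),
    ("2", 77, "G", "C", 40.0, 0.61),
    ("3", 12, "T", "A", 75.0, 0.35),
]

library_07 = build_library(records, 0.7)  # stricter threshold, fewer records
library_06 = build_library(records, 0.6)  # looser threshold, more records
print(len(library_07), len(library_06))
```

With a ratio of this kind, raising the threshold can only shrink the library, so the library for 0.7 is always contained in the library for 0.6.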
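[79] When building libraries in this way, it is preferable, though not required, that the higher the disease classification ratio value, the less analysis information the library contains. For example, when the ratio is derived by [Formula I] or [Formula II] described above, the higher the chosen ratio value, the less analysis information is included in the library. To derive such a ratio, it is preferable that the sample information parameters of the deriving function be the number of cancer samples in which the base variation occurred and the number of normal samples in which it did not occur.
[80] And because the disease classification ratio is derived for each base position and base variation in the analysis information, it is preferable to build the library by setting a range based on the ratio, such as at or above, or at or below, a specific value.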
[81]
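[82] The step of deriving the classification accuracy from the built library, with the disease classification ratio and the number of mutated bases as variables, is described in detail.
[83] In this step, the set of analysis information and the base variation information most likely to serve as a cancer diagnostic marker can be obtained and used as the marker.
[84] The whole genome sequencing analysis information sorted into the library changes with the disease classification ratio, and when a predetermined number of base variations is set arbitrarily within the sorted analysis information, the classification accuracy for cancer and normal samples changes with that number. The higher the classification accuracy, the more the information can be regarded as cancer-specific base sequence information. Therefore, by calculating the classification accuracy of the samples with the disease classification ratio and the predetermined number of base variations as variables, the base variation information best suited as a cancer diagnostic marker can be obtained from the whole genome sequencing information.
[85] With the disease classification ratio and the predetermined number of base variations as variables, the normal-versus-disease sample classification accuracy can be obtained by applying the Rand measure (Rand index) as the objective function, and a numerical analysis program such as MATLAB (matrix laboratory) can be used to derive the maximum classification accuracy of the library for each disease classification ratio and predetermined number of base variations.
[86] Specifically, when a disease classification ratio (I) is chosen and a predetermined number of base variations (T) is set arbitrarily in the analysis information sorted by it, the classification accuracy for T at ratio I can be obtained according to [Formula III] below.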
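[87] [Formula III]
[88]
$$\text{Classification accuracy}(I, T) = \frac{TP + TN}{TP + FP + TN + FN}$$
[89] (Here, I is the disease classification ratio, T is the predetermined number of base variations set in advance, TP is the number of cases in which a cancer sample is classified as cancer, TN is the number of cases in which a normal sample is classified as normal, FP is the number of cases in which a normal sample is classified as cancer, and FN is the number of cases in which a cancer sample is classified as normal.)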
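The accuracy in [Formula III] can be computed directly from the four counts it defines. The Python sketch below is illustrative only: the function names and the six example labels are assumptions, chosen to show how TP, TN, FP, and FN are tallied before the ratio is taken.

```python
def confusion_counts(actual_is_cancer, predicted_is_cancer):
    """Tally TP, TN, FP, FN from parallel lists of booleans."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(actual_is_cancer, predicted_is_cancer):
        if actual and predicted:
            tp += 1  # cancer sample classified as cancer
        elif not actual and not predicted:
            tn += 1  # normal sample classified as normal
        elif not actual and predicted:
            fp += 1  # normal sample classified as cancer
        else:
            fn += 1  # cancer sample classified as normal
    return tp, tn, fp, fn


def classification_accuracy(tp, tn, fp, fn):
    """[Formula III]: (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


# Hypothetical labels for three cancer and three normal samples.
actual = [True, True, True, False, False, False]
predicted = [True, True, False, False, False, True]
counts = confusion_counts(actual, predicted)
accuracy = classification_accuracy(*counts)
print(counts, accuracy)
```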
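[90] Furthermore, the disease classification ratio (I) and the predetermined number of base variations (T) that give the highest classification accuracy in the library can be obtained according to [Formula IV] below.
[91] [Formula IV]
$$(I^*, T^*) = \underset{I,\,T}{\arg\max}\ \frac{TP + TN}{TP + FP + TN + FN}$$
[93] (Here, I is the disease classification ratio; because it can vary, its selected value is written I*.
[94] T is the predetermined number of base variations set in advance; because it can also vary, its selected value is written T*. The maximum value of T is the total number of base variations included in the analysis information sorted according to I.
[95] TP is the number of cases in which a cancer sample is classified as cancer,
[96] TN is the number of cases in which a normal sample is classified as normal,
[97] FP is the number of cases in which a normal sample is classified as cancer, and
[98] FN is the number of cases in which a cancer sample is classified as normal.)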
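The search in [Formula IV] can be sketched as an exhaustive scan over candidate ratio thresholds and variation counts. This Python sketch makes several assumptions that are not stated in the text: a sample is predicted as cancer when it carries at least T of the variations in the library for threshold I, ties keep the first pair found, and all identifiers and data are hypothetical.

```python
def best_ratio_and_count(variant_ratio, samples, candidate_ratios):
    """Return (I*, T*, accuracy) maximizing (TP + TN) / (TP + FP + TN + FN).

    variant_ratio: dict mapping a variant id to its disease classification ratio.
    samples: list of (set of variant ids present, is_cancer) pairs.
    """
    best = (None, None, -1.0)
    for ratio in candidate_ratios:
        library = {v for v, cr in variant_ratio.items() if cr >= ratio}
        # T ranges from 1 up to the number of variations in the library.
        for count in range(1, len(library) + 1):
            correct = 0
            for present, is_cancer in samples:
                predicted_cancer = len(present & library) >= count
                correct += predicted_cancer == is_cancer
            accuracy = correct / len(samples)
            if accuracy > best[2]:
                best = (ratio, count, accuracy)
    return best


# Hypothetical ratios and samples.
variant_ratio = {"v1": 0.8, "v2": 0.7, "v3": 0.6, "v4": 0.3}
samples = [
    ({"v1", "v2", "v3"}, True),
    ({"v1", "v2"}, True),
    ({"v3", "v4"}, False),
    ({"v4"}, False),
]
result = best_ratio_and_count(variant_ratio, samples, [0.6, 0.7])
print(result)
```

The scan is quadratic in the library size per threshold, which is manageable here because T only counts variations rather than enumerating subsets of them.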
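[99] After the disease classification ratio and the predetermined number of base variations with the highest classification accuracy have been determined, the base information that satisfies them can be used as the cancer diagnostic marker. By comparing the base information chosen as the marker with the genome information of various samples, cancer can be diagnosed from the genome information of a sample alone.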
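[100] Taking Fig. 6 as an example: if analysis of the whole genome sequencing information of samples for a specific cancer shows the highest classification accuracy at a predetermined number of base variations (T) = 4 when the disease classification ratio (I) is 0.602 or higher, the base information in the library corresponding to I ≥ 0.602 and T = 4 can be chosen as the cancer diagnostic marker. According to this marker, a sample with 4 or more variations among the analysis information with I ≥ 0.602 can be regarded as having the specific cancer. To check the diagnosis on this basis, base variations are checked at the library positions in arbitrary samples 1, 2, and 3, and the positions where a variation occurred are marked. As a result, samples 1 and 2 each have 5 base variations and can be diagnosed as cancer, while sample 3 has 2 base variations and can be diagnosed as normal (see Fig. 6).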
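The decision rule in the Fig. 6 example (T = 4 within the library for I ≥ 0.602) amounts to counting how many library variations a sample carries. In the Python sketch below, the library entries and the sample contents are invented placeholders; only the counts (5, 5, and 2) and the threshold of 4 come from the text.

```python
def diagnose(sample_variants, library, min_variants):
    """Classify a sample as cancer when it carries at least min_variants
    of the variations listed in the library."""
    hits = len(sample_variants & library)
    return "cancer" if hits >= min_variants else "normal"


# Hypothetical library markers as (chromosome, position, ALT base).
library = {
    ("1", 109, "T"), ("1", 350, "G"), ("2", 77, "C"),
    ("3", 12, "A"), ("5", 901, "T"), ("7", 44, "G"),
}
samples = {
    "sample 1": {("1", 109, "T"), ("1", 350, "G"), ("2", 77, "C"),
                 ("3", 12, "A"), ("5", 901, "T")},
    "sample 2": {("1", 350, "G"), ("2", 77, "C"), ("3", 12, "A"),
                 ("5", 901, "T"), ("7", 44, "G")},
    "sample 3": {("1", 109, "T"), ("7", 44, "G"), ("9", 5, "A")},
}
results = {name: diagnose(variants, library, 4) for name, variants in samples.items()}
print(results)
```

Variations outside the library, such as the third entry of sample 3, do not count toward the threshold.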
[101]
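[102] When the library is large, calculating the classification accuracy for every subset is difficult and the complexity is high, so it is preferable to carry out a procedure that reduces the complexity.
[103] For a library of size N, there are 2^N possible subsets. As the library grows, calculating the classification accuracy for every subset therefore becomes difficult and the complexity rises, and a heuristic algorithm is needed to reduce the complexity.
[104] As one example, starting from a subset of size N, if the marker potential of each case is checked and only the most promising case is pursued, reducing the size of the set step by step, the total number of marker cases to examine falls to N(N+1)/2.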
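One way to read the stepwise reduction is as greedy backward elimination: evaluate the full set once, then repeatedly drop the single marker whose removal scores best. The Python sketch below is an interpretation under that assumption, with a made-up scoring function; it counts subset evaluations to show where N(N+1)/2 comes from (1 + N + (N-1) + ... + 2).

```python
def greedy_backward_selection(markers, score):
    """Shrink the marker set one element at a time, keeping the best-scoring
    subset seen. Returns (best_subset, best_score, evaluations)."""
    current = list(markers)
    best_subset, best_score = list(current), score(current)
    evaluations = 1  # the full set
    while len(current) > 1:
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        evaluations += len(candidates)
        current = max(candidates, key=score)  # pursue only the most promising case
        current_score = score(current)
        if current_score > best_score:
            best_subset, best_score = list(current), current_score
    return best_subset, best_score, evaluations


# Hypothetical score: reward informative markers, penalize subset size.
INFORMATIVE = {"a", "c"}


def toy_score(subset):
    return len(set(subset) & INFORMATIVE) - 0.1 * len(subset)


subset, value, evaluations = greedy_backward_selection(["a", "b", "c", "d"], toy_score)
print(subset, evaluations)
```

For N = 4 the sketch examines 10 subsets instead of the 16 an exhaustive search would need; the gap widens quickly as N grows, at the cost of possibly missing the global optimum.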
[105]
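[106] Furthermore, it is preferable to carry out an additional procedure to verify the performance of the finally derived cancer diagnostic marker.
[107] Specifically, the marker can be applied to cancer or normal samples that were not used in detecting it, and its performance verified by calculating the classification accuracy.
[108] In addition, because the accuracy of a cancer diagnostic marker can rise as more cancer and normal samples are used to detect it, it is preferable to use the genome sequencing information of the cancer or normal samples used for verification as feedback information for improving the marker's accuracy.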
[109]
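[110] To carry out the above cancer diagnostic marker detection method more quickly and accurately, a step of extracting a target range for a specific cancer may additionally be included.
[111] The target range extraction step is preferably performed after the whole genome sequencing information of the cancer and normal samples has been analyzed against the reference genome information.
[112] When the whole genome sequencing information of cancer and normal samples is obtained from a genetic information database, previously known cancer genes may instead be extracted as the target range before that analysis.
[113] Specifically, the reference genome information, the whole genome sequencing information of the cancer samples, and the whole genome sequencing information of the normal samples can each be divided into segments of a preset range, as in Fig. 3.
[114] The segmented whole genome sequencing information of the cancer samples can be compared with the segmented reference genome information to determine the genome ranges in which variations appear, and
[115] the segmented whole genome sequencing information of the normal samples can be compared with the segmented reference genome information to determine the genome ranges in which variations appear.
[116] When the change rate of a genome range in which variations appear is at or above a preset change rate, that range is designated a target genome range for the specific cancer and can be extracted as such. The preset change rate is preferably, though not necessarily, set by comparing the segmented whole genome sequencing information of the normal samples with the segmented reference genome information.
[117] In other words, because whole genome sequence information contains not only genomic changes caused by the specific cancer but also inherent sequence variations and sequences altered by causes other than cancer, it is preferable to extract the genome ranges that can be regarded as targets of the specific cancer.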
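[118] Here, whole genome sequence information is the result of comparing sequence fragments tens or hundreds of bases long with the reference genome information and placing each at its most probable position. The position of a base sequence is determined with respect to the reference genome information stored in advance.
[119] The reference genome information shown at the top of Fig. 1 is generally preferably stored in advance, and has a length of about 3 Gbp.
[120] The numbers at the very top indicate positions in the reference genome, and the base sequence shown in black beneath them is the base sequence of the reference genome.
[121] The sequence fragments shown in the black box at the bottom of Fig. 1 are, as described above, the whole genome sequence information of a sample: sequence fragments tens or hundreds of bases long, each placed at its most probable position relative to the reference genome information. On average, 30 to 40 candidate sequences exist per position. The whole genome sequence data is therefore 30 to 40 times the size of the reference genome information, generally around 100 Gbyte. This may, of course, vary with the sequencing method.
[122] Because the sample genome sequencing information is around 100 Gbyte in size, as described above, comparing and analyzing the whole genome has very high complexity and is difficult to implement in practice.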
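[123] Accordingly, the genome is segmented, and for each segment the cancer sample or normal sample genome sequencing information is compared with the reference genome information so that the base sequence change rate within each segment can be compared.
[124] Here, the base sequence change rate can be defined as the amount of base sequence variation within a genome segment, relative to the reference genome information, divided by the length of the segment. Alternatively, the base sequence variation reliability (QUAL) in the sequencing information can be used to estimate the degree of binding of the sequence fragments during the chemical reaction, and the change rate defined on that basis.
[125] The change rate can also be defined by calculating the correlation between the reference genome and the cancer and normal sample genomes within the segment. To define the correlation, the base sequence can be cut into words of a fixed length and either the frequency of the words or the frequency of the intervals at which words of that length appear examined, so that the correlation of the resulting probability distributions (PDFs) is used; or the transition probabilities of the fixed-length words can be calculated and the correlation between states of the transition diagram used.
[126] It is preferable to find the genome segments in which, relative to the reference genome, the base sequence change rate of the cancer sample genome sequencing information is larger than that of the normal sample genome sequencing information, and to define the set of these segments as the target genome range for the specific cancer.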
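The change-rate comparison can be sketched as follows, using the first definition in the text (number of variations in a segment divided by the segment length). The segment boundaries, the variant positions, and the `min_excess` margin are all hypothetical; the text only requires that the cancer change rate exceed the normal one, or a preset rate.

```python
def change_rate(variant_positions, start, end):
    """Variations per base within the half-open segment [start, end)."""
    count = sum(1 for position in variant_positions if start <= position < end)
    return count / (end - start)


def target_segments(segments, cancer_positions, normal_positions, min_excess=0.0):
    """Return names of segments whose cancer change rate exceeds the normal
    change rate by more than min_excess (an assumed tuning parameter)."""
    targets = []
    for name, start, end in segments:
        cancer_rate = change_rate(cancer_positions, start, end)
        normal_rate = change_rate(normal_positions, start, end)
        if cancer_rate - normal_rate > min_excess:
            targets.append(name)
    return targets


# Hypothetical segments and positions of variations relative to the reference.
segments = [("seg1", 0, 100), ("seg2", 100, 200)]
cancer_positions = [5, 20, 50, 150]
normal_positions = [10, 160]
targets = target_segments(segments, cancer_positions, normal_positions)
print(targets)
```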
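[127] Target range extraction extracts the meaningful parts of the whole genome as the target genome range for a specific cancer; based on the position information of the genes, the whole genome can be divided into gene parts and non-gene parts.
[128] Specifically, as currently known, the whole genome consists of 23 pairs of chromosomes, and each chromosome is composed of gene parts and non-gene parts.
[129] The number of genes is known to be about 25,000 to 30,000. It is preferable to include genes that are newly studied and added as well.
[130] Target range extraction segments the reference genome information, the genome sequencing information of the cancer samples, and the genome sequencing information of the normal samples according to gene positions.
[131] As shown in Fig. 3, segmentation by gene position assigns numbers to the genes of each chromosome in the order in which they are located, defines the non-gene part before gene 1 as pre-1, defines the non-gene part between gene 1 and gene 2 as pre-2, and defines the non-gene part after the last gene as last, so that every part of the genome is segmented.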
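The naming scheme for gene-based segmentation (pre-1, gene 1, pre-2, gene 2, ..., last) can be sketched in Python as below. The chromosome length and gene coordinates are invented, genes are assumed to be sorted and non-overlapping, and the `gene-k` label is a hypothetical name for the k-th gene segment.

```python
def segment_chromosome(chrom_length, genes):
    """Split [0, chrom_length) into gene and non-gene segments.

    genes: sorted, non-overlapping half-open (start, end) intervals.
    Returns a list of (label, start, end) covering the whole chromosome.
    """
    segments = []
    cursor = 0
    for index, (start, end) in enumerate(genes, start=1):
        if start > cursor:
            # non-gene part immediately before gene number `index`
            segments.append((f"pre-{index}", cursor, start))
        segments.append((f"gene-{index}", start, end))
        cursor = end
    if cursor < chrom_length:
        # non-gene part after the last gene
        segments.append(("last", cursor, chrom_length))
    return segments


# Hypothetical chromosome of length 1000 with two genes.
segments = segment_chromosome(1000, [(100, 200), (400, 600)])
print(segments)
```

Because every base falls into exactly one segment, non-gene regions are retained for analysis alongside the genes.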
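[132] Because the cancer diagnostic marker detection method of the present invention is based on the analysis of genome information, it can use base variation information from non-gene parts as well as from genes, and so detects cancer diagnostic markers in a way entirely different from existing methods.
[133] Gene boundaries, lengths, and the like may follow what has already been studied or is known, such as gene analysis information.
[134] After segmentation, the segmented genome sequencing information of the cancer or normal samples is compared with the segmented reference genome information, and the base sequence variation information is evaluated to extract the target genome range of the cancer.
[135] Specifically, the segmented genome sequencing information of the cancer or normal samples can be compared with the segmented reference genome information to extract only the parts where variation occurred. To do so, as described above, the base sequence change rates are compared segment by segment to check how much more change occurred in the cancer samples than in the normal samples; when the cancer samples show a change at or above a preset change rate relative to the normal samples, the set of corresponding genome segments is extracted and defined as the target genome range for the specific cancer.
[136] In addition, as described above, the change rate can be defined by calculating the correlation between the reference genome and the cancer and normal sample genomes within each segment.
[137] To define the correlation, the base sequence can be cut into words of a fixed length and either the frequency of the words or the frequency of the intervals at which words of that length appear examined, so that the correlation of the resulting probability distributions (PDFs) is used; or the transition probabilities of the fixed-length words can be calculated and the correlation between states of the transition diagram used.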
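The word-frequency variant of the correlation can be sketched as below. The text does not say which correlation measure is used, so the Pearson coefficient over the two word-frequency distributions is an assumption, as are the word length k, the function names, and the example sequences.

```python
from collections import Counter
from math import sqrt


def word_distribution(sequence, k):
    """Relative frequency of each overlapping word of length k."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def distribution_correlation(dist_a, dist_b):
    """Pearson correlation of two word-frequency distributions
    (assumed measure; returns 0.0 if either distribution is constant)."""
    words = sorted(set(dist_a) | set(dist_b))
    xs = [dist_a.get(word, 0.0) for word in words]
    ys = [dist_b.get(word, 0.0) for word in words]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


reference = "ACGTACGTAC"
same = distribution_correlation(word_distribution(reference, 2),
                                word_distribution(reference, 2))
changed = distribution_correlation(word_distribution(reference, 2),
                                   word_distribution("AAAAAAAAAC", 2))
print(same, changed)
```

A segment whose sample sequence correlates less strongly with the reference than the normal samples do would then count as having a higher change rate.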
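[138] Furthermore, cancer-specific genomic changes can be extracted by comparing and analyzing the position information and variation information of the base sequences defined as the target genome range for the specific cancer.
[140] The cancer diagnostic marker detection method described above is explained once more with reference to the flowchart of Fig. 5 and a concrete example. Because the following description is only one example intended to aid understanding of the invention, parts of the data processing performed by the program and of the sample information may be omitted, and arbitrary values may be used.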
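[141] As shown in Fig. 5, the method for detecting cancer-specific diagnostic markers in the genome according to the present invention may comprise an information input step (S100), a target range extraction step (S200), a comparative analysis step (S300), a library construction step (S400), and a marker detection step (S500), and may be implemented as a program executed by a processing means including a computer. When cancer and normal samples are collected directly and their whole genome sequencing information is input, it is preferable to perform the target range extraction step (S200) and the comparative analysis step (S300) in reverse order.
[142] In the information input step (S100), the whole genome sequencing information of the cancer samples and the whole genome sequencing information of the normal samples can be input. For example, whole genome sequencing information of cancer and normal samples for blood cancer, stomach cancer, and liver cancer can be received, with certification from the US National Institutes of Health (NIH), and input (the number of samples can be selected and the sequencing instrument and sequencing method checked). The input whole genome sequencing information may be whole genome sequence data downloaded in BAM (binary alignment map) form (see Fig. 2), or data already assembled against the reference genome.
[143] Next, in the target range extraction step (S200), the target genome range for a specific cancer can be extracted using the reference genome information stored in advance and the whole genome sequencing information of the cancer and normal samples input in the information input step (S100). For blood cancer, for example, some 2,000 of these cancer genes known to have a high change rate can be extracted as the target range, and the non-gene parts around those genes can also be extracted into the target range and analyzed together.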
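[144] Then, in the analysis step (S300), within the target genome range for the specific cancer extracted in the target range extraction step (S200), the whole genome sequencing information of the cancer samples or of the normal samples is compared and/or contrasted to obtain analysis information (chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)). Specifically, the chromosome information, position information of the base sequence within the chromosome, base sequence information, reliability, and disease classification ratio information are analyzed for the whole genome sequencing information of the cancer or normal samples in which base sequence variation appears,
[145] the chromosome information, position information of the base sequence within the chromosome, base sequence information, reliability, and disease classification ratio information are analyzed for variations that appear in common in the whole genome sequencing information of the cancer samples, and
[146] the chromosome information, position information of the base sequence within the chromosome, base sequence information, reliability, and disease classification ratio information are analyzed for positions at which no variation appears in common in the whole genome sequencing information of the normal samples, so that the variation information of cancer-specific genome base sequences can be stored and managed.
[147] In the analysis step, open-source genome analysis programs such as SAMtools and BCFtools can be used to classify and store the analysis information of the whole genome sequencing information as in [Table 1] to [Table 5] below. The analysis information is preferably combined and used as in [Table 5].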
[148]
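[149] [Table 1]
Example of analysis information obtained using SAMtools (QUAL values not shown)
Figure imgf000017_0002
[150]
[151] [Table 2]
Example 1 of analysis information obtained using BCFtools
Figure imgf000017_0001
[Table 3]
Example 2 of analysis information obtained using BCFtools
Figure imgf000018_0001
[Table 4]
Example 3 of analysis information obtained using BCFtools
Figure imgf000018_0002
[157] [Table 5]
Combined examples of analysis information obtained using BCFtools
Figure imgf000019_0001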
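[158] The disease classification ratio is derived from the analysis information obtained in the analysis step (S300) (chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), base sequence information of the sample genome (ALT), and reliability (QUAL)), and the library construction step (S400) is performed on the whole genome sequencing information of the cancer and normal samples on the basis of the disease classification ratio.
[159] An arbitrary function is defined as in [Formula I] or [Formula II] above, and the disease classification ratio (I) is derived for each base position and base variation from the analysis information and sample information; it can be appended to the analysis information as in [Table 6].
[160] [Table 6]
Disease classification ratio (I) added to the analysis information
Figure imgf000020_0001
[161] After the disease classification ratio has been derived, the library can be built on the basis of its value.
[162] Referring to [Table 6], if a library is built for each individual disease classification ratio value, it consists of analysis information for a single base variation, which makes it difficult to determine whether disease is present.
[163] By contrast, analysis information sorted at or above a specific disease classification ratio value contains one or more base variations, so a library combining multiple base variations can be built and disease can be determined more accurately.
[164] The analysis information at or above a specific disease classification ratio can be sorted as in [Table 7] to [Table 10], and a library can be built from these sets.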
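[165] [166] [Table 7]
Analysis information sorted for a disease classification ratio of 0.52 or higher
Figure imgf000021_0001
[167]
[168] [Table 8]
Analysis information sorted for a disease classification ratio of 0.56 or higher
Figure imgf000021_0002
[169] [Table 9]
Analysis information sorted for a disease classification ratio of 0.61 or higher
Figure imgf000022_0001
[171]
[172] [Table 10]
Analysis information sorted for a disease classification ratio of 0.62 or higher
Figure imgf000022_0002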
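[173] After the library construction step (S400), the sorted analysis information changes with the disease classification ratio (I); a predetermined number of base variations (T) is specified in the sorted analysis information, and the classification accuracy is obtained for each specified number according to [Formula III] below.
[174]
[175] [Formula III]
[176]
$$\text{Classification accuracy}(I, T) = \frac{TP + TN}{TP + FP + TN + FN}$$
[177] (Here, I is the disease classification ratio, T is the predetermined number of base variations set in advance, TP is the number of cases in which a cancer sample is classified as cancer, TN is the number of cases in which a normal sample is classified as normal, FP is the number of cases in which a normal sample is classified as cancer, and FN is the number of cases in which a cancer sample is classified as normal.)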
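[178] For example, when a library is built using a threshold (I ≥ 0.56) on the disease classification ratio derived for each base variation, the analysis information in the library for I = 0.56 is the analysis information sorted in [Table 8] above. In [Table 8], the analysis information for disease classification ratio (I) = 0.56, several predetermined numbers of base variations (T) such as T=10, T=20, and T=30 are specified, and the disease-sample classification accuracy can be obtained for each specified T.
[179] For I=0.56, T=10, classification accuracy: (TP+TN)/(TP+FP+TN+FN) = 0.75,
[180] For I=0.56, T=20, classification accuracy: (TP+TN)/(TP+FP+TN+FN) = 0.92,
[181] For I=0.56, T=30, classification accuracy: (TP+TN)/(TP+FP+TN+FN) =
[182]
[183] Since the classification accuracy of 0.92 at I=0.56, T=20 is the highest, at disease classification ratio (I) = 0.56 the base variation information for T=20 is the best suited for use as a cancer diagnostic marker.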
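[184] Following this approach, the highest classification accuracy across the entire library is obtained according to [Formula IV] below, so that the base information usable as a cancer diagnostic marker can be detected in the whole genome sequencing information.
[185] [Formula IV]
[186]
$$(I^*, T^*) = \underset{I,\,T}{\arg\max}\ \frac{TP + TN}{TP + FP + TN + FN}$$
[187] (Here, I is the disease classification ratio; because it can vary, its selected value is written I*.
[188] T is the predetermined number of base variations set in advance; because it can also vary, its selected value is written T*. The maximum value of T is the total number of base variations included in the analysis information sorted according to I.
[189] TP is the number of cases in which a cancer sample is classified as cancer,
[190] TN is the number of cases in which a normal sample is classified as normal,
[191] FP is the number of cases in which a normal sample is classified as cancer, and
[192] FN is the number of cases in which a cancer sample is classified as normal.)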
[193]
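[194] With the method for detecting cancer-specific diagnostic markers in the genome according to one embodiment of the present invention, cancer diagnostic markers can be detected from whole genome sequencing information on cancer and normal samples, and the detected markers can be applied to cancer diagnosis chips, cancer diagnosis kits, cancer diagnosis terminal devices, cancer diagnosis systems, and the like. For example, the genome information of a sample can be acquired by a simple method such as blood collection and the cancer diagnostic marker then detected; if this can be applied to small medical products such as biochips, kits, terminal devices, and systems, it could have a large ripple effect in industries related to molecular diagnostics.
[195] The cancer-specific diagnostic marker detection method of the present invention can use genome sequence data obtained from actual cancer patients and normal subjects to compare and analyze the base sequence variation information and base sequence position information of cancer genomes and normal genomes. From the analysis information thus obtained, cancer-specific composite genome information can be evaluated and cancer-specific diagnostic markers derived.
[196] Furthermore, genome information can additionally be acquired over time to identify individual-specific genomic changes. For example, as a patient's disease progresses or is treated, genome information can be acquired at intervals and analyzed to map changes in the disease to changes in the genome.
[197] In addition, sample information can be collected from a diseased region and a non-diseased region of one patient and the genome information of the two samples analyzed, so that the specific genomic variation seen in the diseased sample is easily obtained.
[198] Although the present invention has been described above with reference to specific matters such as particular components and to limited embodiments and drawings, these are provided only to aid an overall understanding of the invention. The invention is not limited to the above embodiment, and a person of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.
[199]
Industrial Applicability
[200] The present invention relates to a method for detecting cancer-specific diagnostic markers in the genome and, more specifically, can detect cancer-specific genomic changes by identifying the relationship between cancer and genomic variation.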
[201]
[202]

Claims

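Claims
[Claim 1] A cancer diagnostic marker detection method implemented as a program executed by a processing means including a computer, the method comprising:
inputting whole genome sequencing information of cancer samples and normal samples;
obtaining analysis information by comparing and/or contrasting the whole genome sequencing information with reference genome sequence information;
deriving a disease classification ratio from the analysis information and sample information;
building, using the disease classification ratio, a library of cancer-specific base sequence information from the whole genome sequencing information of the cancer samples and normal samples; and
deriving, from the built library, a classification accuracy according to changes in the disease classification ratio and the number of base variations.
[Claim 2] The method of claim 1,
wherein the analysis information obtained by comparing and/or contrasting the whole genome sequencing information of the cancer samples and normal samples with the reference genome information includes chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), and base sequence information of the sample genome (ALT).
[Claim 3] The method of claim 1,
wherein the sample information includes at least one of the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples in which a base variation occurred, the number of cancer samples in which no base variation occurred, the number of normal samples in which a base variation occurred, and the number of normal samples in which no base variation occurred.
[Claim 4] The method of claim 1,
wherein the disease classification ratio is derived by obtaining base sequence variation information of the cancer samples and/or normal samples from the analysis information obtained by comparing and/or contrasting the whole genome sequencing information of the cancer samples and normal samples with the reference genome information, namely chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), and base sequence information of the sample genome (ALT), and
using as a parameter, for each mutated base sequence in the cancer samples and/or normal samples, at least one item of the sample information, namely the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples in which the base variation occurred, the number of cancer samples in which it did not occur, the number of normal samples in which it occurred, and the number of normal samples in which it did not occur.
[Claim 5] The method of claim 1,
wherein the library is built by obtaining base sequence variation information of the cancer samples and/or normal samples from the analysis information obtained by comparing and/or contrasting the whole genome sequencing information of the cancer samples and normal samples with the reference genome information, namely chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), and base sequence information of the sample genome (ALT),
deriving the disease classification ratio using as a parameter, for each mutated base in the cancer samples or normal samples, at least one item of the sample information, namely the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples in which the base variation occurred, the number of cancer samples in which it did not occur, the number of normal samples in which it occurred, and the number of normal samples in which it did not occur, and
building the library on the basis of the disease classification ratio derived for each mutated base.
[Claim 6] The method of claim 1,
wherein the classification accuracy is derived by obtaining base sequence variation information of the cancer samples and/or normal samples from the analysis information obtained by comparing and/or contrasting the whole genome sequencing information of the cancer samples and normal samples with the reference genome information, namely chromosome information (#CHROM), position information of the base sequence within the chromosome (POS), base sequence information of the reference genome (REF), and base sequence information of the sample genome (ALT),
deriving the disease classification ratio using as a parameter, for each mutated base in the cancer samples and/or normal samples, at least one item of the sample information, namely the total number of cancer and normal samples, the total number of cancer samples, the total number of normal samples, the number of cancer samples in which the base variation occurred, the number of cancer samples in which it did not occur, the number of normal samples in which it occurred, and the number of normal samples in which it did not occur,
building libraries on the basis of the disease classification ratio derived for each mutated base, and then
setting a specific number of base variations for each built library and calculating the classification accuracy of the samples for each set number.
[Claim 7] The method of claim 6,
wherein the classification accuracy of the samples is derived by the following formula according to changes in the disease classification ratio and the set specific number of base variations:
$$(I^*, T^*) = \underset{I,\,T}{\arg\max}\ \frac{TP + TN}{TP + FP + TN + FN}$$
(Here, I is the disease classification ratio of the base sequence; because it can vary, its selected value is written I*.
T is the predetermined number of base variations set in advance; because it can also vary, its selected value is written T*. The maximum value of T is the total number of base variations included in the analysis information sorted according to I.
TP is the number of cases in which a cancer sample is classified as cancer,
TN is the number of cases in which a normal sample is classified as normal,
FP is the number of cases in which a normal sample is classified as cancer, and
FN is the number of cases in which a cancer sample is classified as normal.)
[Claim 8] The method of claim 1,
further comprising, after inputting the whole genome sequencing information of the cancer samples and normal samples, extracting a target genome range for a specific cancer using the input whole genome sequencing information and the reference genome information.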
PCT/KR2017/001581 2016-11-08 2017-02-14 Detection of cancer-specific diagnostic markers in genome WO2018088635A1 (ko)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/323,948 US20190252040A1 (en) 2016-11-08 2017-02-14 Detection of cancer-specific diagnostic markers in genome

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2016-0147935 2016-11-08
KR20160147935 2016-11-08
KR1020170019559A KR101928094B1 (ko) 2016-11-08 2017-02-13 Detection of cancer-specific diagnostic markers in genome
KR10-2017-0019559 2017-02-13

Publications (1)

Publication Number Publication Date
WO2018088635A1 true WO2018088635A1 (ko) 2018-05-17

Family

ID=62109595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2017/001581 WO2018088635A1 (ko) 2016-11-08 2017-02-14 Detection of cancer-specific diagnostic markers in genome

Country Status (1)

Country Link
WO (1) WO2018088635A1 (ko)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014130444A1 (en) * 2013-02-19 2014-08-28 Genomic Health, Inc. Method of predicting breast cancer prognosis
US20140330162A1 (en) * 2011-12-08 2014-11-06 Koninklijke Philips N.V. Biological cell assessment using whole genome sequence and oncological therapy planning using same
KR20150024231A (ko) * 2014-02-21 2015-03-06 (주)신테카바이오 Method for discovering allele biomarkers
US20160273049A1 (en) * 2015-03-16 2016-09-22 Personal Genome Diagnostics, Inc. Systems and methods for analyzing nucleic acid
WO2016154493A1 (en) * 2015-03-24 2016-09-29 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for multi-scale, annotation-independent detection of functionally-diverse units of recurrent genomic alteration

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
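KR20190136733A (ko) * 2018-05-31 2019-12-10 한국과학기술원 Method for extracting disease diagnostic biomarkers using genomic variation information
KR102217272B1 (ko) 2018-05-31 2021-02-18 한국과학기술원 Method for extracting disease diagnostic biomarkers using genomic variation information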

Similar Documents

Publication Publication Date Title
AU784645B2 (en) Method for providing clinical diagnostic services
JP7057913B2 (ja) Big data analysis method and mass spectrometry system using the analysis method
US7881873B2 (en) Systems and methods for statistical genomic DNA based analysis and evaluation
CN112020565A (zh) Quality control templates for ensuring the validity of sequencing-based assays
KR101542529B1 (ko) Method for discovering allele biomarkers
EP2545481B1 (en) A method, an arrangement and a computer program product for analysing a biological or medical sample
CN102007407A (zh) Genome identification system
EP1613734A2 (en) Visualizing expression data on chromosomal graphic schemes
EP2864918B1 (en) Systems and methods for generating biomarker signatures
EP2923293A1 (en) Efficient comparison of polynucleotide sequences
KR101928094B1 (ko) Detection of cancer-specific diagnostic markers in genome
KR101967248B1 (ko) Method and apparatus for analyzing an individual's genetic information
CN109524060A (zh) Gene sequencing data processing system and processing method for genetic disease risk indication
WO2018088635A1 (ko) Detection of cancer-specific diagnostic markers in genome
US10083274B2 (en) Non-hypergeometric overlap probability
US20180181705A1 (en) Method, an arrangement and a computer program product for analysing a biological or medical sample
Ahmad et al. A review of genetic variant databases and machine learning tools for predicting the pathogenicity of breast cancer
KR20240065434A (ko) Patient management system capable of predicting cancer recurrence and metastasis
KR20200106643A (ko) High-sensitivity genetic variant detection and reporting system based on barcode sequence information
CN114730611A (zh) Methods and systems for combined DNA-RNA sequencing analysis to enhance variant calling performance and characterize variant expression status
WO2011124758A1 (en) A method, an arrangement and a computer program product for analysing a cancer tissue
dos Santos Valente Development of computational tools for the integrated analysis of DNA microarray data with applications in cancer research
Valente Development of computational tools for the integrated analysis of DNA microarray data with applications in cancer research

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17869742

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17869742

Country of ref document: EP

Kind code of ref document: A1