WO2024090667A1 - Race prediction system and method, using variant frequency - Google Patents

Race prediction system and method, using variant frequency

Info

Publication number
WO2024090667A1
Authority
WO
WIPO (PCT)
Prior art keywords
race
mutation
frequency
target
rate
Prior art date
Application number
PCT/KR2022/019581
Other languages
French (fr)
Korean (ko)
Inventor
한헌종
권기상
Original Assignee
주식회사 쓰리빌리언
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 쓰리빌리언
Publication of WO2024090667A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • Embodiments of the present invention relate to a system and method for predicting race, and more specifically, to a system and method for predicting race using the frequency of occurrence of mutations by race based on conditional probability.
  • This invention was made with the support of the Ministry of Science and ICT of the Republic of Korea under project unique number 1711160581 and project number 2022-0-00333.
  • the research management agency for the project is the Institute of Information & Communications Technology Planning & Evaluation (IITP), the research program name is "SW Computing Industry Source Technology Development (R&D)", the research project name is "Development of AI integrated SW solution for multi-faceted analysis of rare pediatric diseases", the host organization is Three Billion Co., Ltd., and the research period is 2022.04.01 to 2024.12.31.
  • the limitation of existing methods is that a mutation profile of size N*M must be constructed to create a prediction model.
  • that is, the mutation profiles of all samples used for prediction must be collected and analyzed, but such data is difficult to obtain and its analysis requires high-specification analysis equipment.
  • One embodiment of the present invention provides a race prediction system and method using mutation appearance frequency that can predict the race of a target quickly and accurately, without requiring a high-specification analysis device, by using mutation appearance frequencies summarized on the basis of a probability methodology.
  • a race prediction system using the frequency of mutation appearance includes a mutation frequency calculator that calculates the frequency of mutation appearance by race in conjunction with a population genome mutation database; a racial score calculation unit that calculates a score by race of the target using the frequency of occurrence of mutations by race; and a race prediction unit that predicts the race of the target based on the race score of the target.
  • the mutation frequency calculation unit collects, from the population genome mutation database, the number of mutation occurrences by race, which represents the number of times each mutation appears in a specific race, and the total number of people by race, and can calculate the frequency of occurrence of each mutation by race using the collected values.
  • the mutation frequency calculation unit calculates the proportion of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2), and calculates the proportion of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2); the frequency of occurrence of mutations by race may include the homozygote rate and the heterozygote rate.
  • to select discriminating mutations, the mutation frequency calculation unit selects mutations for which the total number of people (allele number / 2) is at least 1,000 and, considering that mutations that are too rare may distort the calculation, selects mutations whose allele count ratio across all races is between 5% and 95% inclusive as the target mutations for calculating the homozygote rate and the heterozygote rate.
  • the race prediction system using mutation frequency further includes a mutation frequency table construction unit that stores the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race, wherein the score calculation unit for each race searches the mutation frequency table by race for mutations related to the target (target mutations), loads the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations, and can thereby calculate a score by race for the target mutation set consisting of the plurality of target mutations.
  • the race-specific score calculation unit can use the homozygote rate and heterozygote rate values by race for each target mutation to calculate the target's score by race, based on the conditional probability that the target's mutation set (target mutation set) will appear in a specific race.
  • the ethnic score calculation unit may calculate the ethnic score using Equation 1 below.
  • V represents the set of target variants (v 1 , v 2 , ..., v n ), E is race, n is the number of target variants, and Pr(Vn
  • the race prediction unit may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.
  • a method for predicting race using the frequency of mutation appearance includes the steps of: a race prediction server calculating the frequency of mutation appearance by race in conjunction with a population genome mutation database; the race prediction server calculating a score by race of the target using the frequency of occurrence of mutations by race; and the race prediction server predicting the race of the target based on the score by race of the target.
  • the step of calculating the frequency of occurrence of mutations by race may include: collecting, from the population genome mutation database, the number of mutation occurrences by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race; and calculating the frequency of appearance of each mutation by race using the collected number of mutation occurrences by race and the total number of people by race.
  • the step of calculating the frequency of occurrence of each mutation by race includes: calculating the proportion of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2); and calculating the proportion of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). The frequency of occurrence of mutations by race may include the homozygote rate and the heterozygote rate.
  • the method further includes the race prediction server storing the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race, wherein the step of calculating the score by race of the target may include: searching the mutation frequency table by race for mutations related to the target (target mutations); and loading the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations and calculating a score by race for the target mutation set consisting of the plurality of target mutations.
  • a probability methodology is used rather than an existing machine learning technique such as PCA or random forest. Unlike existing methods, which require an N*M mutation profile consisting of N mutations and M samples to build a model, only summarized information on the number of occurrences of each mutation is required, so high-specification analysis equipment is not needed and the race of the target can be predicted quickly and accurately.
  • the results can be interpreted in more detail because the average value of the probabilities for each race is presented, and the predicted racial information can be usefully used in various research and clinical diagnosis. For example, if a mutation known to cause a specific disease is found in large numbers in people of race A who do not have the disease, the association between the mutation and the disease can be lowered only for race A. Additionally, in the case of diseases whose prevalence varies depending on race, additional clues can be obtained for diagnosing the disease by confirming the patient's race.
  • Figure 1 is a diagram illustrating the configuration of a race prediction system using mutation frequency according to an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating the detailed configuration of the race prediction server of FIG. 1.
  • Figure 3 is a diagram illustrating an example of a mutation frequency table by race generated according to an embodiment of the present invention.
  • FIG. 4 is a diagram illustrating a method of generating a table of variation occurrence frequencies by race according to an embodiment of the present invention.
  • Figure 5 is a diagram illustrating a method of calculating scores by race according to an embodiment of the present invention.
  • Figures 6 to 8 are diagrams to explain the process of calculating the frequency of occurrence of variations (homozygote rate and heterozygote rate) by race according to an embodiment of the present invention.
  • Figure 9 is a table showing variation information of a specific object used in the step of calculating conditional probability for each race according to an embodiment of the present invention.
  • Figure 10 is a flowchart illustrating a method for predicting race using mutation frequency according to an embodiment of the present invention.
  • "transmission" of a signal or information includes not only direct transmission from one component to another component but also transmission through other components.
  • in particular, "transmitting" or "sending" a signal or information to a component indicates the final destination of the signal or information and does not mean the direct destination. The same applies to "receiving" signals or information.
  • FIG. 1 is a configuration diagram of a race prediction system using mutation frequency according to an embodiment of the present invention, and FIG. 2 is a block diagram illustrating the detailed configuration of the race prediction server 110 of FIG. 1.
  • the race prediction system using the frequency of mutation appearance may be implemented as a race prediction server 110.
  • the race prediction server 110 may include a mutation frequency calculation unit 210, a mutation frequency table construction unit 220, a race score calculation unit 230, a race prediction unit 240, and a control unit 250.
  • the mutation frequency calculation unit 210 can calculate the mutation frequency by race in conjunction with the population genome mutation database 120.
  • the population genome mutation database 120 may be implemented as the gnomAD (Genome Aggregation Database) database.
  • the mutation frequency calculation unit 210 may collect the number of mutations by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race from the population genome mutation database 120. In addition, the mutation frequency calculation unit 210 may calculate the frequency of occurrence of each mutation by race using the collected number of occurrences of mutations by race and the total number of people by race.
  • the frequency of occurrence of mutations by race can be understood as a concept that includes the proportion of people with homozygote mutations (homozygote rate) and the proportion of people with heterozygote mutations (heterozygote rate) among all people.
  • the process for calculating the homozygote rate and the heterozygote rate is as follows.
  • to select discriminating mutations, the mutation frequency calculation unit 210 may apply restrictions on the total number of people (allele number / 2) and on the ratio of the allele count across all races.
  • specifically, the mutation frequency calculation unit 210 selects mutations for which the total number of people (allele number / 2) is at least 1,000 and, considering that mutations that are too rare may distort the calculation, selects mutations whose allele count ratio across all races is between 5% and 95% inclusive, thereby selecting the target mutations for calculating the homozygote rate and the heterozygote rate.
  • the mutation frequency table construction unit 220 may store the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race.
  • the variant frequency table by race can be generated as shown in FIG. 4.
  • accurate and fast prediction is possible because mutation information from the mutation frequency table by race, summarized by race, is used rather than mutation profile data.
  • the score calculation unit 230 for each race may calculate the score for each race of the target using the frequency of occurrence of mutations for each race.
  • the race-specific score calculation unit 230 may use the race-specific variant appearance frequency table.
  • the race-specific score calculation unit 230 searches the race-specific mutation appearance frequency table for mutations related to the target (target mutations), loads the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations, and can thereby calculate a score by race for the target mutation set consisting of the plurality of target mutations.
  • the race-specific score calculation unit 230 can use the homozygote rate and heterozygote rate values by race for each target mutation to calculate the score by race of the target, based on the conditional probability that the target's mutation set (target mutation set) will appear in a specific race.
  • the ethnicity score calculation unit 230 may calculate the ethnicity score using Equation 1 below.
  • V represents the set of target variants (v 1 , v 2 , ..., v n ), E is race, n is the number of target variants, and Pr(Vn
  • the race-specific score calculation unit 230 determines the probability (Pr(Vn
  • the race-specific score calculation unit 230 can calculate the geometric mean of the probabilities (the product of the n probabilities raised to the power 1/n), as in Equation 1 above, and use it as the score for each race.
  • for each sample, the race-specific score calculation unit 230 can take the zygosity (1/1: homozygote, 1/0: heterozygote) of each mutation, look up the corresponding rate by race in the mutation frequency table by race, and apply it to Equation 1 above to calculate the score by race.
  • the race prediction unit 240 may predict the race of the target based on the race score of the target. That is, the race prediction unit 240 may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.
  • the race prediction unit 240 can predict the race of sample A as American because the score of American is the highest at 0.615 for Sample A.
  • the race prediction unit 240 can predict the race of sample B as African because the score of African is the highest at 0.342 for Sample B.
  • if the scores of two or more races differ only slightly, the race prediction unit 240 may predict those races together as the race of the target. In other words, if the score difference between two or more specific races falls within a preset range, the race prediction unit 240 may predict those specific races as the target's race.
  • the race prediction unit 240 may apply a weight to the scores of the two or more specific races and predict the race with the highest final score as the target's race.
  • the race prediction unit 240 may apply the distribution ratio by race as a weight to calculate the final scores of the two or more specific races, and predict the race with the highest final score as the race of the target.
  • the control unit 250 can control the overall operations of the mutation frequency calculation unit 210, the mutation frequency table construction unit 220, the race score calculation unit 230, and the race prediction unit 240.
  • the control unit 250 may be implemented to functionally include some or all of the components, such as the mutation frequency calculation unit 210, the mutation frequency table construction unit 220, the race score calculation unit 230, and the race prediction unit 240. That is, the control unit 250 may perform some of the functions of the components or may perform all of the functions of the components.
  • the control unit 250 controls the overall operation of the race prediction server 110 and may include a processor such as a CPU.
  • the control unit 250 may control other components included in the race prediction server 110 to perform operations corresponding to user input received through the input/output unit.
  • the processor can process instructions within the computing device, such as displaying graphic information to provide a GUI (Graphic User Interface) on an external input or output device, such as a display connected to a high-speed interface.
  • multiple processors and/or multiple buses may be utilized along with multiple memories and memory types as appropriate.
  • the processor may be implemented as a chipset comprised of chips including multiple independent analog and/or digital processors.
  • the gnomAD data presents the allele count, allele number, and number of homozygote values for each mutation, and this data is provided separately for each race.
  • the sequence at a specific position on a specific chromosome is called an allele, and since each person carries two copies of each chromosome, each person has two alleles.
  • An allele can be the same sequence as the reference or an alternative (mutated sequence).
  • the allele count refers to the number of alleles corresponding to mutations found in a specific population. Since each person has two alleles, when the number of people is N, the allele count ranges from a minimum of 0 to a maximum of 2N. If it is known that the reference allele at a specific position is A and the alternative allele is T, the allele count refers to the number of T alleles found at that position.
  • the allele number is a number that represents the total number of alleles. Since it is the number of people * 2, it becomes 2N. In other words, dividing the allele number by 2 gives the total number of people.
  • Number of homozygote is the number of people who have a homozygote mutation. This represents the number of people, so the allele count of homozygote people is 2 * number of homozygote.
  • since the heterozygote rate is the proportion of people with a heterozygote allele, the number of heterozygote people can be calculated by subtracting the homozygote allele count (2 * number of homozygote) from the total allele count.
  • the process of calculating the homozygote rate and the heterozygote rate will be explained using Figure 6 as an example.
  • the reference allele is A and the alternative allele is T. And, of the 12 people, 3 are wild type, 7 are heterozygote, and the remaining 2 are homozygote.
  • the allele count is 11, the allele number is 24, and the number of homozygotes is 2.
  • the frequency of occurrence of mutations by race can be calculated by calculating the homozygote rate and heterozygote rate as described above.
  • the 1-976506-AGGCGGGGGC-A mutation was excluded because the total allele number was less than 2000.
  • the 1-1007245-C-G mutation was excluded because the allele count ratio was less than 5%.
  • the score for each race is calculated as follows.
  • 1-138593-G-T in the mutation frequency table by race in Figure 7 is not used in the calculation because it is not a mutation found in the subject.
  • 1-100293-G-C and 1-592801-G-GA are not used in the calculation because they are not in the mutation frequency table by race in Figure 7.
  • devices and components described in embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
  • a processing device may execute an operating system (OS) and one or more software applications that run on the operating system. Additionally, a processing device may access, store, manipulate, process, and generate data in response to the execution of software.
  • a single processing device may be described as being used; however, those skilled in the art will understand that a processing device may include multiple processing elements and/or multiple types of processing elements.
  • a processing device may include multiple processors or one processor and one controller. Additionally, other processing configurations, such as parallel processors, are possible.
  • Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively.
  • Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device.
  • Software may be distributed over networked computer systems and stored or executed in a distributed manner.
  • Software and data may be stored on one or more computer-readable recording media.
  • Figure 10 is a flowchart illustrating a method for predicting race using mutation frequency according to an embodiment of the present invention.
  • the race prediction method described here can be performed by the race prediction server (see 110 in FIG. 1).
  • the race prediction server can be understood as a concept that includes the components and functions of a race prediction system using mutation frequency according to an embodiment of the present invention.
  • the racial prediction method is only one embodiment of the present invention.
  • various steps may be added as needed, and the following steps may also be performed in a different order, so the present invention is not limited to each step and its sequence described below.
  • the race prediction server 110 may calculate the frequency of occurrence of mutations by race in conjunction with the population genome mutation database 120.
  • the race prediction server 110 collects, from the population genome mutation database 120, the number of mutation occurrences by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race, and can calculate the frequency of appearance of each mutation by race using the collected values.
  • the race prediction server 110 calculates the proportion of people with the homozygote mutation among all people (homozygote rate) using the number of people with a homozygote mutation (number of homozygote) and the total number of people (allele number / 2), and calculates the proportion of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2), thereby obtaining the frequency of occurrence of mutations by race (homozygote rate and heterozygote rate).
  • the race prediction server 110 may calculate a score by race of the target using the frequency of occurrence of mutations by race.
  • the race prediction server 110 stores the homozygote rate and the heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race, and can search that table for mutations related to the target (target mutations). Thereafter, the race prediction server 110 may load the values of the homozygote rate and heterozygote rate by race for each of the retrieved target mutations, and calculate a score by race for the target mutation set consisting of the plurality of target mutations.
  • the race prediction server 110 may predict the race of the target based on the race score of the target. At this time, the race prediction server 110 may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.
  • the method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium.
  • the computer-readable medium may include program instructions, data files, data structures, etc., singly or in combination.
  • Program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in the art of computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.
  • the hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.
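
The rate calculation and variant selection described in the list above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the names `VariantCounts`, `zygosity_rates`, and `is_discriminating` are hypothetical, and the "allele count ratio" is assumed here to mean allele count divided by allele number.

```python
from dataclasses import dataclass


@dataclass
class VariantCounts:
    """Summary counts for one variant in one population (illustrative names)."""
    allele_count: int    # number of alternative alleles observed
    allele_number: int   # total number of alleles observed (2 per person)
    n_homozygote: int    # people carrying the variant on both alleles


def zygosity_rates(c: VariantCounts) -> tuple[float, float]:
    """Return (homozygote rate, heterozygote rate) as described in the text."""
    if c.allele_number <= 0:
        raise ValueError("allele_number must be positive")
    total_people = c.allele_number / 2
    # Each homozygote person contributes 2 alternative alleles.
    n_heterozygote = c.allele_count - 2 * c.n_homozygote
    return c.n_homozygote / total_people, n_heterozygote / total_people


def is_discriminating(overall: VariantCounts,
                      min_people: int = 1000,
                      low: float = 0.05,
                      high: float = 0.95) -> bool:
    """Apply the selection thresholds to counts pooled over all races.

    Assumption: the allele count ratio is allele_count / allele_number.
    """
    if overall.allele_number <= 0:
        return False
    total_people = overall.allele_number / 2
    ratio = overall.allele_count / overall.allele_number
    return total_people >= min_people and low <= ratio <= high


# Worked example from the text: allele count 11, allele number 24, 2 homozygotes.
hom_rate, het_rate = zygosity_rates(VariantCounts(11, 24, 2))
```

With the worked example's counts, the 12 people split into 2 homozygote and 7 heterozygote carriers, so the rates come out to 2/12 and 7/12.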

Landscapes

  • Engineering & Computer Science
  • Physics & Mathematics
  • Health & Medical Sciences
  • Life Sciences & Earth Sciences
  • Bioinformatics & Cheminformatics
  • Medical Informatics
  • Theoretical Computer Science
  • Biophysics
  • Spectroscopy & Molecular Physics
  • General Health & Medical Sciences
  • Evolutionary Biology
  • Bioinformatics & Computational Biology
  • Biotechnology
  • Molecular Biology
  • Software Systems
  • Genetics & Genomics
  • Data Mining & Analysis
  • Artificial Intelligence
  • Evolutionary Computation
  • Proteomics, Peptides & Aminoacids
  • Databases & Information Systems
  • Computer Vision & Pattern Recognition
  • Bioethics
  • Physiology
  • Analytical Chemistry
  • Chemical & Material Sciences
  • Public Health
  • Ecology
  • Epidemiology
  • Computing Systems
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Mathematical Physics
  • Management, Administration, Business Operations System, And Electronic Commerce

Abstract

A race prediction system using variant frequency, according to an embodiment of the present invention, comprises: a variant frequency calculation unit for calculating the frequencies of variants in each race in conjunction with a population genomic variant database; a scores-by-race calculation unit for calculating scores-by-race of a subject using the frequencies of variants in each race; and a race prediction unit for predicting the race of the subject on the basis of the scores-by-race thereof.
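
The scoring and prediction units summarized in the abstract can be sketched as below. This is a hedged illustration only: it assumes that the score for a race is the geometric mean of per-variant probabilities, where the probability for each of the subject's variants is the homozygote rate or the heterozygote rate of that race according to the subject's zygosity. The table layout and the names `race_scores` and `predict_race` are hypothetical.

```python
import math
from typing import Optional

# Hypothetical table layout: variant ID -> race -> (homozygote rate, heterozygote rate)
FrequencyTable = dict[str, dict[str, tuple[float, float]]]


def race_scores(target: dict[str, str], table: FrequencyTable) -> dict[str, float]:
    """Score each race for a subject.

    `target` maps variant ID -> zygosity string; "1/1" is treated as
    homozygote and anything else as heterozygote. Variants missing from
    the table are skipped.
    """
    used = [v for v in target if v in table]
    if not used:
        return {}
    # Only score races that have rates for every variant in use.
    races = sorted(set.intersection(*(set(table[v]) for v in used)))
    scores = {}
    for race in races:
        log_sum = 0.0
        for v in used:
            hom_rate, het_rate = table[v][race]
            p = hom_rate if target[v] == "1/1" else het_rate
            if p <= 0.0:
                log_sum = -math.inf  # a zero probability forces a zero score
                break
            log_sum += math.log(p)
        # Geometric mean computed in log space to avoid underflow.
        scores[race] = math.exp(log_sum / len(used))
    return scores


def predict_race(scores: dict[str, float]) -> Optional[str]:
    """Return the race with the highest score, or None if nothing was scored."""
    return max(scores, key=scores.get) if scores else None
```

Working in log space is a design choice of this sketch rather than something the source specifies; it keeps the product of many small rates from underflowing when the subject has a large number of variants.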

Description

Race prediction system and method using variant frequency
Embodiments of the present invention relate to a system and method for predicting race, and more specifically, to a system and method for predicting race using the frequency of occurrence of mutations by race based on conditional probability.
This invention was made with the support of the Ministry of Science and ICT of the Republic of Korea under project unique number 1711160581 and project number 2022-0-00333. The research management agency for the project is the Institute of Information & Communications Technology Planning & Evaluation (IITP), the research program name is "SW Computing Industry Source Technology Development (R&D)", the research project name is "Development of AI integrated SW solution for multi-faceted analysis of rare pediatric diseases", the host organization is Three Billion Co., Ltd., and the research period is 2022.04.01 to 2024.12.31.
This patent application claims priority to Korean Patent Application No. 10-2022-0138807, filed with the Korean Intellectual Property Office on October 26, 2022, the disclosure of which is incorporated herein by reference.
Existing methods for predicting a person's race build a variant profile of each population from cohort data when constructing the prediction model. With N samples (people) and M variants in total, an N*M variant profile is built, and PCA (Principal Component Analysis) is then used to select a few of the most informative axes. To predict a subject's race with the resulting model, the method computes which population has the smallest average distance to the subject's variant information in the PCA space.
The limitation of these existing methods is that an N*M variant profile must be built to create the prediction model. That is, the variant profiles of all samples used for prediction must be collected and analyzed, but such data are hard to obtain and analyzing them requires high-specification equipment.
Because the largest public database that provides N*M variant profile information is the 1000 Genomes Project, existing methods rely heavily on this data. However, that dataset covers few races and has few samples per race, which makes accurate prediction difficult.
Accordingly, there is a need for a method that predicts race from variant information summarized by race rather than from variant profile data.
One embodiment of the present invention provides a race prediction system and method using variant frequency that predicts a subject's race from summarized variant frequencies using a probabilistic methodology, and can therefore predict the subject's race quickly and accurately without requiring high-specification analysis equipment.
The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.
A race prediction system using variant frequency according to an embodiment of the present invention includes: a variant frequency calculation unit that calculates per-race variant frequencies in conjunction with a population genomic variant database; a score-by-race calculation unit that calculates the subject's score for each race using the per-race variant frequencies; and a race prediction unit that predicts the subject's race on the basis of the subject's scores by race.
The variant frequency calculation unit may collect, from the population genomic variant database, the per-race variant occurrence count, which indicates how many times each variant appears in a specific race, and the total number of people in each race, and may calculate the per-race frequency of each variant from the collected counts and totals.
The variant frequency calculation unit may calculate the proportion of people who carry the variant as a homozygote (homozygote rate) from the number of homozygous carriers (number of homozygote) and the total number of people (allele number / 2), and may calculate the proportion of people who carry the variant as a heterozygote (heterozygote rate) from the number of heterozygous carriers (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). The per-race variant frequency may include the homozygote rate and the heterozygote rate.
To pick out discriminative variants when calculating the homozygote rate and the heterozygote rate, the variant frequency calculation unit may select the target variants for the calculation by choosing variants for which the total number of people (allele number / 2) is at least one thousand and, considering that a variant that is too rare can distort the calculation, for which the allele count ratio across all races is at least 5% and at most 95%.
The race prediction system using variant frequency according to an embodiment of the present invention may further include a variant frequency table construction unit that generates a per-race variant frequency table by storing, for each race, the homozygote rate and heterozygote rate calculated for each variant. The score-by-race calculation unit may look up the subject's variants (target variants) in the per-race variant frequency table, load the per-race homozygote rate and heterozygote rate values for each target variant found, and calculate a score by race for the target variant set made up of the plurality of target variants.
Using the per-race homozygote rate and heterozygote rate values for each target variant, the score-by-race calculation unit may calculate the subject's score for each race on the basis of the conditional probability that the set of variants carried by the subject (target variant set) appears in a specific race.
The score-by-race calculation unit may calculate the score by race (Ethnicity score) using Equation 1 below.
[Equation 1]
Figure PCTKR2022019581-appb-img-000001
Here, V denotes the target variant set (v1, v2, ..., vn), E denotes the race, n denotes the number of target variants, and Pr(vn|E) denotes the probability that target variant vn occurs in the given race.
The race prediction unit may predict, as the subject's race, the race with the highest score among the subject's scores by race.
A race prediction method using variant frequency according to an embodiment of the present invention includes: calculating, by a race prediction server, per-race variant frequencies in conjunction with a population genomic variant database; calculating, by the race prediction server, the subject's score for each race using the per-race variant frequencies; and predicting, by the race prediction server, the subject's race on the basis of the subject's scores by race.
Calculating the per-race variant frequencies may include: collecting, from the population genomic variant database, the per-race variant occurrence count, which indicates how many times each variant appears in a specific race, and the total number of people in each race; and calculating the per-race frequency of each variant from the collected counts and totals.
Calculating the per-race frequency of each variant may include: calculating the proportion of people who carry the variant as a homozygote (homozygote rate) from the number of homozygous carriers (number of homozygote) and the total number of people (allele number / 2); and calculating the proportion of people who carry the variant as a heterozygote (heterozygote rate) from the number of heterozygous carriers (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). The per-race variant frequency may include the homozygote rate and the heterozygote rate.
The race prediction method using variant frequency according to an embodiment of the present invention may further include generating, by the race prediction server, a per-race variant frequency table by storing, for each race, the homozygote rate and heterozygote rate calculated for each variant. Calculating the subject's scores by race may include: looking up the subject's variants (target variants) in the per-race variant frequency table; and loading the per-race homozygote rate and heterozygote rate values for each target variant found and calculating a score by race for the target variant set made up of the plurality of target variants.
Specific details of other embodiments are included in the detailed description and the accompanying drawings.
According to an embodiment of the present invention, a probabilistic methodology is used instead of machine learning techniques such as PCA or random forest. Whereas existing methods need an N*M variant profile consisting of N samples and M variants to build a model, this approach needs only summarized variant occurrence counts, so it does not require high-specification analysis equipment and can predict the subject's race quickly and accurately.
According to an embodiment of the present invention, the mean probability value is reported for every race, so the result can be interpreted in more detail, and the predicted race information can be put to use in a range of research and clinical diagnosis settings. For example, if a variant reported to cause a specific disease is found in many people of race A who do not have the disease, the association between that variant and the disease can be downgraded for race A only. In addition, for diseases whose prevalence differs by race, confirming the patient's race provides an additional clue for diagnosing the disease.
FIG. 1 is a configuration diagram of a race prediction system using variant frequency according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the detailed configuration of the race prediction server of FIG. 1.
FIG. 3 shows an example of a per-race variant frequency table generated according to an embodiment of the present invention.
FIG. 4 illustrates a method of generating the per-race variant frequency table according to an embodiment of the present invention.
FIG. 5 illustrates a method of calculating scores by race according to an embodiment of the present invention.
FIGS. 6 to 8 illustrate the process of obtaining the per-race variant frequencies (homozygote rate and heterozygote rate) according to an embodiment of the present invention.
FIG. 9 is a table of the variant information of a specific subject used in the step of calculating the per-race conditional probabilities according to an embodiment of the present invention.
FIG. 10 is a flowchart illustrating a race prediction method using variant frequency according to an embodiment of the present invention.
The advantages and/or features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.
In addition, to describe the technical components of the invention efficiently, the preferred embodiments below omit, wherever possible, system functional configurations that are already provided in each system or that are commonly provided in the technical field to which the invention pertains, and focus on the functional configurations that must be additionally provided for the present invention. A person of ordinary skill in the art will readily understand the functions of the conventionally used components among the functional configurations omitted below, and will also clearly understand the relationship between the omitted components and the components added for the present invention.
In the following description, terms such as "transmission", "communication", "sending", and "receiving" of a signal or information, and other terms of similar meaning, cover not only direct transfer of the signal or information from one component to another but also transfer by way of other components. In particular, "transmitting" or "sending" a signal or information to a component indicates the final destination of the signal or information and does not imply a direct destination. The same applies to "receiving" a signal or information.
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.
FIG. 1 is a configuration diagram of a race prediction system using variant frequency according to an embodiment of the present invention, and FIG. 2 is a block diagram illustrating the detailed configuration of the race prediction server 110 of FIG. 1.
Referring to FIGS. 1 and 2, the race prediction system using variant frequency according to an embodiment of the present invention may be implemented as a race prediction server 110. The race prediction server 110 may include a variant frequency calculation unit 210, a variant frequency table construction unit 220, a score-by-race calculation unit 230, a race prediction unit 240, and a control unit 250.
The variant frequency calculation unit 210 may calculate per-race variant frequencies in conjunction with a population genomic variant database 120. In this embodiment, the population genomic variant database 120 may be implemented as the gnomAD (Genome Aggregation Database) database.
Specifically, the variant frequency calculation unit 210 may collect, from the population genomic variant database 120, the per-race variant occurrence count, which indicates how many times each variant appears in a specific race, and the total number of people in each race. The variant frequency calculation unit 210 may then calculate the per-race frequency of each variant from the collected counts and totals.
Here, the per-race variant frequency can be understood as comprising the proportion of all people who carry the variant as a homozygote (homozygote rate) and the proportion who carry it as a heterozygote (heterozygote rate). The homozygote rate and the heterozygote rate are calculated as follows.
That is, the variant frequency calculation unit 210 may calculate the proportion of people who carry the variant as a homozygote (homozygote rate) from the number of homozygous carriers (number of homozygote) and the total number of people (allele number / 2): homozygote rate = number of homozygote / (allele number / 2).
Likewise, the variant frequency calculation unit 210 may calculate the proportion of people who carry the variant as a heterozygote (heterozygote rate) from the number of heterozygous carriers (allele count - 2 * number of homozygote) and the total number of people (allele number / 2): heterozygote rate = (allele count - 2 * number of homozygote) / (allele number / 2).
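As an illustration only, the two rate formulas above can be written as a small Python function. The function and argument names are assumptions made for this sketch; the source specifies only the arithmetic.

```python
def zygosity_rates(allele_count: int, allele_number: int, n_homozygotes: int) -> tuple[float, float]:
    """Return (homozygote_rate, heterozygote_rate) for one variant in one population.

    allele_count  : number of observed copies of the variant allele
    allele_number : number of alleles genotyped (two per person)
    n_homozygotes : number of people carrying two copies of the variant
    """
    n_people = allele_number / 2
    if n_people <= 0:
        raise ValueError("allele_number must be positive")
    # Every homozygote accounts for two copies; the remaining copies are heterozygotes.
    n_heterozygotes = allele_count - 2 * n_homozygotes
    if n_heterozygotes < 0:
        raise ValueError("allele_count is smaller than 2 * n_homozygotes")
    return n_homozygotes / n_people, n_heterozygotes / n_people
```

For a population of four people (allele number 8) with three homozygous carriers and one heterozygous carrier (allele count 7), `zygosity_rates(7, 8, 3)` gives a homozygote rate of 0.75 and a heterozygote rate of 0.25.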
When calculating the homozygote rate and the heterozygote rate, the variant frequency calculation unit 210 may apply numerical limits to the total number of people (allele number / 2) and to the allele count ratio across all races in order to pick out discriminative variants.
For example, the variant frequency calculation unit 210 may select the target variants for calculating the homozygote rate and the heterozygote rate by choosing variants for which the total number of people (allele number / 2) is at least one thousand and, considering that a variant that is too rare can distort the calculation, for which the allele count ratio across all races is at least 5% and at most 95%.
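A minimal sketch of this selection rule follows. It assumes that the "allele count ratio across all races" means allele count divided by allele number over all populations combined, which the source does not state explicitly; the function name and default arguments are illustrative.

```python
def is_informative(allele_count: int, allele_number: int,
                   min_people: int = 1000,
                   min_ratio: float = 0.05, max_ratio: float = 0.95) -> bool:
    """Decide whether a variant passes the example thresholds in the text."""
    if allele_number <= 0:
        return False
    # Total number of people is allele number / 2 (two alleles per person).
    if allele_number / 2 < min_people:
        return False
    # Assumed interpretation: ratio = allele count / allele number, all races pooled.
    ratio = allele_count / allele_number
    return min_ratio <= ratio <= max_ratio
```

Variants that fail either test are simply left out of the frequency table, so they never contribute to a subject's score.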
The variant frequency table construction unit 220 may generate a per-race variant frequency table by storing, for each race, the homozygote rate and heterozygote rate calculated for each variant.
For example, as shown in FIG. 3, the per-race variant frequency table may be generated by making a table with variants (variant 1, variant 2, ...) as rows and races (Korean, South Asian, African, American) as columns and filling in the calculated probability values (homozygote rate, heterozygote rate).
The method of generating the per-race variant frequency table is described in detail with reference to FIG. 4. Referring to FIG. 4, for variant 1-2-C-G, Humans 1, 2, and 3 of the American race each have both alleles of the pair changed from C to G relative to the Reference, so the homozygote rate is 100% (1.0) and the heterozygote rate is 0% (0.0).
Likewise, for variant 1-2-C-G, Humans 4 and 5 of the African race each have both alleles of the pair changed from C to G relative to the Reference, so the homozygote rate is 100% (1.0) and the heterozygote rate is 0% (0.0).
For variant 1-2-C-G in Humans 6, 7, 8, and 9 of the East Asian race, Humans 6, 7, and 8 have both alleles of the pair changed from C to G relative to the Reference, while Human 9 has only one of the two alleles changed, so the homozygote rate is 75% (0.75) and the heterozygote rate is 25% (0.25).
Repeating this process yields the per-race variant frequency table (Variant frequency table) shown in FIG. 4. Because this embodiment uses the variant information in the per-race variant frequency table, which is summarized by race, rather than variant profile data, accurate and fast prediction is possible.
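The FIG. 4 walk-through can be reproduced with a short sketch. It counts carriers from individual genotypes purely to mirror the figure; the system described here would read the equivalent summarized counts from the database instead. The data layout (alternate-allele count of 0, 1, or 2 per person) and all names are assumptions of this example.

```python
from collections import defaultdict

def build_frequency_table(people):
    """people: iterable of (race, {variant_id: alt_allele_count}) with counts 0, 1, or 2.

    Returns {variant_id: {race: (homozygote_rate, heterozygote_rate)}}.
    """
    totals = defaultdict(int)                    # people per race
    hom = defaultdict(lambda: defaultdict(int))  # variant -> race -> homozygous carriers
    het = defaultdict(lambda: defaultdict(int))  # variant -> race -> heterozygous carriers
    for race, genotypes in people:
        totals[race] += 1
        for variant, alt in genotypes.items():
            if alt == 2:
                hom[variant][race] += 1
            elif alt == 1:
                het[variant][race] += 1
    variants = set(hom) | set(het)
    return {v: {r: (hom[v][r] / n, het[v][r] / n) for r, n in totals.items()}
            for v in variants}

# Genotypes for variant 1-2-C-G as described for Humans 1 to 9 in FIG. 4.
fig4_people = (
    [("American", {"1-2-C-G": 2})] * 3
    + [("African", {"1-2-C-G": 2})] * 2
    + [("East Asian", {"1-2-C-G": 2})] * 3
    + [("East Asian", {"1-2-C-G": 1})]
)
table = build_frequency_table(fig4_people)
```

With this input, the East Asian entry for 1-2-C-G is (0.75, 0.25) and the American and African entries are both (1.0, 0.0), matching the values given in the text.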
The score-by-race calculation unit 230 may calculate the subject's score for each race using the per-race variant frequencies. To do so, the score-by-race calculation unit 230 may use the per-race variant frequency table.
That is, the score-by-race calculation unit 230 may look up the subject's variants (target variants) in the per-race variant frequency table, load the per-race homozygote rate and heterozygote rate values for each target variant found, and calculate a score by race for the target variant set made up of the plurality of target variants.
In doing so, the score-by-race calculation unit 230 may use the per-race homozygote rate and heterozygote rate values for each target variant to calculate the subject's score for each race on the basis of the conditional probability that the set of variants carried by the subject (target variant set) appears in a specific race.
For example, the score-by-race calculation unit 230 may calculate the score by race (Ethnicity score) using Equation 1 below.
[Equation 1]
Figure PCTKR2022019581-appb-img-000002
Here, V denotes the target variant set (v1, v2, ..., vn), E denotes the race, n denotes the number of target variants, and Pr(vn|E) denotes the probability that target variant vn occurs in the given race.
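The equation itself is available here only as an image reference. Going by the symbol definitions above and the description's statement that the score is the geometric mean (1/n-th power) of the per-variant probabilities, one plausible reading of Equation 1 is the following. It is a reconstruction under those assumptions, not the published form, and it treats the variants as independent given the race.

```latex
\[
  \text{Ethnicity score}(E)
  = \left( \prod_{i=1}^{n} \Pr(v_i \mid E) \right)^{1/n}
\]
```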
That is, the score-by-race calculation unit 230 may multiply together, from the per-race variant frequency table, the probabilities Pr(vn|E) that each target variant vn occurs in a specific race, that is, the per-race homozygote rate and heterozygote rate values of the target variants.
Because repeatedly multiplying probabilities can make the product extremely small, the score-by-race calculation unit 230 may, as in Equation 1, compute the geometric mean of the probabilities (the 1/n-th power of their product) and use it as the score for each race.
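The concern about very small products can be seen numerically. The sketch below computes the geometric mean in log space, which is a common way to avoid floating-point underflow; this is an implementation choice assumed for this example and is not stated in the source. The function name is hypothetical.

```python
import math

def ethnicity_score(probabilities):
    """Geometric mean of per-variant probabilities for one race."""
    if not probabilities:
        raise ValueError("at least one probability is required")
    # A single zero probability makes the whole product, and so the score, zero.
    if any(p <= 0.0 for p in probabilities):
        return 0.0
    # Averaging logarithms equals taking the 1/n-th power of the product.
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(mean_log)

# 400 variants at probability 0.1: the raw product underflows to 0.0 in
# double precision, while the geometric mean stays at about 0.1.
many = [0.1] * 400
raw_product = math.prod(many)
score = ethnicity_score(many)
```

The geometric mean also keeps scores comparable between subjects who have different numbers of variants in the table, since it normalizes by n.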
Regarding how the scores by race are calculated, referring to FIG. 5, the score-by-race calculation unit 230 may retrieve from the per-race variant frequency table (Variant frequency table), for each sample (Samples), the per-race values matching the zygosity of each target variant (1/1: Homozygote, 1/0: Heterozygote) and apply them to Equation 1 to calculate the scores by race.
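The lookup step might be sketched as follows. The table contents are invented for illustration (only the 1-2-C-G entries echo the FIG. 4 description), the zygosity strings follow the 1/1 and 1/0 notation above, and skipping variants that are absent from the table is an assumption of this example.

```python
import math

# Illustrative table: variant -> race -> (homozygote_rate, heterozygote_rate).
TABLE = {
    "1-2-C-G": {"American": (1.0, 0.0), "East Asian": (0.75, 0.25)},
    "1-5-A-T": {"American": (0.2, 0.4), "East Asian": (0.5, 0.5)},  # made-up values
}

def scores_by_race(table, subject):
    """subject maps variant ID -> zygosity: "1/1" (homozygote) or "1/0" (heterozygote)."""
    per_race = {}
    for variant, zygosity in subject.items():
        if variant not in table:
            continue  # variant not in the frequency table: ignored in this sketch
        for race, (hom_rate, het_rate) in table[variant].items():
            # Pick the rate that matches how the subject carries the variant.
            prob = hom_rate if zygosity == "1/1" else het_rate
            per_race.setdefault(race, []).append(prob)
    # Geometric mean of the selected probabilities for each race.
    return {race: math.prod(ps) ** (1 / len(ps)) for race, ps in per_race.items()}

subject = {"1-2-C-G": "1/1", "1-5-A-T": "1/0"}
result = scores_by_race(TABLE, subject)
```

Here the American score is the square root of 1.0 * 0.4 and the East Asian score is the square root of 0.75 * 0.5.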
도 5에서 아래의 표는 각 샘플(대상)들에 대하여 인종별 점수를 계산한 결과로서, Sample A의 경우 American 인종이 0.615, African 인종이 0.578, East Asian 인종이 0.3145의 값을 가지고, Sample B의 경우 American 인종이 0.275, African 인종이 0.342, East Asian 인종이 0.3314의 값을 가지는 것을 볼 수 있다. 이러한 값들은 후술하는 인종 예측부(240)에서 인종을 예측하는 데 활용될 수 있다.The table below in Figure 5 is the result of calculating scores by race for each sample (subject). For Sample A, the American race had a value of 0.615, the African race had a value of 0.578, and the East Asian race had a value of 0.3145, and Sample B In the case of , you can see that the American race has a value of 0.275, the African race has a value of 0.342, and the East Asian race has a value of 0.3314. These values can be used to predict race in the race prediction unit 240, which will be described later.
상기 인종 예측부(240)는 상기 대상의 인종별 점수에 기초하여 상기 대상의 인종을 예측할 수 있다. 즉, 상기 인종 예측부(240)는 상기 대상의 인종별 점수 중에서 가장 높은 점수에 해당하는 인종을 상기 대상의 인종으로 예측할 수 있다.The race prediction unit 240 may predict the race of the target based on the race score of the target. That is, the race prediction unit 240 may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.
도 5를 예로 들어 설명하면, 상기 인종 예측부(240)는 Sample A의 경우 American의 점수가 0.615로 가장 높으므로 대상 Sample A의 인종은 American으로 예측할 수 있다. 또한, 상기 인종 예측부(240)는 Sample B의 경우 African의 점수가 0.342로 가장 높으므로 대상 Sample B의 인종은 African으로 예측할 수 있다.5 as an example, the race prediction unit 240 can predict the race of sample A as American because the score of American is the highest at 0.615 for Sample A. In addition, the race prediction unit 240 can predict the race of sample B as African because the score of African is the highest at 0.342 for Sample B.
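The highest-score rule can be illustrated with the Figure 5 values quoted in the text. The dictionary layout and the function name `predict_race` are assumptions made for this sketch.

```python
# Race-specific scores for the two samples, as quoted from Figure 5.
scores = {
    "Sample A": {"American": 0.615, "African": 0.578, "East Asian": 0.3145},
    "Sample B": {"American": 0.275, "African": 0.342, "East Asian": 0.3314},
}

def predict_race(race_scores):
    """Return the race whose score is highest (hypothetical helper)."""
    return max(race_scores, key=race_scores.get)

predictions = {sample: predict_race(s) for sample, s in scores.items()}
print(predictions)
```

Note that the two leading scores for Sample B (0.342 and 0.3314) are close, which is the situation the following paragraphs address.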
Meanwhile, the subject's race-specific scores may differ only slightly. In that case, the race prediction unit 240 may predict all of the races with closely matched scores as the subject's race. In other words, if the score difference between two or more specific races falls within a preset range, the race prediction unit 240 may predict those specific races as the subject's race.
In this case, the race prediction unit 240 may also apply weights to the scores of the two or more specific races and predict the race with the highest final score as the subject's race. For example, the race prediction unit 240 may apply the distribution ratio of each race as a weight to compute final scores for the two or more specific races, and predict the race with the highest final score as the subject's race.
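One possible reading of this tie-handling rule is sketched below. The text does not specify the preset range, how it is measured, or how the weights are combined with the scores, so everything here is an assumption: `margin` is measured from the top score, the weight is multiplied into the score, and the weight values are invented purely for illustration.

```python
def candidate_races(race_scores, margin):
    """Races whose score lies within `margin` of the top score (assumed rule)."""
    best = max(race_scores.values())
    return [race for race, s in race_scores.items() if best - s <= margin]

def predict_with_weights(race_scores, margin, weights):
    """Break near-ties by multiplying each candidate's score by its weight."""
    candidates = candidate_races(race_scores, margin)
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates, key=lambda race: race_scores[race] * weights[race])

# Sample B scores from Figure 5; margin and weights are hypothetical.
sample_b = {"American": 0.275, "African": 0.342, "East Asian": 0.3314}
weights = {"American": 0.2, "African": 0.3, "East Asian": 0.5}

near_ties = candidate_races(sample_b, margin=0.02)
final = predict_with_weights(sample_b, margin=0.02, weights=weights)
```

With these made-up weights the near-tie between African and East Asian resolves to East Asian, which shows how strongly the outcome depends on the chosen weights.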
The control unit 250 may control the overall operation of the variant frequency calculation unit 210, the variant frequency table construction unit 220, the race-specific score calculation unit 230, the race prediction unit 240, and so on. The control unit 250 may be implemented so as to functionally include some or all of these components. That is, the control unit 250 may perform some of the functions of these components, or alternatively all of their functions.
The control unit 250 controls the overall operation of the race prediction server 110 and may include a processor such as a CPU. The control unit 250 may control the other components of the race prediction server 110 so that they perform operations corresponding to user input received through the input/output unit. Here, the processor can process instructions within a computing device; such instructions include, for example, instructions stored in memory or a storage device for displaying graphic information to provide a GUI (Graphic User Interface) on an external input/output device, such as a display connected to a high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and memory types. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.
The following describes in detail, for one embodiment of the present invention, an example of the step of calculating race-specific variant frequencies and an example of the step of calculating the subject's race-specific conditional probabilities.
1. Example of the step of calculating race-specific variant frequencies
The gnomAD data give, for each variant, the allele count, the allele number, and the number of homozygotes, and these values are provided separately for each race. The sequence at a specific position on a specific chromosome is called an allele; because each person carries two copies of each chromosome, each person has two alleles. An allele may be the same sequence as the reference or an alternative (variant) sequence.
The allele count is the number of variant alleles found in a given population. Because each person has two alleles, for N people the allele count ranges from a minimum of 0 to a maximum of 2N. If the reference allele at a given position is known to be A and the alternative (variant) allele is T, the allele count is the number of T alleles found at that position.
The allele number is the total number of alleles; it equals the number of people * 2, that is, 2N. Dividing the allele number by 2 therefore gives the total number of people. The number of homozygotes is the number of people carrying the variant in the homozygous state. Because this is a count of people, the allele count contributed by homozygous people is 2 * number of homozygotes.
The variant frequencies are calculated from these gnomAD values as follows.
First, the homozygote rate is the proportion of all people who carry the variant in the homozygous state, so it can be calculated as homozygote rate = number of homozygous people / total number of people = number of homozygotes / (allele number / 2).
The heterozygote rate is the proportion of people who carry the variant in the heterozygous state. Because each heterozygous person contributes exactly one variant allele, the number of heterozygous people is obtained by subtracting the homozygote allele count from the total allele count. That is, heterozygote rate = number of heterozygous people / total number of people = (allele count - homozygote allele count) / (allele number / 2) = (allele count - 2 * number of homozygotes) / (allele number / 2).
The calculation of the homozygote rate and the heterozygote rate is illustrated using Figure 6. The reference allele is A and the alternative allele is T. Of 12 people, 3 are wild type, 7 are heterozygotes, and the remaining 2 are homozygotes. The allele count is 11, the allele number is 24, and the number of homozygotes is 2.
Substituting these values into the formulas for the homozygote rate and the heterozygote rate gives the following results.
homozygote rate
= number of homozygotes / (allele number / 2)
= 2 / (24 / 2) = 2 / 12 = 0.167
heterozygote rate
= (allele count - 2 * number of homozygotes) / (allele number / 2)
= (11 - 2*2) / (24/2) = 7 / 12 = 0.583
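The two formulas can be checked with a few lines of Python. The function name `zygosity_rates` and its parameter names are hypothetical; the inputs are the Figure 6 values quoted in the text.

```python
def zygosity_rates(allele_count, allele_number, n_homozygotes):
    """Return (homozygote rate, heterozygote rate) for one variant in one race."""
    people = allele_number / 2  # two alleles per person
    homozygote_rate = n_homozygotes / people
    # Each homozygote accounts for two variant alleles; the rest are heterozygotes.
    heterozygote_rate = (allele_count - 2 * n_homozygotes) / people
    return homozygote_rate, heterozygote_rate

hom, het = zygosity_rates(allele_count=11, allele_number=24, n_homozygotes=2)
print(round(hom, 3), round(het, 3))
```

The remaining fraction, 1 - hom - het, corresponds to the 3 wild-type people out of 12 in the Figure 6 example.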
In an embodiment of the present invention, the race-specific variant frequencies can be calculated by obtaining the homozygote rate and the heterozygote rate as described above.
Accordingly, given gnomAD data such as that in Figure 7, the race-specific variant frequencies are calculated as in Figure 8. For example, for the heterozygous form of variant 1-69270-A-G, heterozygote rate in African = (allele count - 2 * number of homozygotes) / (allele number / 2) = (1590 - 2*493) / (4428/2) = 0.27281.
Variant 1-976506-AGCGGGGGC-A was excluded because its total allele number is below 2000. Variant 1-1007245-C-G was excluded because its allele count ratio is below 5%.
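The African example and the two exclusion rules can be sketched as follows. The thresholds (allele number of at least 2000, allele count ratio between 5% and 95%) come from the text and claim 4; the function name `is_informative` and the totals passed to it in the usage lines are hypothetical, since the Figure 7 totals are not reproduced in the text.

```python
# Worked check of the African heterozygote rate for 1-69270-A-G (values from the text).
african_het = (1590 - 2 * 493) / (4428 / 2)

def is_informative(total_allele_count, total_allele_number,
                   min_allele_number=2000, min_ratio=0.05, max_ratio=0.95):
    """Keep a variant only if it is well sampled and neither too rare nor too common."""
    if total_allele_number < min_allele_number:
        return False  # fewer than 1,000 people genotyped at this site
    ratio = total_allele_count / total_allele_number
    return min_ratio <= ratio <= max_ratio

# Hypothetical totals, for illustration only.
print(round(african_het, 5))
print(is_informative(500, 1500))     # too few alleles
print(is_informative(100, 10000))    # allele count ratio of 1%
print(is_informative(3000, 10000))   # kept
```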
2. Example of the step of calculating the subject's race-specific conditional probabilities
Assume that the variant information of a particular subject is as shown in Figure 9.
Using this subject's variant information and the race-specific variant frequencies calculated from the gnomAD data in the example above, the race-specific scores are calculated as follows.
African score = (homozygote rate of 1-69270-A-G in African * heterozygote rate of 1-324822-A-T in African)^(1/2) = (0.2226 * 0.0118)^(1/2) = 0.0513
East Asian score = (homozygote rate of 1-69270-A-G in East Asian * heterozygote rate of 1-324822-A-T in East Asian)^(1/2) = 0.1398
Here, 1-138593-G-T, which appears in the race-specific variant frequency table of Figure 7, is not used in the calculation because it was not found in the subject. Likewise, the subject's variants 1-100293-G-C and 1-592801-G-GA are not used in the calculation because they are not in the race-specific variant frequency table of Figure 7.
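The lookup-and-skip logic of this example can be sketched as below. Only the two African rates (0.2226 and 0.0118) come from the text. The table layout, the function name `score`, the placeholder rate for 1-138593-G-T, and the zygosity assigned to the two subject variants that are absent from the table are all assumptions, because Figures 7 to 9 are not reproduced here.

```python
import math

# (variant, zygosity) -> rate in the African population.
african_table = {
    ("1-69270-A-G", "hom"): 0.2226,   # quoted in the text
    ("1-324822-A-T", "het"): 0.0118,  # quoted in the text
    ("1-138593-G-T", "het"): 0.5,     # placeholder value, not from the source
}

# Subject variants and their zygosity; the last two zygosities are assumed.
subject = {
    "1-69270-A-G": "hom",
    "1-324822-A-T": "het",
    "1-100293-G-C": "het",
    "1-592801-G-GA": "het",
}

def score(subject_variants, table):
    """Geometric mean over subject variants that are present in the table."""
    rates = [table[(v, z)] for v, z in subject_variants.items() if (v, z) in table]
    if not rates:
        return None  # no overlap between subject and table
    return math.prod(rates) ** (1 / len(rates))

african_score = score(subject, african_table)
print(round(african_score, 4))
```

Only the two variants shared by the subject and the table contribute, so n is 2 in the geometric mean, matching the square root in the worked example.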
The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPA (field programmable array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.
Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.
Figure 10 is a flowchart illustrating a race prediction method using variant frequency according to an embodiment of the present invention.
The race prediction method described here may be performed by the race prediction server (see 110 in Figure 1). The race prediction server may be understood as encompassing the components and functions of the race prediction system using variant frequency according to an embodiment of the present invention.
The race prediction method below is only one embodiment of the present invention. Additional steps may be added as needed, and the steps below may be performed in a different order, so the present invention is not limited to the steps and order described below.
Referring to Figures 1 and 10, in step 1010 the race prediction server 110 may calculate race-specific variant frequencies in conjunction with the population genome variant database 120.
To this end, the race prediction server 110 may collect from the population genome variant database 120 the race-specific variant occurrence count, which indicates the number of times each variant appears in a specific race, and the total number of people in each race, and may use these collected values to calculate the race-specific frequency of each variant.
Specifically, the race prediction server 110 may use the number of people carrying the homozygous variant (number of homozygotes) and the total number of people (allele number / 2) to calculate the proportion of all people who carry the homozygous variant (homozygote rate), and may use the number of people carrying the heterozygous variant (allele count - 2 * number of homozygotes) and the total number of people (allele number / 2) to calculate the proportion of all people who carry the heterozygous variant (heterozygote rate), thereby obtaining the race-specific variant frequencies (homozygote rate and heterozygote rate).
Next, in step 1020, the race prediction server 110 may calculate the subject's race-specific scores using the race-specific variant frequencies.
To this end, the race prediction server 110 may store the homozygote rate and heterozygote rate calculated for each variant in a table by race to generate a race-specific variant frequency table, and may search that table for the variants relating to the subject (subject variants). The race prediction server 110 may then load the race-specific homozygote rate and heterozygote rate values for each subject variant found, and calculate a race-specific score for the subject variant set made up of the plurality of subject variants.
Next, in step 1030, the race prediction server 110 may predict the subject's race based on the subject's race-specific scores. Here, the race prediction server 110 may predict the race with the highest score among the subject's race-specific scores as the subject's race.
The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.
Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims (12)

  1. A race prediction system using variant frequency, comprising:
    a variant frequency calculation unit that calculates race-specific variant frequencies in conjunction with a population genome variant database;
    a race-specific score calculation unit that calculates race-specific scores of a subject using the race-specific variant frequencies; and
    a race prediction unit that predicts the race of the subject based on the subject's race-specific scores.
  2. The race prediction system using variant frequency according to claim 1,
    wherein the variant frequency calculation unit
    collects, from the population genome variant database, a race-specific variant occurrence count indicating the number of times each variant appears in a specific race and the total number of people in each race, and calculates the race-specific frequency of each variant using the collected race-specific variant occurrence count and total number of people in each race.
  3. The race prediction system using variant frequency according to claim 2,
    wherein the variant frequency calculation unit
    calculates the proportion of all people who carry a homozygous variant (homozygote rate) using the number of people carrying the homozygous variant (number of homozygotes) and the total number of people (allele number / 2), and calculates the proportion of all people who carry a heterozygous variant (heterozygote rate) using the number of people carrying the heterozygous variant (allele count - 2 * number of homozygotes) and the total number of people (allele number / 2), and
    wherein the race-specific variant frequencies
    include the homozygote rate and the heterozygote rate.
  4. The race prediction system using variant frequency according to claim 3,
    wherein the variant frequency calculation unit,
    when calculating the homozygote rate and the heterozygote rate, selects the variants to be used for that calculation by choosing, in order to pick out discriminating variants, variants for which the total number of people (allele number / 2) is at least 1,000, and, considering that a variant that is too rare may affect the calculation, variants whose allele count ratio across all races is at least 5% and at most 95%.
  5. The race prediction system using variant frequency according to claim 3,
    further comprising a variant frequency table construction unit that stores the homozygote rate and the heterozygote rate calculated for each variant in a table by race to generate a race-specific variant frequency table,
    wherein the race-specific score calculation unit
    searches the race-specific variant frequency table for variants relating to the subject (subject variants), loads the race-specific homozygote rate and heterozygote rate values for each subject variant found, and calculates a race-specific score for a subject variant set made up of the plurality of subject variants.
  6. The race prediction system using variant frequency according to claim 5,
    wherein the race-specific score calculation unit
    uses the race-specific homozygote rate and heterozygote rate values for each subject variant to calculate the subject's race-specific scores based on the conditional probability that the set of variants carried by the subject (subject variant set) appears in a specific race.
  7. The race prediction system using variant frequency according to claim 6,
    wherein the race-specific score calculation unit
    calculates the race-specific score (Ethnicity score) using Equation 1 below.
    [Equation 1]
    Figure PCTKR2022019581-appb-img-000003
    where V denotes the subject variant set (v1, v2, ..., vn), E denotes the race, n denotes the number of subject variants, and Pr(Vn|E) denotes the probability that subject variant vn occurs in a specific race.
  8. The race prediction system using variant frequency according to claim 1,
    wherein the race prediction unit
    predicts the race with the highest score among the subject's race-specific scores as the subject's race.
  9. A race prediction method using variant frequency, performed by a race prediction server, the method comprising:
    calculating, by the race prediction server, race-specific variant frequencies in conjunction with a population genome variant database;
    calculating, by the race prediction server, race-specific scores of a subject using the race-specific variant frequencies; and
    predicting, by the race prediction server, the race of the subject based on the subject's race-specific scores.
  10. The race prediction method using variant frequency according to claim 9,
    wherein calculating the race-specific variant frequencies comprises:
    collecting, from the population genome variant database, a race-specific variant occurrence count indicating the number of times each variant appears in a specific race and the total number of people in each race; and
    calculating the race-specific frequency of each variant using the collected race-specific variant occurrence count and total number of people in each race.
  11. The method of claim 10,
    wherein the step of calculating the race-specific frequency of each variant comprises:
    calculating the homozygote rate, which is the proportion of all people who carry the homozygote variant, from the number of people carrying the homozygote variant (number of homozygote) and the total number of people (allele number / 2); and
    calculating the heterozygote rate, which is the proportion of all people who carry the heterozygote variant, from the number of people carrying the heterozygote variant (allele count - 2 * number of homozygote) and the total number of people (allele number / 2),
    and wherein the race-specific variant frequency comprises the homozygote rate and the heterozygote rate.
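The two rates defined in claim 11 can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the names `VariantRates` and `variant_rates` are hypothetical, and the input checks are an added assumption. The arithmetic follows the claim text: total people = allele number / 2, heterozygote carriers = allele count - 2 * number of homozygote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRates:
    """Per-race rates for one variant (hypothetical container)."""
    homozygote_rate: float
    heterozygote_rate: float


def variant_rates(allele_count: int, allele_number: int, n_homozygote: int) -> VariantRates:
    """Compute the homozygote and heterozygote rates as worded in claim 11."""
    # Total number of people = allele number / 2 (two alleles per person).
    total_people = allele_number / 2
    if total_people <= 0:
        raise ValueError("allele_number must be positive")

    # Number of heterozygote carriers = allele count - 2 * number of homozygote.
    n_heterozygote = allele_count - 2 * n_homozygote
    if n_homozygote < 0 or n_heterozygote < 0:
        raise ValueError("inconsistent allele_count / n_homozygote")

    return VariantRates(
        homozygote_rate=n_homozygote / total_people,
        heterozygote_rate=n_heterozygote / total_people,
    )


# Made-up example: 200 alleles observed (100 people), 30 variant alleles,
# 5 homozygous carriers, so 20 heterozygous carriers.
example = variant_rates(allele_count=30, allele_number=200, n_homozygote=5)
```

In this made-up example the homozygote rate is 5 / 100 and the heterozygote rate is 20 / 100. The function would be called once per variant and per race to fill the frequency table.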
  12. The method of claim 11,
    further comprising a step in which the race prediction server stores the homozygote rate and the heterozygote rate calculated for each variant in a table, by race, to generate a race-specific variant frequency table,
    wherein the step of calculating the target's race-specific scores comprises:
    searching the race-specific variant frequency table for the variants of the target (target variants); and
    loading the race-specific homozygote rate and heterozygote rate for each target variant found, and calculating race-specific scores for the target variant set consisting of the plurality of target variants.
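The table lookup and scoring flow of claim 12, followed by the prediction step of claim 9, might look like the sketch below. The claims in this section do not state the scoring formula, so the rule used here (summing the log of the rate that matches the target's zygosity, with a small floor for zero rates) is purely an assumption for illustration. The table layout, the race labels "A" and "B", the variant identifiers, and the function names are likewise hypothetical.

```python
import math

# Hypothetical table layout:
# {variant_id: {race: (homozygote_rate, heterozygote_rate)}}
FrequencyTable = dict[str, dict[str, tuple[float, float]]]


def race_scores(table: FrequencyTable,
                target_variants: dict[str, str],
                floor: float = 1e-6) -> dict[str, float]:
    """Score each race for a target variant set.

    target_variants maps variant_id -> "hom" or "het".
    ASSUMED scoring rule (not taken from the claims): sum of log rates.
    """
    scores: dict[str, float] = {}
    for variant_id, zygosity in target_variants.items():
        per_race = table.get(variant_id)
        if per_race is None:
            continue  # target variant not found in the frequency table
        for race, (hom_rate, het_rate) in per_race.items():
            rate = hom_rate if zygosity == "hom" else het_rate
            # Floor avoids log(0) for variants unseen in a race.
            scores[race] = scores.get(race, 0.0) + math.log(max(rate, floor))
    return scores


def predict_race(scores: dict[str, float]) -> str:
    """Pick the race with the highest score."""
    if not scores:
        raise ValueError("no target variant was found in the table")
    return max(scores, key=scores.get)


# Made-up data for illustration only.
table: FrequencyTable = {
    "v1": {"A": (0.05, 0.20), "B": (0.001, 0.02)},
    "v2": {"A": (0.0, 0.01), "B": (0.10, 0.30)},
}
target = {"v1": "het", "v2": "het", "v3": "hom"}  # "v3" is absent from the table
scores = race_scores(table, target)
predicted = predict_race(scores)
```

Working in log space keeps the score numerically stable when many target variants are combined. Any other monotone combination of the loaded rates would fit the same lookup-then-score structure.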
PCT/KR2022/019581 2022-10-26 2022-12-05 Race prediction system and method, using variant frequency WO2024090667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220138807A KR102529401B1 (en) 2022-10-26 2022-10-26 Ethnicity prediction system and method using variant frequency
KR10-2022-0138807 2022-10-26

Publications (1)

Publication Number Publication Date
WO2024090667A1 (en) 2024-05-02

Family

ID=86381233

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/019581 WO2024090667A1 (en) 2022-10-26 2022-12-05 Race prediction system and method, using variant frequency

Country Status (2)

Country Link
KR (1) KR102529401B1 (en)
WO (1) WO2024090667A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489729A (en) * 2020-12-04 2021-03-12 北京诺禾致源科技股份有限公司 Gene data query method and device and nonvolatile storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972406B2 (en) * 2012-06-29 2015-03-03 International Business Machines Corporation Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
KR102138165B1 (en) * 2020-01-02 2020-07-27 주식회사 클리노믹스 Method for providing identity analyzing service using standard genome map database by nationality, ethnicity, and race

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KYUNG SUN PARK: "Analysis of worldwide carrier frequency and predicted genetic prevalence of congenital hypothyroidism based on a general population database", GENES, vol. 12, 20 August 2020 (2020-08-20), pages 1 - 9, XP093163273, DOI: 10.22541/au.159795396.63518982 *
SANNA GUDMUNDSSON; MORIEL SINGER-BERK; NICHOLAS A. WATTS; WILLIAM PHU; JULIA K. GOODRICH; MATTHEW SOLOMONSON; GENOME AGGREGATION D: "Variant interpretation using population databases: lessons from gnomAD", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 July 2021 (2021-07-23), XP091126968, DOI: 10.1002/humu.24309 *
TAO HUANG, YANG SHU, YU-DONG CAI: "Genetic differences among ethnic groups", BMC GENOMICS, vol. 16, no. 1, 21 December 2015 (2015-12-21), pages 1 - 10, XP055700079, DOI: 10.1186/s12864-015-2328-0 *

Also Published As

Publication number Publication date
KR102529401B1 (en) 2023-05-08

Similar Documents

Publication Publication Date Title
WO2018106005A1 (en) System for diagnosing disease using neural network and method therefor
WO2024090667A1 (en) Race prediction system and method, using variant frequency
WO2020096098A1 (en) Method for managing annotation work, and apparatus and system supporting same
WO2014038781A1 (en) Clustering support system and method, and device for supporting same
WO2020149447A1 (en) Insurance recommendation system and operating method therefor
WO2017116123A1 (en) System for identifying cause of disease using genetic variation information on individual's genome
WO2021149913A1 (en) Method and device for selecting disease-related gene in ngs analysis
WO2020111378A1 (en) Method and system for analyzing data in order to aid diagnosis of disease
WO2017116135A1 (en) System and method for analyzing genotype using genetic variation information on individual's genome
WO2022145564A1 (en) Model automatic compression method and device for deep-learning model serving optimization, and method for providing cloud inference service using same
WO2020032562A2 (en) Bioimage diagnosis system, bioimage diagnosis method, and terminal for executing same
WO2018030733A1 (en) Method and system for analyzing measurement-yield correlation
WO2022080583A1 (en) Deep learning-based bitcoin block data prediction system taking into account time series distribution characteristics
WO2018088585A1 (en) Method for managing taking medicine and device therefor
WO2017116139A1 (en) System for analyzing bioactive variation using genetic variation information on individual's genome
WO2018088824A1 (en) Method and apparatus for detecting abnormal user by using click log data
WO2023090825A1 (en) Ai model drift monitoring device and method
WO2024005474A1 (en) Augmented-reality service device and method for providing proper distance display
WO2015126058A1 (en) Method for predicting prognosis of cancer
WO2016085262A2 (en) Virtual drug screening method, intensive screening library constructing method, and system therefor
WO2015053480A1 (en) System and method for analyzing biological samples
WO2023113445A1 (en) Method and apparatus for floating point arithmetic
WO2023013959A1 (en) Apparatus and method for predicting amyloid beta accumulation
WO2022245063A1 (en) Method and system for analyzing genome and medical information and developing pharmaceutical substance on basis of artificial intelligence
WO2020235730A1 (en) Learning performance prediction method based on scan pattern of learner in video learning environment