WO2024090667A1

WO2024090667A1 - Race prediction system and method, using variant frequency

Info

Publication number: WO2024090667A1
Application number: PCT/KR2022/019581
Authority: WO
Inventors: 한헌종; 권기상
Original assignee: 주식회사 쓰리빌리언
Priority date: 2022-10-26
Filing date: 2022-12-05
Publication date: 2024-05-02
Also published as: KR102529401B1

Abstract

A race prediction system using variant frequency, according to an embodiment of the present invention, comprises: a variant frequency calculation unit for calculating the frequencies of variants in each race in conjunction with a population genomic variant database; a scores-by-race calculation unit for calculating scores-by-race of a subject using the frequencies of variants in each race; and a race prediction unit for predicting the race of the subject on the basis of the scores-by-race thereof.

Description

Racial prediction system and method using mutation frequency

Embodiments of the present invention relate to a system and method for predicting race, and more specifically, to a system and method for predicting race using the frequency of occurrence of mutations by race based on conditional probability.

This invention was made with the support of the Ministry of Science and ICT of the Republic of Korea under project number 1711160581 and task number 2022-0-00333. The research management agency for the project is the Institute of Information & Communications Technology Planning & Evaluation (IITP), the research program name is "SW Computing Industry Source Technology Development (R&D)", the research project name is "Development of AI integrated SW solution for multi-faceted analysis of rare pediatric diseases", the host organization is Three Billion Co., Ltd., and the research period is 2022.04.01 to 2024.12.31.

This patent application claims priority to Korean Patent Application No. 10-2022-0138807, filed with the Korean Intellectual Property Office on October 26, 2022, the disclosure of which is incorporated herein by reference.

Existing methods for predicting a person's race use cohort data to create a group mutation profile when creating a prediction model. When the number of samples (number of people) in the data is N and the total number of mutations is M, a mutation profile of size N*M is constructed, and then several of the most meaningful axes are selected through PCA (Principal Component Analysis). When predicting the race of a target through a created prediction model, a method is used to calculate which group the target's mutation information has the closest average distance to in the created PCA.

The limitation of existing methods is that N*M mutation profiles must be constructed to create a prediction model. In other words, the mutation profiles of all samples used for prediction must be collected and analyzed, but such data is difficult to obtain and analysis requires high-spec analysis equipment.

Since the largest public database that provides mutation profile information of size N*M is the 1000 Genomes Project, existing methods make extensive use of this data. However, accurate predictions are difficult because the data covers only a small number of races and includes only a small number of samples for each race.

Therefore, there is a need to develop a method to predict race using mutation information summarized by race rather than mutation profile data.

One embodiment of the present invention provides a race prediction system and method using the frequency of appearance of mutations, which predicts the race of a target from summarized mutation frequencies based on a probabilistic methodology, so that the race of the target can be predicted quickly and accurately without requiring a high-specification analysis device.

The problem to be solved by the present invention is not limited to the problem(s) mentioned above, and other problem(s) not mentioned will be clearly understood by those skilled in the art from the description below.

A race prediction system using the frequency of mutation appearance according to an embodiment of the present invention includes a mutation frequency calculator that calculates the frequency of mutation appearance by race in conjunction with a population genome mutation database; a racial score calculation unit that calculates a score by race of the target using the frequency of occurrence of mutations by race; and a race prediction unit that predicts the race of the target based on the race score of the target.

The mutation frequency calculation unit may collect, from the population genome mutation database, the number of occurrences of each mutation by race, which represents the number of times each mutation appears in a specific race, and the total number of people by race, and may calculate the frequency of occurrence of each mutation by race using the collected number of occurrences by race and the total number of people by race.

The mutation frequency calculation unit calculates the ratio of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2), and calculates the ratio of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). The frequency of occurrence of mutations by race may include the homozygote rate and the heterozygote rate.

When calculating the homozygote rate and the heterozygote rate, the mutation frequency calculation unit may, in order to select discriminating mutations, select mutations for which the total number of people (allele number / 2) is at least 1,000 and, considering that a mutation that is too rare may distort the calculation, select mutations with an allele count ratio of 5% or more and 95% or less in the overall race as the target mutations for calculating the homozygote rate and the heterozygote rate.

The race prediction system using mutation frequency according to an embodiment of the present invention further includes a mutation frequency table construction unit that stores the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race. The score calculation unit for each race searches the mutation frequency table by race for mutations related to the target (target mutations), loads the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations, and can thereby calculate a score by race for the target mutation set consisting of the plurality of target mutations.

The race-specific score calculation unit may use the homozygote rate and heterozygote rate values by race for each target mutation to calculate the score by race of the target based on the conditional probability that the mutation set of the target (target mutation set) will appear in a specific race.

The ethnic score calculation unit may calculate the ethnic score using Equation 1 below.

[Equation 1]

$$\mathrm{Score}(E) = \left( \prod_{i=1}^{n} \Pr(v_i \mid E) \right)^{1/n}$$

Here, V represents the set of target variants (v_1, v_2, ..., v_n), E is a race, n is the number of target variants, and Pr(v_i|E) represents the probability that the target variant v_i occurs in a particular race E.

The race prediction unit may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.

A method for predicting race using the frequency of mutation appearance according to an embodiment of the present invention includes the steps of: a race prediction server calculating the frequency of mutation appearance by race in conjunction with a population genome mutation database; the race prediction server calculating a score by race of the target using the frequency of occurrence of mutations by race; and the race prediction server predicting the race of the target based on the score by race of the target.

The step of calculating the frequency of occurrence of mutations by race may include: collecting, from the population genome mutation database, the number of occurrences of each mutation by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race; and calculating the frequency of occurrence of each mutation by race using the collected number of occurrences by race and the total number of people by race.

The step of calculating the frequency of occurrence of each mutation by race includes: calculating the ratio of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2); and calculating the ratio of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). The frequency of occurrence of mutations by race may include the homozygote rate and the heterozygote rate.

The race prediction method using mutation frequency according to an embodiment of the present invention further includes the step of the race prediction server storing the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race. The step of calculating the score by race of the target may include: searching the mutation frequency table by race for mutations related to the target (target mutations); and loading the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations and calculating a score by race for the target mutation set consisting of the plurality of target mutations.

Specific details of other embodiments are included in the detailed description and accompanying drawings.

According to one embodiment of the present invention, a probabilistic methodology is used rather than a machine learning technique such as existing PCA or random forest. Unlike existing methods, which require an N*M mutation profile consisting of N samples and M mutations to build a model, only summarized information on the number of occurrences of mutations is required, so high-specification analysis equipment is not required for analysis, and the race of the target can be predicted quickly and accurately.

According to one embodiment of the present invention, the results can be interpreted in more detail because the average value of the probabilities for each race is presented, and the predicted racial information can be usefully used in various research and clinical diagnosis. For example, if a mutation known to cause a specific disease is found in large numbers in people of race A who do not have the disease, the association between the mutation and the disease can be lowered only for race A. Additionally, in the case of diseases whose prevalence varies depending on race, additional clues can be obtained for diagnosing the disease by confirming the patient's race.

Figure 1 is a diagram illustrating the configuration of a race prediction system using mutation frequency according to an embodiment of the present invention.

FIG. 2 is a block diagram illustrating the detailed configuration of the race prediction server of FIG. 1.

Figure 3 is a diagram illustrating an example of a mutation frequency table by race generated according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a method of generating a table of variation occurrence frequencies by race according to an embodiment of the present invention.

Figure 5 is a diagram illustrating a method of calculating scores by race according to an embodiment of the present invention.

Figures 6 to 8 are diagrams to explain the process of calculating the frequency of occurrence of variations (homozygote rate and heterozygote rate) by race according to an embodiment of the present invention.

Figure 9 is a table showing variation information of a specific object used in the step of calculating conditional probability for each race according to an embodiment of the present invention.

Figure 10 is a flowchart illustrating a method for predicting race using mutation frequency according to an embodiment of the present invention.

The advantages and/or features of the present invention and methods for achieving them will become clear by referring to the embodiments described in detail below in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the technical field to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In addition, in describing the preferred embodiments of the present invention below, in order to explain the technical components constituting the present invention efficiently, system functional configurations that are already provided or that are commonly provided in the technical field to which the present invention pertains will be omitted wherever possible, and the description will focus on the functional configurations that must be additionally provided for the present invention. A person of ordinary skill in the technical field to which the present invention pertains will be able to easily understand the functions of the conventionally used components among the functional configurations not shown and omitted below, and will also clearly understand the relationships between the omitted components and the components added for the present invention.

In addition, in the following description, "transmission", "communication", "sending", "reception" and other similar terms relating to signals or information refer not only to the direct transfer of signals or information from one component to another component, but also to transfer through other components. In particular, "transmitting" or "sending" a signal or information to a component indicates the final destination of the signal or information and does not mean the direct destination. The same applies to "receiving" signals or information.

Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings.

FIG. 1 is a configuration diagram of a racial prediction system using mutation frequency according to an embodiment of the present invention, and FIG. 2 is a block diagram illustrating the detailed configuration of the racial prediction server 110 of FIG. 1.

Referring to Figures 1 and 2, the race prediction system using the frequency of mutation appearance according to an embodiment of the present invention may be implemented as a race prediction server 110. The race prediction server 110 may include a mutation frequency calculation unit 210, a mutation frequency table construction unit 220, a race score calculation unit 230, a race prediction unit 240, and a control unit 250.

The mutation frequency calculation unit 210 can calculate the mutation frequency by race in conjunction with the population genome mutation database 120. In this embodiment, the population genome mutation database 120 may be implemented as the gnomAD (Genome Aggregation Database) database.

Specifically, the mutation frequency calculation unit 210 may collect the number of mutations by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race from the population genome mutation database 120. In addition, the mutation frequency calculation unit 210 may calculate the frequency of occurrence of each mutation by race using the collected number of occurrences of mutations by race and the total number of people by race.

Here, the frequency of occurrence of mutations by race can be understood as a concept that includes the proportion of people with homozygote mutations (homozygote rate) and the proportion of people with heterozygote mutations (heterozygote rate) among all people. The process for calculating the homozygote rate and the heterozygote rate is as follows.

That is, the mutation frequency calculation unit 210 can calculate the ratio of people with the homozygote mutation among all people (homozygote rate) using the number of people with a homozygote mutation (number of homozygote) and the total number of people (allele number / 2): homozygote rate = number of homozygote / (allele number / 2).

In addition, the mutation frequency calculation unit 210 can calculate the ratio of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2): heterozygote rate = (allele count - 2 * number of homozygote) / (allele number / 2).
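As a minimal sketch of the two formulas above, the rates could be computed from the three summary counts as follows. The function name `zygosity_rates`, its argument names, and the input validation are illustrative assumptions that do not appear in the original; the numbers in the usage lines are the African counts quoted later in the description for the 1-69270-A-G mutation.

```python
def zygosity_rates(allele_count: int, allele_number: int, n_homozygotes: int) -> tuple[float, float]:
    """Return (homozygote_rate, heterozygote_rate) for one mutation in one race."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    n_people = allele_number / 2                      # total number of people
    n_heterozygotes = allele_count - 2 * n_homozygotes  # people carrying exactly one copy
    if n_homozygotes < 0 or n_heterozygotes < 0:
        raise ValueError("counts are inconsistent")
    return n_homozygotes / n_people, n_heterozygotes / n_people


# Usage with the counts quoted in the description (AC=1590, AN=4428, homozygotes=493).
hom_rate, het_rate = zygosity_rates(1590, 4428, 493)
print(round(het_rate, 5))  # 0.27281
```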

In order to select discriminating mutations when calculating the homozygote rate and the heterozygote rate, the mutation frequency calculation unit 210 may apply restrictions on the total number of people (allele number / 2) and on the ratio of the allele count in the overall race.

For example, the mutation frequency calculation unit 210 may select mutations for which the total number of people (allele number / 2) is at least 1,000 and, considering that a mutation that is too rare may distort the calculation, select mutations with an allele count ratio of 5% or more and 95% or less in the overall race, thereby selecting the target mutations for calculating the homozygote rate and the heterozygote rate.
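The selection rule can be sketched as a simple predicate. This is only an illustration under stated assumptions: the description does not define the "allele count ratio" precisely, so the sketch assumes it means allele count divided by allele number for whichever population the counts describe. The names `is_discriminating`, `MIN_PEOPLE`, `MIN_AC_RATIO`, and `MAX_AC_RATIO` are hypothetical.

```python
MIN_PEOPLE = 1000      # minimum total number of people (allele number / 2)
MIN_AC_RATIO = 0.05    # lower bound on the allele count ratio
MAX_AC_RATIO = 0.95    # upper bound on the allele count ratio


def is_discriminating(allele_count: int, allele_number: int) -> bool:
    """Return True if the mutation passes both selection thresholds.

    Assumption: allele count ratio = allele_count / allele_number.
    """
    if allele_number <= 0:
        return False
    n_people = allele_number / 2
    if n_people < MIN_PEOPLE:
        return False
    ratio = allele_count / allele_number
    return MIN_AC_RATIO <= ratio <= MAX_AC_RATIO
```

Under this reading, a mutation observed in fewer than 1,000 people (allele number below 2,000) is rejected regardless of its frequency.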

The mutation frequency table construction unit 220 may store the homozygote rate and heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race.

For example, as shown in Figure 3, a table may be created with the variants (variant 1, variant 2, ...) as rows and the races (Korean, South Asian, African, American) as columns, and the calculated probability values (homozygote rate, heterozygote rate) may be filled in to create the mutation frequency table by race.

The method of generating the mutation frequency table by race will be described in detail with reference to FIG. 4. Referring to FIG. 4, for the mutation 1-2-C-G, in the American race consisting of Humans 1, 2, and 3, both alleles of every person are mutated from C to G relative to the reference, so the homozygote rate is 100% (1.0) and the heterozygote rate is 0% (0.0).

In addition, for 1-2-C-G, in the African race consisting of Humans 4 and 5, both alleles of every person are mutated from C to G relative to the reference, so the homozygote rate is 100% (1.0) and the heterozygote rate is 0% (0.0).

In addition, for 1-2-C-G, in the East Asian race consisting of Humans 6, 7, 8, and 9, both alleles of Humans 6, 7, and 8 are mutated from C to G relative to the reference, and only one of the pair of alleles of Human 9 is mutated, so the homozygote rate is 75% (0.75) and the heterozygote rate is 25% (0.25).

By repeating this process, the variant frequency table by race can be generated as shown in FIG. 4. In this embodiment, accurate and fast prediction is possible because mutation information from the mutation frequency table by race, summarized by race, is used rather than mutation profile data.
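The FIG. 4 walkthrough can be reproduced with a short sketch. It is illustrative only: it starts from per-person genotypes (encoded here as the number of mutated alleles, 0, 1, or 2), whereas the embodiment works from summarized counts. The names `rates_from_genotypes` and `build_frequency_table` and the nested-dictionary layout are assumptions, not part of the original.

```python
def rates_from_genotypes(alt_copies: list[int]) -> tuple[float, float]:
    """Return (homozygote_rate, heterozygote_rate) from per-person mutated-allele counts."""
    n_people = len(alt_copies)
    n_hom = sum(1 for copies in alt_copies if copies == 2)
    n_het = sum(1 for copies in alt_copies if copies == 1)
    return n_hom / n_people, n_het / n_people


def build_frequency_table(genotypes: dict[str, dict[str, list[int]]]) -> dict[str, dict[str, tuple[float, float]]]:
    """Map mutation -> race -> (homozygote_rate, heterozygote_rate)."""
    return {
        mutation: {race: rates_from_genotypes(people) for race, people in by_race.items()}
        for mutation, by_race in genotypes.items()
    }


# Genotypes for 1-2-C-G as described for FIG. 4 (Humans 1-3, 4-5, and 6-9).
fig4 = {
    "1-2-C-G": {
        "American": [2, 2, 2],
        "African": [2, 2],
        "East Asian": [2, 2, 2, 1],
    }
}
table = build_frequency_table(fig4)
print(table["1-2-C-G"]["East Asian"])  # (0.75, 0.25)
```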

The score calculation unit 230 for each race may calculate the score for each race of the target using the frequency of occurrence of mutations for each race. For this purpose, the race-specific score calculation unit 230 may use the race-specific variant appearance frequency table.

That is, the race-specific score calculation unit 230 searches the race-specific mutation appearance frequency table for mutations related to the target (target mutations), loads the homozygote rate and heterozygote rate values by race for each of the retrieved target mutations, and can thereby calculate a score by race for the target mutation set consisting of the plurality of target mutations.

At this time, the race-specific score calculation unit 230 may use the homozygote rate and heterozygote rate values by race for each target mutation to calculate the score by race of the target based on the conditional probability that the mutation set of the target (target mutation set) will appear in a specific race.

For example, the ethnicity score calculation unit 230 may calculate the ethnicity score using Equation 1 below.

[Equation 1]

$$\mathrm{Score}(E) = \left( \prod_{i=1}^{n} \Pr(v_i \mid E) \right)^{1/n}$$

That is, the race-specific score calculation unit 230 may look up, in the race-specific variant appearance frequency table, the probability Pr(v_i|E) that each target variant v_i occurs in a specific race E, that is, the homozygote rate or heterozygote rate value by race corresponding to each target variant, and multiply all of these values together.

At this time, since repeatedly multiplying probabilities can make the product very small, the racial score calculation unit 230 may calculate the geometric mean of the probabilities (the product raised to the power 1/n), as in Equation 1 above, and use it as the score for each race.
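A minimal sketch of the geometric-mean score follows. Computing it through the sum of logarithms is an implementation choice made here to avoid floating-point underflow for long mutation lists; the description only specifies the geometric mean itself. The function name `race_score` and the handling of zero probabilities are assumptions.

```python
import math


def race_score(probabilities: list[float]) -> float:
    """Geometric mean of Pr(v_i | E) over the target mutations for one race."""
    if not probabilities:
        raise ValueError("at least one probability is required")
    if any(p <= 0.0 for p in probabilities):
        return 0.0  # assumption: a zero rate makes the whole product zero
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(log_sum / len(probabilities))


# The plain product of 2,000 probabilities of 0.1 underflows to 0.0,
# while the geometric mean stays at about 0.1.
many = [0.1] * 2000
print(math.prod(many))             # 0.0
print(round(race_score(many), 6))  # 0.1
```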

Regarding the method of calculating the score by race, referring to FIG. 5, the race score calculation unit 230 may, for each sample, retrieve from the racial variant frequency table the rate by race that corresponds to the zygosity (1/1: homozygote, 1/0: heterozygote) of each of the sample's mutations, and apply these values to Equation 1 above to calculate the score by race.

The lower table in Figure 5 shows the result of calculating scores by race for each sample (subject). For Sample A, the American race has a value of 0.615, the African race 0.578, and the East Asian race 0.3145. For Sample B, the American race has a value of 0.275, the African race 0.342, and the East Asian race 0.3314. These values can be used to predict race in the race prediction unit 240, which will be described later.

The race prediction unit 240 may predict the race of the target based on the race score of the target. That is, the race prediction unit 240 may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.

Taking FIG. 5 as an example, the race prediction unit 240 can predict the race of Sample A as American because the American score of 0.615 is the highest for Sample A. In addition, the race prediction unit 240 can predict the race of Sample B as African because the African score of 0.342 is the highest for Sample B.
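The highest-score rule can be sketched in a few lines, using the Sample A and Sample B scores quoted from FIG. 5. The function name `predict_race` and the dictionary layout are illustrative assumptions.

```python
def predict_race(scores: dict[str, float]) -> str:
    """Return the race whose score is highest."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=scores.get)


sample_a = {"American": 0.615, "African": 0.578, "East Asian": 0.3145}
sample_b = {"American": 0.275, "African": 0.342, "East Asian": 0.3314}
print(predict_race(sample_a))  # American
print(predict_race(sample_b))  # African
```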

Meanwhile, there may be cases where the scores by race of the target differ only slightly. In this case, the race prediction unit 240 may predict the races whose scores differ only slightly as the race of the target. In other words, if the score difference between two or more specific races falls within a preset range, the race prediction unit 240 may predict those specific races as the race of the target.

At this time, the race prediction unit 240 may apply a weight to the scores of the two or more specific races and predict the race with the highest final score as the race of the target. For example, the race prediction unit 240 may apply the distribution ratio by race as a weight to calculate the final scores of the two or more specific races, and predict the race with the highest calculated final score as the race of the target.
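One possible reading of this weighting step is sketched below. The description does not specify the preset range, the weight values, or how the weight is applied, so the `tolerance` of 0.02, the weights in the usage lines, and the choice to multiply score by weight are all hypothetical assumptions, as is the function name `predict_with_weights`.

```python
def predict_with_weights(scores: dict[str, float], weights: dict[str, float], tolerance: float = 0.02) -> str:
    """Pick the top race; if several races are within `tolerance` of the top, re-rank them by score * weight."""
    best = max(scores.values())
    close = [race for race, score in scores.items() if best - score <= tolerance]
    if len(close) == 1:
        return close[0]
    # Assumption: the weight (e.g. distribution ratio by race) multiplies the score.
    return max(close, key=lambda race: scores[race] * weights[race])


# Sample B scores from FIG. 5; the weights are made-up placeholders.
sample_b = {"American": 0.275, "African": 0.342, "East Asian": 0.3314}
hypothetical_weights = {"American": 0.3, "African": 0.2, "East Asian": 0.5}
print(predict_with_weights(sample_b, hypothetical_weights))  # East Asian
```

With these placeholder weights the African and East Asian scores fall within the tolerance, so the weighted comparison decides the result; with a tolerance of 0 the function reduces to the plain highest-score rule.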

The control unit 250 may control the overall operations of the mutation frequency calculation unit 210, the mutation frequency table construction unit 220, the racial score calculation unit 230, and the racial prediction unit 240. The control unit 250 may be implemented so as to functionally include some or all of these components, namely the mutation frequency calculation unit 210, the mutation frequency table construction unit 220, the racial score calculation unit 230, and the racial prediction unit 240. That is, the control unit 250 may perform some of the functions of the components or may perform all of the functions of the components.

The control unit 250 controls the overall operation of the race prediction server 110 and may include a processor such as a CPU. The control unit 250 may control other components included in the race prediction server 110 to perform operations corresponding to user input received through the input/output unit. Here, the processor can process instructions within the computing device, including instructions stored in memory or on a storage device, for displaying graphic information to provide a GUI (Graphic User Interface) on an external input or output device, such as a display connected to a high-speed interface. In other embodiments, multiple processors and/or multiple buses may be utilized along with multiple memories and memory types as appropriate. Additionally, the processor may be implemented as a chipset comprised of chips including multiple independent analog and/or digital processors.

Below, in an embodiment of the present invention, the steps of calculating the frequency of occurrence of mutations by race and calculating the conditional probability by race of the target will be described in detail.

1. Example of calculating the frequency of mutations by race

The gnomAD data presents the allele count, allele number, and number of homozygote values for each mutation, and this data is provided separately for each race. The sequence at a specific position on a specific chromosome is called an allele, and since each person has two copies of each chromosome, each person has two alleles. An allele can be either the same sequence as the reference or an alternative (mutated) sequence.

The allele count refers to the number of alleles corresponding to mutations found in a specific population. Since each person has two alleles, when the number of people is N, the allele count ranges from a minimum of 0 to a maximum of 2N. If it is known that the reference allele at a specific position is A and the alternative allele is T, the allele count refers to the number of T alleles found at that position.

The allele number is a number that represents the total number of alleles. Since it is the number of people * 2, it becomes 2N. In other words, dividing the allele number by 2 gives the total number of people. Number of homozygote is the number of people who have a homozygote mutation. This represents the number of people, so the allele count of homozygote people is 2 * number of homozygote.

To calculate the mutation frequency using these numbers provided by gnomAD, the following process is performed.

First, the homozygote rate refers to the ratio of people with a homozygote mutation among all people, so it can be calculated as: homozygote rate = number of homozygotes / total number of people = number of homozygote / (allele number / 2).

Since the heterozygote rate is the proportion of people with a heterozygote allele, it can be calculated by subtracting the homozygote allele count from the total allele count. In other words, heterozygote rate = number of heterozygotes / total number of people = (allele count - homozygote allele count) / (allele number / 2) = (allele count - 2 * number of homozygote) / (allele number / 2).

The process of calculating the homozygote rate and the heterozygote rate will be explained using Figure 6 as an example. The reference allele is A and the alternative allele is T. And, of the 12 people, 3 are wild type, 7 are heterozygote, and the remaining 2 are homozygote. The allele count is 11, the allele number is 24, and the number of homozygotes is 2.

Substituting the above values into the formula for calculating the homozygote rate and the heterozygote rate gives the following results.

homozygote rate

= number of homozygotes / (allele number / 2)

= 2 / (24 / 2) = 2 / 12 = 0.167

heterozygote rate

= (allele count - 2 * number of homozygotes) / (allele number / 2)

= (11 - 2*2) / (24/2) = 7 / 12 = 0.583
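The Figure 6 example can be checked end to end with a short sketch that starts from the 12 genotypes (3 wild type, 7 heterozygote, 2 homozygote) and derives the three summary counts before applying the two formulas. The genotype strings such as "A/T" and the function name `summarize_genotypes` are illustrative assumptions.

```python
def summarize_genotypes(genotypes: list[str], alt: str) -> tuple[int, int, int]:
    """Return (allele_count, allele_number, n_homozygotes) for the alternative allele."""
    allele_count = 0
    n_homozygotes = 0
    for genotype in genotypes:
        copies = genotype.split("/").count(alt)  # 0, 1, or 2 copies of the alternative allele
        allele_count += copies
        if copies == 2:
            n_homozygotes += 1
    return allele_count, 2 * len(genotypes), n_homozygotes


# Reference allele A, alternative allele T, as in the example above.
people = ["A/A"] * 3 + ["A/T"] * 7 + ["T/T"] * 2
allele_count, allele_number, n_homozygotes = summarize_genotypes(people, alt="T")
homozygote_rate = n_homozygotes / (allele_number / 2)
heterozygote_rate = (allele_count - 2 * n_homozygotes) / (allele_number / 2)

print(allele_count, allele_number, n_homozygotes)              # 11 24 2
print(round(homozygote_rate, 3), round(heterozygote_rate, 3))  # 0.167 0.583
```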

In an embodiment of the present invention, the frequency of occurrence of mutations by race can be calculated by calculating the homozygote rate and heterozygote rate as described above.

Therefore, when there is gnomAD data as shown in Figure 7, the frequency of occurrence of mutations by race is calculated as shown in Figure 8. For example, for the 1-69270-A-G mutation, the heterozygote rate in African is calculated as (allele count - 2 * number of homozygotes) / (allele number / 2) = (1590 - 2*493) / (4428/2) = 0.27281.

The 1-976506-AGGCGGGGGC-A mutation was excluded because the total allele number was less than 2000, that is, the total number of people was less than 1,000. The 1-1007245-C-G mutation was excluded because the allele count ratio was less than 5%.

2. Example of calculating the conditional probability by race of the target

Assume that the mutation information of a specific subject is as shown in Figure 9.

Using this subject's mutation information and the mutation frequency by race calculated using gnomAD data in the example above, the score for each race is calculated as follows.

African score = (1-69270-A-G homozygote rate in African * 1-324822-A-T heterozygote rate in African)^(1/2) = (0.2226 * 0.0118)^(1/2) = 0.0513

East Asian score = (1-69270-A-G homozygote rate in East Asian * 1-324822-A-T heterozygote rate in East Asian)^(1/2) = 0.1398

Here, 1-138593-G-T in the mutation frequency table by race in Figure 7 is not used in the calculation because it is not a mutation found in the subject. In addition, among the target's mutations, 1-100293-G-C and 1-592801-G-GA are not used in the calculation because they are not in the mutation frequency table by race in Figure 7.
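The African calculation above, including the skipping of mutations that are absent from the table, can be sketched as follows. Only the two African rates quoted in the example (0.2226 and 0.0118) come from the description; the table layout, the function name `score_for_race`, and the zygosities assigned to 1-100293-G-C and 1-592801-G-GA are hypothetical placeholders, since Figure 9 is not reproduced in the text.

```python
import math

# mutation -> race -> zygosity -> rate; only the rates quoted in the example are filled in.
frequency_table = {
    "1-69270-A-G": {"African": {"hom": 0.2226}},
    "1-324822-A-T": {"African": {"het": 0.0118}},
    "1-138593-G-T": {"African": {}},  # in the table but not found in the subject
}

# Subject mutations and zygosity; the last two zygosities are placeholders.
subject = {
    "1-69270-A-G": "hom",
    "1-324822-A-T": "het",
    "1-100293-G-C": "het",
    "1-592801-G-GA": "het",
}


def score_for_race(subject: dict[str, str], table: dict, race: str) -> float:
    """Geometric mean of the rates for the subject's mutations that exist in the table."""
    rates = [table[m][race][zyg] for m, zyg in subject.items() if m in table]
    if not rates:
        raise ValueError("no subject mutation is present in the table")
    return math.prod(rates) ** (1 / len(rates))


print(round(score_for_race(subject, frequency_table, "African"), 4))  # 0.0513
```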

The device described above may be implemented with hardware components, software components, and/or a combination of hardware components and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may execute an operating system (OS) and one or more software applications that run on the operating system. Additionally, a processing device may access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device may be described as being used; however, those skilled in the art will understand that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors or one processor and one controller. Additionally, other processing configurations, such as parallel processors, are possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The race prediction method described here can be performed by the race prediction server (see 110 in FIG. 1). The race prediction server can be understood as a concept that includes the components and functions of a race prediction system using mutation frequency according to an embodiment of the present invention.

Meanwhile, the race prediction method described below is only one embodiment of the present invention. Various steps may be added as needed, and the following steps may also be performed in a different order, so the present invention is not limited to the steps and the sequence described below.

Referring to FIGS. 1 and 10, in step 1010, the race prediction server 110 may calculate the frequency of occurrence of mutations by race in conjunction with the population genome mutation database 120.

For this purpose, the race prediction server 110 collects, from the population genome mutation database 120, the number of mutations by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race, and can calculate the frequency of occurrence of each mutation by race using the collected number of mutations by race and the total number of people by race.

Specifically, the race prediction server 110 calculates the proportion of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2), and calculates the proportion of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with the heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2). In this way, the frequency of occurrence of mutations by race (homozygote rate and heterozygote rate) can be obtained.
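The two rates described in this step can be sketched as follows. The function name `zygosity_rates`, its parameter names, and the counts in the usage example are hypothetical; only the arithmetic (total people = allele number / 2, heterozygotes = allele count - 2 * number of homozygote) comes from the description.

```python
def zygosity_rates(allele_count, allele_number, n_homozygotes):
    """Return (homozygote rate, heterozygote rate) for one mutation in one race.

    allele_count  : number of times the mutation appears in the race
    allele_number : total number of alleles observed in the race
    n_homozygotes : number of people carrying the mutation on both alleles
    """
    total_people = allele_number / 2
    if total_people <= 0:
        raise ValueError("allele_number must be positive")
    # Each homozygote contributes two alleles; the remaining alleles
    # belong to heterozygotes, one per person.
    n_heterozygotes = allele_count - 2 * n_homozygotes
    if n_heterozygotes < 0:
        raise ValueError("allele_count is smaller than 2 * n_homozygotes")
    return n_homozygotes / total_people, n_heterozygotes / total_people


# Hypothetical counts for illustration only (not taken from gnomAD):
# 2,000 alleles (1,000 people), 300 mutated alleles, 50 homozygotes.
hom_rate, het_rate = zygosity_rates(allele_count=300, allele_number=2000, n_homozygotes=50)
print(hom_rate, het_rate)  # 0.05 0.2
```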

Next, in step 1020, the race prediction server 110 may calculate a score by race of the target using the frequency of occurrence of mutations by race.

For this purpose, the race prediction server 110 stores the homozygote rate and the heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race, and searches the mutation frequency table by race for the mutations of the target (target mutations). Thereafter, the race prediction server 110 may load the values of the homozygote rate and heterozygote rate by race for each of the searched target mutations, and calculate a score by race for the target mutation set consisting of the plurality of target mutations.

Next, in step 1030, the race prediction server 110 may predict the race of the target based on the race score of the target. At this time, the race prediction server 110 may predict the race corresponding to the highest score among the scores for each race of the target as the race of the target.
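Step 1030 amounts to picking the race with the highest score. A minimal sketch, assuming the scores are held in a dictionary keyed by race name; the function name `predict_race` is hypothetical, and the input contains only the two scores quoted in the worked example, whereas a real run would include a score for every race in the table.

```python
def predict_race(scores_by_race):
    """Return the race whose score is highest."""
    if not scores_by_race:
        raise ValueError("no scores to compare")
    # Ties are resolved in favor of the first race encountered;
    # the description does not specify tie handling, so this is an assumption.
    return max(scores_by_race, key=scores_by_race.get)


# The two scores from the worked example in the description.
scores = {"African": 0.0513, "East asian": 0.1398}
predicted = predict_race(scores)
print(predicted)  # East asian
```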

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc., singly or in combination. Program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

As described above, although the embodiments have been described with limited examples and drawings, various modifications and variations can be made by those skilled in the art from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described method, and/or components of the described system, structure, device, circuit, etc. are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A mutation frequency calculation unit that calculates the frequency of mutations by race in conjunction with a population genome mutation database;

a racial score calculation unit that calculates a score by race of the target using the frequency of occurrence of mutations by race; and

A race prediction unit that predicts the race of the target based on the race score of the target

A race prediction system using the frequency of mutation occurrence, characterized in that it includes.
According to claim 1,

The mutation frequency calculation unit

A race prediction system using mutation frequency, characterized in that the number of mutations by race, which indicates the number of times each mutation appears in a specific race, and the total number of people by race are collected from the population genome mutation database, and the frequency of occurrence of each mutation by race is calculated using the collected number of mutations by race and the total number of people by race.
According to claim 2,

The mutation frequency calculation unit

Using the number of people with a homozygote mutation (number of homozygote) and the total number of people (allele number / 2), calculates the proportion of people with the homozygote mutation among all people (homozygote rate), and using the number of people with a heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2), calculates the proportion of people with the heterozygote mutation among all people (heterozygote rate),

The frequency of mutations by race is

A race prediction system using mutation frequency, comprising the homozygote rate and the heterozygote rate.
According to claim 3,

The mutation frequency calculation unit

A race prediction system using mutation frequency, characterized in that, when calculating the homozygote rate and the heterozygote rate, in order to select discriminating mutations, mutations for which the total number of people (allele number / 2) is at least 1,000 are selected and, considering that a mutation that is too rare may affect the calculation, mutations whose allele count ratio across all races is more than 5% and less than 95% are selected as the target mutations for calculating the homozygote rate and the heterozygote rate.
According to claim 3,

A mutation frequency table construction unit that stores the homozygote rate and the heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race.

It further includes,

The score calculation unit for each race is

A race prediction system using mutation frequency, characterized in that mutations of the target (target mutations) are searched for in the mutation frequency table by race, the values of the homozygote rate and heterozygote rate by race are loaded for each of the searched target mutations, and a score by race is calculated for a target mutation set consisting of the plurality of target mutations.
According to claim 5,

The score calculation unit for each race is

A race prediction system using mutation frequency, characterized in that the score by race of the target is calculated, using the values of the homozygote rate and heterozygote rate by race for each of the target mutations, based on the conditional probability that the target mutation set appears in a specific race.
According to claim 6,

The score calculation unit for each race is

A race prediction system using mutation frequency, characterized in that the score by race is calculated using Equation 1 below.

[Equation 1]

Here, V represents the set of target variants (v_1, v_2, ..., v_n), E is the race, n is the number of target variants, and Pr(Vn|E) is the probability that target variant v_n appears in a particular race.
According to claim 1,

The race prediction unit

A race prediction system using mutation frequency, characterized in that the race corresponding to the highest score among the race scores of the target is predicted as the target's race.
In the race prediction method using the mutation frequency of the race prediction server,

Calculating the frequency of occurrence of mutations by race by linking the race prediction server with a population genome mutation database;

The race prediction server calculating a score by race of the target using the frequency of occurrence of mutations by race; and

The race prediction server predicting the race of the target based on the score by race of the target.

A racial prediction method using the frequency of mutation occurrence, comprising:
According to claim 9,

The step of calculating the frequency of mutations by race is

Collecting the number of mutations by race, which represents the number of times each mutation appears in a specific race, and the total number of people by race from the population genome mutation database; and

Calculating the frequency of occurrence of each mutation by race using the collected number of mutations by race and the total number of people by race.

A racial prediction method using the frequency of mutation occurrence, comprising:
According to claim 10,

The step of calculating the frequency of occurrence of each mutation by race is

Calculating the ratio of people with the homozygote mutation among all people (homozygote rate) using the number of people with the homozygote mutation (number of homozygote) and the total number of people (allele number / 2); and

Calculating the proportion of people with the heterozygote mutation among all people (heterozygote rate) using the number of people with the heterozygote mutation (allele count - 2 * number of homozygote) and the total number of people (allele number / 2),

The frequency of mutations by race is

A method for predicting race using mutation frequency, comprising the homozygote rate and the heterozygote rate.
According to claim 11,

The race prediction server storing the homozygote rate and the heterozygote rate calculated for each mutation in a table by race to generate a mutation frequency table by race.

It further includes,

The step of calculating the score by race of the target is

Searching for mutations of the target (target mutations) in the mutation frequency table by race; and

Loading the values of homozygote rate and heterozygote rate by race for each of the searched target mutations, and calculating a score by race for the target mutation set consisting of the plurality of target mutations.

A racial prediction method using the frequency of mutation occurrence, comprising: