CN114783527A - Method for constructing a haplotype ancestry database covering multiple races

Info

Publication number: CN114783527A
Application number: CN202210564500.2A
Authority: CN
Inventors: 宋清; 马丽
Original assignee: Guangzhou Hongxi Jianshan Technology Co., Ltd.
Current assignee: Song Qing
Priority date: 2022-05-23
Filing date: 2022-05-23
Publication date: 2022-07-22
Anticipated expiration: 2042-05-23
Also published as: CN114783527B

Abstract

The invention discloses a method for constructing an ancestry database of haplotypes from multiple races, comprising the following steps: extracting information from the haplotype sequences of each race using extraction frames of different sizes, and labeling each extracted fragment with its race; comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites; and comparing fragments that cover the same SNP sites across races, identifying fragments that have the same SNP sites and the same bases at those sites, and labeling each such fragment with all of the races in which it occurs. The advantages of the invention are as follows: the genetic information of a given haplotype is stored as fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and every fragment is labeled with all of the races that share its SNP sites and bases, which avoids the loss of accuracy that incomplete ancestry information would cause when searching for disease-associated genes.

Description

Method for constructing a haplotype ancestry database covering multiple races

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to an SNP-based technique for organizing ancestry data.

Background

At the level of the human genome, the majority of human genetic variation consists of single nucleotide polymorphisms (SNPs). On average, one SNP site occurs about every 1000 bp of the human genome, and SNPs are widely distributed in both non-coding and coding regions. Individuals of different races carry different SNPs. Because humans have undertaken long-distance migrations many times from ancient times to the present, admixture among descendants is common, so the genome of a single individual may contain genetic information from several different races. It is therefore unscientific to determine an individual's race from appearance alone, such as skin color.

SNPs are associated not only with differences in traits such as height, skin color and body type, but also with the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and so on. Analyzing the genetic information of an individual or a specific group requires knowing which race the genes of that individual or group originate from, that is, their ancestry information; only then can the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and the like be analyzed accurately. The genetic information in an ancestry database therefore needs to be comprehensive and correctly classified. Existing ancestry databases record a large amount of race-related ancestry information accumulated in the course of biomedical research, but the same genetic information may be shared by several ancestral origins, and the existing classification is not necessarily accurate: a piece of genetic information may be labeled as present in some ancestral origins when it is in fact also present in others. The genetic information of the overlooked ancestral origins is then analyzed inadequately, which holds back progress in genetics and biomedicine. If SNP analysis can be used to refine ancestry classification, the accuracy of tracing the ancestry of target genes or of an individual's genetic information will improve, which will help guide the analysis of associations between SNPs and genetic diseases, as well as molecular diagnosis, precision medicine, pharmacy and personalized medication.

Disclosure of Invention

The invention aims to provide a method for constructing an ancestry database of haplotypes of each race, so as to solve the problems in the prior art that, because database information is incomplete, the genetic information in the haplotype of a sample under test cannot be classified correctly and its ancestry cannot be traced accurately.

To achieve this purpose, the invention adopts the following technical scheme:

The method for constructing an ancestry database of haplotypes of each race comprises the following steps:

(1) collecting whole-genome data of each race, and taking a single haplotype sequence as the sample unit;

(2) setting an extraction frame, wherein the extraction frame moves from one end of a haplotype sequence to the other and extracts the SNP information of the fragment inside the frame at each position; each fragment is labeled with its race, and the fragments of each race are temporarily stored in order of the SNP site closest to the 5' end or the 3' end of each fragment, until the SNP information of every haplotype sequence of every race has been extracted;

(3) comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites;

(4) comparing fragments that cover the same SNP sites across races, identifying fragments that have the same SNP sites and the same bases at those sites, and labeling each such fragment with all of the races in which it occurs.
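Steps (2) to (4) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: a haplotype is modelled as an ordered list of (SNP id, base) pairs, a single frame size is used, and the names `extract_fragments`, `build_database`, `race_A` and `race_B` are hypothetical.

```python
from collections import defaultdict

# Assumption: a haplotype is an ordered list of (snp_id, base) pairs, 5' to 3'.
# All names and toy data below are hypothetical and only illustrate the steps.

def extract_fragments(haplotype, frame_size):
    """Step (2): slide a frame of `frame_size` SNPs along the haplotype, one SNP per move."""
    return [tuple(haplotype[i:i + frame_size])
            for i in range(len(haplotype) - frame_size + 1)]

def build_database(haplotypes_by_race, frame_size):
    # Steps (2)-(3): collecting the fragments of one race in a set merges
    # fragments that have the same SNP sites and the same bases at those sites.
    per_race = {
        race: {frag for hap in haps for frag in extract_fragments(hap, frame_size)}
        for race, haps in haplotypes_by_race.items()
    }
    # Step (4): label every fragment with all races in which it occurs.
    database = defaultdict(set)
    for race, fragments in per_race.items():
        for frag in fragments:
            database[frag].add(race)
    return dict(database)

haplotypes_by_race = {
    "race_A": [[("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "T")],
               [("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "A")]],
    "race_B": [[("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "C")]],
}
db = build_database(haplotypes_by_race, frame_size=3)
shared = (("rs1", "A"), ("rs2", "C"), ("rs3", "G"))
print(sorted(db[shared]))  # races that share this fragment
```

In this toy data the fragment covering rs1 to rs3 occurs in both races and is therefore labeled with both, while the fragments ending at rs4 differ in base and each keep a single race label.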

Further, the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;

the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end.
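The two directions of movement can be illustrated with a small generator. This is a sketch under assumptions: the haplotype is an ordered list of (SNP id, base) pairs written 5' to 3', the function name `slide` and the direction labels are hypothetical, and the anchor SNP is taken at the end from which the frame starts (5' end for 5'-to-3' movement, 3' end for 3'-to-5' movement), which is the pairing used in the examples of this description.

```python
def slide(haplotype, frame_size, direction="5to3"):
    """Yield (anchor_snp_id, fragment) pairs, moving the frame one SNP per step.

    Assumption: `haplotype` is an ordered list of (snp_id, base) pairs, 5' to 3'.
    The anchor is the SNP of the fragment closest to the end the frame starts from.
    """
    if direction not in ("5to3", "3to5"):
        raise ValueError("direction must be '5to3' or '3to5'")
    starts = range(len(haplotype) - frame_size + 1)
    if direction == "3to5":
        starts = reversed(starts)
    for i in starts:
        fragment = tuple(haplotype[i:i + frame_size])
        anchor = fragment[0][0] if direction == "5to3" else fragment[-1][0]
        yield anchor, fragment

hap = [("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "T")]
forward = list(slide(hap, 3, "5to3"))
backward = list(slide(hap, 3, "3to5"))
print([anchor for anchor, _ in forward])   # anchors in 5'-to-3' order
print([anchor for anchor, _ in backward])  # anchors in 3'-to-5' order
```

Both directions produce the same set of fragments; only the visiting order and the anchor SNP used for temporary ordering differ.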

Further, the extraction frame is sized to extract 10 to 200 consecutive SNPs.
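As a worked count, assuming the frame advances exactly one SNP per move and only full frames are kept, the number of fragments obtained from one haplotype follows directly from the frame size. The symbols n (number of SNP sites in the haplotype), w (frame size in SNPs) and N (number of fragments) are introduced here for illustration and do not appear in the original text.

```latex
\[
N(n, w) = \max(0,\; n - w + 1), \qquad 10 \le w \le 200 .
\]
```

For example, a haplotype with n = 1000 SNP sites gives 981 fragments with a 20-SNP frame and 801 fragments with a 200-SNP frame.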

Further, in step (2), two or more extraction frames of different sizes are moved along the same haplotype sequence to extract its SNP information, until every extraction frame has extracted the SNP information of every haplotype sequence of every race.

Further, the two or more extraction frames of different sizes are moved simultaneously or in batches to extract the SNP information of the same haplotype sequence.

Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, an extraction frame that extracts 22 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.

Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
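Using several frame sizes on the same haplotype can be sketched as follows. The sketch is hypothetical in its details: the function name `extract_multi`, the toy haplotype of 200 SNPs with ids `snp0` to `snp199`, and the choice to run the frames one after another (one possible reading of moving "in batches") are all assumptions.

```python
def extract_multi(haplotype, frame_sizes):
    """Return {frame_size: [fragment, ...]} for one haplotype.

    Assumption: the frames are applied one after another; running them
    simultaneously would produce the same fragments.
    """
    result = {}
    for size in sorted(set(frame_sizes)):
        result[size] = [tuple(haplotype[i:i + size])
                        for i in range(len(haplotype) - size + 1)]
    return result

# Toy haplotype of 200 SNP sites with hypothetical ids snp0..snp199.
hap = [(f"snp{i}", "ACGT"[i % 4]) for i in range(200)]
sizes = (20, 50, 80, 120, 160, 200)
fragments = extract_multi(hap, sizes)
counts = {size: len(frags) for size, frags in fragments.items()}
print(counts)
```

The same haplotype is thereby stored as short fragments (many per haplotype) and long fragments (few per haplotype), so a haplotype under test can be compared at several resolutions.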

Further, the steps also include:

(5) within the same race, fragments whose SNP site closest to the 5' end or the 3' end is the same are classified into the same group.

Further, the steps also include:

(6) within the same race, fragments that cover the same SNP sites are classified into the same group.
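The two grouping steps can be sketched with dictionaries keyed by the anchor SNP (step (5)) or by the full tuple of SNP sites (step (6)). The function names, the `end` parameter and the toy fragments are hypothetical; fragments are assumed to be tuples of (SNP id, base) pairs ordered 5' to 3'.

```python
from collections import defaultdict

def group_by_anchor(fragments, end="5"):
    """Step (5): group fragments by the SNP site closest to the chosen end."""
    groups = defaultdict(list)
    for frag in fragments:
        anchor = frag[0][0] if end == "5" else frag[-1][0]
        groups[anchor].append(frag)
    return dict(groups)

def group_by_sites(fragments):
    """Step (6): group fragments that cover exactly the same SNP sites."""
    groups = defaultdict(list)
    for frag in fragments:
        sites = tuple(snp_id for snp_id, _ in frag)
        groups[sites].append(frag)
    return dict(groups)

# Hypothetical fragments of one race (assumed layout: 5' to 3').
race_fragments = [
    (("rs1", "A"), ("rs2", "C")),
    (("rs1", "G"), ("rs2", "C")),
    (("rs1", "A"), ("rs2", "C"), ("rs3", "T")),
    (("rs2", "C"), ("rs3", "T")),
]
by_anchor = group_by_anchor(race_fragments, end="5")
by_sites = group_by_sites(race_fragments)
print({anchor: len(frags) for anchor, frags in by_anchor.items()})
print({sites: len(frags) for sites, frags in by_sites.items()})
```

Grouping by anchor collects fragments of different lengths that start at the same SNP, whereas grouping by sites collects only the alternative base combinations observed over an identical set of SNP sites.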

Further, in step (1), the whole-genome data of each race are collected from the HapMap project, the international 1000 Genomes Project, and Qiyunnade.

The advantages of the invention include the following: in the constructed database, the genetic information of a given haplotype is stored as fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and every fragment is labeled with all of the races that share its SNP sites and bases, which avoids the loss of accuracy that incomplete ancestry information would cause when searching for disease-associated genes and reduces the constraints on the development of molecular diagnosis, precision medicine, pharmacy and personalized medication.

Detailed Description

The invention is described in detail below with reference to specific embodiments, which illustrate the invention but do not limit it.

Example one

(1) collecting whole-genome data of each race from databases containing genome data of the races, such as the HapMap project, the international 1000 Genomes Project and Qiyunnade, and taking a single haplotype sequence as the sample unit;

(2) setting extraction frames that extract 20, 21, 22 and 200 consecutive SNPs respectively, wherein each extraction frame moves from the 5' end to the 3' end of the haplotype sequence one SNP at a time, extracts the SNP information of the fragment inside the frame, and labels each fragment with its race; the extraction frames may be moved simultaneously or in batches; the fragments of each race are temporarily stored in order of the SNP site closest to the 5' end of each fragment, until every extraction frame has extracted the SNP information of every haplotype sequence of every race;

(3) comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites, which avoids the redundancy of storing duplicates;

(4) comparing fragments that cover the same SNP sites across races, identifying fragments that have the same SNP sites and the same bases at those sites, and labeling each such fragment with all of the races in which it occurs;

(5) within the same race, classifying fragments whose SNP site closest to the 5' end is the same into the same group.

Example two

The difference from Example one is that, in step (2), the extraction frame moves from the 3' end to the 5' end of the haplotype sequence, and the fragments of each race are temporarily stored in order of the SNP site closest to the 3' end of each fragment; in step (5), within the same race, fragments whose SNP site closest to the 3' end is the same are classified into the same group.

Example three

The difference from Example one is that the extraction frames in step (2) are set as extraction frames that extract 20, 50, 80, 120, 160 and 200 consecutive SNPs respectively.

Example four

The difference from Example two is that the extraction frames in step (2) are set as extraction frames that extract 20, 50, 80, 120, 160 and 200 consecutive SNPs respectively.

The database constructed by the invention contains comprehensive data and can provide a basis for accurately determining the ancestry of admixed individuals. Because the genetic information of a given haplotype is stored as fragments containing different numbers of SNP sites, the database facilitates associating genes with trait analysis, disease analysis, the protection of individuals against severe progression of certain diseases, and the like. It avoids misdirecting the analysis of a gene because its ancestry was not correctly identified, and it actively promotes the development of molecular diagnosis, precision medicine, pharmacy and personalized medication.

The technical solutions provided by the embodiments of the invention have been described in detail above, and specific examples have been used to explain the principles and implementation of the invention; the description of the embodiments is intended only to aid understanding of those principles. A person skilled in the art may, on the basis of these embodiments, vary the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the invention.
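How a haplotype under test might be compared against such a database is not specified in this description; the following is a purely hypothetical lookup sketch. It assumes the database is a mapping from fragments (tuples of (SNP id, base) pairs) to sets of race labels, and the name `lookup_ancestry` and the toy data are illustrative only.

```python
def lookup_ancestry(test_haplotype, database, frame_size):
    """Return [(fragment, races_or_None)] for each frame position of the test haplotype.

    Assumption: `database` maps fragments to the set of races labeled on them.
    """
    hits = []
    for i in range(len(test_haplotype) - frame_size + 1):
        frag = tuple(test_haplotype[i:i + frame_size])
        hits.append((frag, database.get(frag)))
    return hits

# Hypothetical database content and test haplotype.
database = {
    (("rs1", "A"), ("rs2", "C")): {"race_A", "race_B"},
    (("rs2", "C"), ("rs3", "G")): {"race_A"},
}
test_hap = [("rs1", "A"), ("rs2", "C"), ("rs3", "T")]
hits = lookup_ancestry(test_hap, database, frame_size=2)
for frag, races in hits:
    print(frag, sorted(races) if races else "no match")
```

Because a matching fragment carries every race in which it was observed, a hit reports all candidate origins rather than a single, possibly incomplete, label.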

Claims

1. A method for constructing an ancestry database of haplotypes of each race, characterized in that:

the method comprises the following steps:

(1) collecting the complete genome data of each race, and taking a single haplotype sequence as a sample unit;

2. The method of claim 1, wherein:

the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;

3. The method of claim 1, wherein:

the extraction frame is sized to extract 10 to 200 consecutive SNPs.

4. The method for constructing an ancestry database of haplotypes of each race according to any one of claims 1-3, wherein:

in step (2), two or more extraction frames of different sizes are moved along the same haplotype sequence to extract its SNP information, until every extraction frame has extracted the SNP information of every haplotype sequence of every race.

5. The method of claim 4, wherein:

the two or more extraction frames of different sizes are moved simultaneously or in batches to extract the SNP information of the same haplotype sequence.

6. The method of claim 4, wherein:

the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, an extraction frame that extracts 22 consecutive SNPs.

7. The method of claim 4, wherein:

the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.

8. The method for constructing an ancestry database of haplotypes of each race according to claim 1, 6 or 7, wherein:

the method also comprises the following steps:

(5) within the same race, fragments whose SNP site closest to the 5' end or the 3' end is the same are classified into the same group.

9. The method of claim 8, wherein the method further comprises:

the method also comprises the following steps:

10. The method of claim 1, wherein:

in step (1), the whole-genome data of each race are collected from the HapMap project, the international 1000 Genomes Project, and Qiyunnade.