Method for constructing an ancestry database of haplotypes of various human races
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to an SNP-based technique for organizing ancestry data.
Background
At the level of the human genome, most genetic variation takes the form of single nucleotide polymorphisms (SNPs). On average there is about one SNP site per 1,000 bp of the human genome, and SNPs occur widely in both non-coding and coding regions. Individuals of different races carry different SNPs. Because humans have undertaken long-distance migrations many times from ancient times to the present, and admixture among their descendants is common, the genome of a single individual may contain genetic information from several different races. It is therefore unscientific to determine which race an individual derives from on the basis of appearance alone, such as skin color.
SNPs are associated not only with differences in traits such as height, skin color and body type, but also with the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and the like. Analyzing the genetic information of an individual or a specific group requires knowing which race the genes of the target individual or group originate from, that is, knowing their ancestry, so that the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and the like can be analyzed accurately. This in turn requires that the genetic information in the ancestry database be comprehensive and correctly classified. Existing ancestry databases record a large amount of race-related ancestry information accumulated in the course of biomedical development, but the same genetic information may be shared by several ancestries, and the existing classification is not necessarily accurate: a piece of genetic information may be marked as present in some ancestries when it is also present in others. As a result, the genetic information of the overlooked ancestries is analyzed inadequately, which constrains the development of genetics and biomedicine. If SNP analysis can be used to refine the ancestry classification, the accuracy of tracing the ancestry of a target gene or of an individual's genetic information will improve, which would help guide analysis of the association between SNPs and genetic diseases, as well as molecular diagnosis, precision medicine, pharmacy and personalized medication.
Disclosure of Invention
The invention aims to provide a method for constructing an ancestry database of the haplotypes of each race, so as to solve the problem in the prior art that, because database information is incomplete, the genetic information in the haplotype of a sample under test cannot be classified correctly and its ancestry cannot be traced accurately.
To achieve this aim, the invention adopts the following technical solution:
A method for constructing an ancestry database of the haplotypes of various races comprises the following steps:
(1) collecting whole-genome data for each race, and taking a single haplotype sequence as the sample unit;
(2) setting an extraction frame that moves from one end of a haplotype sequence to the other and extracts the SNP information of the fragment inside the frame; marking each fragment with the corresponding race information; and, within each race, temporarily storing the fragments in order of the SNP site of each fragment that is closest to the 5' end or the 3' end, until the SNP information of every haplotype sequence of every race has been extracted;
(3) within each race, comparing fragments that have the same SNP sites, and merging fragments that have the same SNP sites and the same bases at those sites;
(4) across races, comparing fragments that have the same SNP sites, finding the fragments that have the same SNP sites and the same bases at those sites, and marking each such fragment with all of the races to which it corresponds.
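By way of illustration only, steps (3) and (4) can be sketched in Python as follows. The sketch is not the implementation of the invention: the representation of a fragment as a tuple of (SNP site, base) pairs, the function names `merge_within_race` and `label_across_races`, and the site and race labels are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical representation: a fragment is a tuple of (snp_site, base) pairs,
# e.g. (("rs1", "A"), ("rs2", "G")). Site IDs and race labels are invented.


def merge_within_race(fragments):
    """Step (3): collapse fragments that have identical sites and bases."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(fragments))


def label_across_races(fragments_by_race):
    """Step (4): map each distinct fragment to every race in which it occurs."""
    races_of = defaultdict(set)
    for race, fragments in fragments_by_race.items():
        for fragment in merge_within_race(fragments):
            races_of[fragment].add(race)
    return dict(races_of)


if __name__ == "__main__":
    shared = (("rs1", "A"), ("rs2", "G"))
    only_x = (("rs1", "A"), ("rs2", "T"))
    database = label_across_races({
        "race_X": [shared, only_x, shared],  # duplicate is merged in step (3)
        "race_Y": [shared],
    })
    for fragment, races in database.items():
        print(fragment, sorted(races))
```

Because two fragments compare equal only when both their sites and their bases match, a fragment found in several races ends up with all of those races attached to it rather than only the first one encountered.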
Further, the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end.
Further, the size of the extraction frame is such that it extracts 10 to 200 consecutive SNPs.
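The SNP-by-SNP movement of a single extraction frame can be sketched as follows. This is a minimal illustration under stated assumptions: a haplotype is modeled as a list of (site position, base) pairs ordered from the 5' end to the 3' end, and the function name `extract_fragments`, the dictionary keys and the toy data are hypothetical. A frame of 3 SNPs is used only to keep the example readable; the frame described above holds 10 to 200 SNPs.

```python
def extract_fragments(haplotype, frame_size, race):
    """Slide a frame holding `frame_size` consecutive SNPs from the 5' end
    to the 3' end of one haplotype, advancing one SNP per step."""
    if frame_size < 1:
        raise ValueError("frame_size must be a positive integer")
    fragments = []
    # A haplotype with n SNPs yields n - frame_size + 1 fragments
    # (none at all when n < frame_size).
    for start in range(len(haplotype) - frame_size + 1):
        snps = tuple(haplotype[start:start + frame_size])
        fragments.append({
            "race": race,               # race label attached to every fragment
            "first_site": snps[0][0],   # SNP site closest to the 5' end
            "snps": snps,
        })
    return fragments


if __name__ == "__main__":
    # Toy haplotype: positions and bases are invented.
    toy = [(101, "A"), (250, "G"), (980, "T"), (1500, "C"), (2210, "A")]
    for fragment in extract_fragments(toy, frame_size=3, race="race_X"):
        bases = "".join(base for _, base in fragment["snps"])
        print(fragment["first_site"], bases)
```

Since the frame starts at the 5' end and advances one SNP per step, the fragments are produced already ordered by their 5'-most SNP site, which matches the storage order described in step (2).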
Further, in step (2), two or more extraction frames of different sizes are moved over the same haplotype sequence to extract its SNP information, until the SNP information of every haplotype sequence of every race has been extracted by every extraction frame.
Further, the two or more extraction frames of different sizes extract the SNP information of the same haplotype sequence either simultaneously or in batches.
Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, an extraction frame that extracts 22 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
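To give a sense of what running several frame sizes over one haplotype produces, the following sketch counts and extracts fragments for the two sets of frame sizes listed above. It is an illustration only: the haplotype is simplified to a string of the bases at consecutive SNP sites, the haplotype length of 1,000 SNPs is an arbitrary assumption, and the names `fragment_counts` and `extract_all_sizes` are hypothetical. The frames are run one after another, which corresponds to extraction in batches.

```python
FRAME_SIZES_A = (20, 21, 22, 200)
FRAME_SIZES_B = (20, 50, 80, 120, 160, 200)


def fragment_counts(n_snps, frame_sizes):
    """Number of fragments each frame yields when moved SNP by SNP
    over a haplotype containing n_snps SNP sites."""
    return {size: max(n_snps - size + 1, 0) for size in frame_sizes}


def extract_all_sizes(bases, frame_sizes):
    """Run every frame over the same haplotype, one frame size after another."""
    return {
        size: [bases[i:i + size] for i in range(len(bases) - size + 1)]
        for size in frame_sizes
    }


if __name__ == "__main__":
    print(fragment_counts(1000, FRAME_SIZES_A))
    print(fragment_counts(1000, FRAME_SIZES_B))

    toy_bases = "ACGTTGCAAG" * 30  # invented haplotype with 300 SNP sites
    by_size = extract_all_sizes(toy_bases, FRAME_SIZES_B)
    for size, fragments in by_size.items():
        print(size, len(fragments))
```

Each additional frame size adds roughly one fragment per SNP site, so the stored volume grows with the number of sizes chosen; the same haplotype is then represented at several fragment lengths.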
Further, the steps also include:
(5) within each race, classifying fragments whose SNP site closest to the 5' end or the 3' end is the same into the same group.
Further, the steps also include:
(6) within each race, classifying fragments that have the same SNP sites into the same group.
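The two groupings in steps (5) and (6) can be sketched as follows, here for the 5' case. The sketch assumes that a fragment is a tuple of (site position, base) pairs ordered from the 5' end to the 3' end and that all fragments passed in belong to one race; the function names and toy fragments are hypothetical.

```python
from collections import defaultdict


def group_by_first_site(fragments):
    """Step (5): bucket fragments by their SNP site closest to the 5' end."""
    groups = defaultdict(list)
    for fragment in fragments:
        groups[fragment[0][0]].append(fragment)
    return dict(groups)


def group_by_site_set(fragments):
    """Step (6): bucket fragments that cover exactly the same SNP sites."""
    groups = defaultdict(list)
    for fragment in fragments:
        sites = tuple(site for site, _ in fragment)
        groups[sites].append(fragment)
    return dict(groups)


if __name__ == "__main__":
    # Invented fragments from a single race.
    f1 = ((101, "A"), (250, "G"))
    f2 = ((101, "A"), (250, "T"))
    f3 = ((101, "A"), (250, "G"), (980, "T"))
    f4 = ((250, "G"), (980, "T"))
    fragments = [f1, f2, f3, f4]

    for site, members in group_by_first_site(fragments).items():
        print("first site", site, "->", len(members), "fragments")
    for sites, members in group_by_site_set(fragments).items():
        print("sites", sites, "->", len(members), "fragments")
```

Grouping by the first site gathers fragments of every length that begin at one position, whereas grouping by the full set of sites gathers only fragments that differ in their bases, which is the set of candidates a fragment under test would be compared against.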
Further, in step (1), the whole-genome data of each race are collected from the HapMap Project, the international 1000 Genomes Project, and Qiyunnade.
The advantages of the invention include the following. In the constructed database, the genetic information of a single haplotype is stored as fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test. Fragments that have the same SNP sites and the same bases at those sites are marked with all of the corresponding races, which avoids the loss of accuracy caused by incomplete ancestry information when searching for disease-related genes and reduces the resulting constraints on the development of molecular diagnosis, precision medicine, pharmacy and individualized medication.
Detailed Description
The invention is described in detail below with reference to specific embodiments, which illustrate the invention and do not limit it.
Example one
A method for constructing an ancestry database of the haplotypes of various races comprises the following steps:
(1) collecting the whole-genome data of each race from databases that contain genome data of the races, such as the HapMap Project, the international 1000 Genomes Project, Qiyunnade and the like, and taking a single haplotype sequence as the sample unit;
(2) setting extraction frames that extract 20, 21, 22 and 200 consecutive SNPs respectively; each extraction frame moves SNP by SNP from the 5' end to the 3' end of the haplotype sequence, extracts the SNP information of the fragment inside the frame, and marks each fragment with the corresponding race information; the extraction frames may move and extract simultaneously or in batches; within each race, the fragments are temporarily stored in order of the SNP site of each fragment closest to the 5' end, until every extraction frame has moved over and extracted the SNP information of every haplotype sequence of every race;
(3) within each race, comparing fragments that have the same SNP sites, and merging fragments that have the same SNP sites and the same bases at those sites, so as to avoid the redundancy of storing the same fragment repeatedly;
(4) across races, comparing fragments that have the same SNP sites, finding the fragments that have the same SNP sites and the same bases at those sites, and marking each such fragment with all of the races to which it corresponds;
(5) within each race, classifying fragments whose SNP site closest to the 5' end is the same into the same group;
(6) within each race, classifying fragments that have the same SNP sites into the same group.
Example two
This example differs from Example one in that, in step (2), the extraction frame moves from the 3' end to the 5' end of the haplotype sequence, and within each race the fragments are temporarily stored in order of the SNP site of each fragment closest to the 3' end; and in step (5), within each race, fragments whose SNP site closest to the 3' end is the same are classified into the same group.
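One way to read the relationship between the two directions of movement is sketched below. The sketch assumes that the SNPs inside a fragment are always recorded in 5'-to-3' order regardless of the direction in which the frame moves, which the text does not state explicitly; the names `extract` and `last_site` and the toy data are hypothetical. Under that assumption both directions yield the same fragments, and only the order of storage and the site used as the grouping key differ.

```python
def extract(haplotype, frame_size, five_to_three=True):
    """Move one frame SNP by SNP in the chosen direction. SNPs inside a
    fragment stay in 5'-to-3' order (an assumption of this sketch)."""
    starts = range(len(haplotype) - frame_size + 1)
    if not five_to_three:
        starts = reversed(starts)  # begin with the frame at the 3' end
    return [tuple(haplotype[s:s + frame_size]) for s in starts]


def last_site(fragment):
    """SNP site of a fragment closest to the 3' end."""
    return fragment[-1][0]


if __name__ == "__main__":
    # Toy haplotype: positions and bases are invented.
    toy = [(101, "A"), (250, "G"), (980, "T"), (1500, "C"), (2210, "A")]
    forward = extract(toy, 3, five_to_three=True)
    backward = extract(toy, 3, five_to_three=False)

    print("same fragments:", set(forward) == set(backward))
    print("3'-most sites in storage order:", [last_site(f) for f in backward])
```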
Example three
This example differs from Example one in that the extraction frames in step (2) are set as frames that extract 20, 50, 80, 120, 160 and 200 consecutive SNPs respectively.
Example four
This example differs from Example two in that the extraction frames in step (2) are set as frames that extract 20, 50, 80, 120, 160 and 200 consecutive SNPs respectively.
The database constructed by the invention contains comprehensive data and can provide a basis for accurately tracing the ancestry of admixed individuals. Because the genetic information of a single haplotype is stored as fragments containing different numbers of SNP sites, the database facilitates associating genes with trait analysis, disease analysis, effective protection of individuals against severe progression of certain diseases, and the like; it avoids misdirecting the analysis of a gene because its ancestry was not correctly identified; and it actively promotes the development of molecular diagnosis, precision medicine, pharmacy and individualized medication.
The technical solutions provided by the embodiments of the invention have been described in detail above, and specific examples have been used to explain the principles and implementation of the invention; the description of the embodiments is intended only to aid understanding of those principles. A person skilled in the art may, in light of the embodiments, vary the specific implementation and the scope of application. In conclusion, the content of this specification should not be construed as limiting the invention.