Method for constructing a haplotype ancestry database for various human races
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to an SNP-based technique for organizing ancestry data.
Background
At the level of the human genome, most human genetic variation consists of single nucleotide polymorphisms (SNPs). On average there is about one SNP site per 1000 bp of the human genome, and SNPs occur widely in both coding and non-coding regions. Individuals of different ethnic groups carry different SNPs. Because humans have undertaken long-distance migrations many times from ancient times to the present, and admixture among their descendants is common, the genome of a single individual may contain genetic information from several different ethnic groups. It is therefore unscientific to determine an individual's ethnic origin from appearance alone, such as skin color.
Besides differences in height, skin color, body shape and the like, the probability of suffering from certain genetic diseases and the level of immunity against certain diseases are also associated with SNPs. Analyzing the genetic information of an individual or a specific group requires knowing which race the genes of the target individual or group originate from; with ancestry information, the probability that an individual or group will suffer from certain genetic diseases, the level of resistance of the immune system to certain diseases, and the like can be analyzed more accurately. The genetic information in an ancestry database therefore needs to be comprehensive and correctly classified. Existing ancestry databases record race-related ancestry information obtained in the course of biomedical research, but the same genetic information may be shared by several ancestries, so the existing classification is not necessarily accurate: a piece of genetic information may be labeled with only some of its ancestries while the others are ignored. The genetic information of the ignored ancestries is then poorly analyzed, which restricts progress in genetic research and biomedical development. If SNP analysis can be used to refine ancestry classification, the accuracy of tracing the ancestry of a target gene or of an individual's genetic information can be improved, which would guide SNP and genetic disease association analysis, molecular diagnosis, precision medicine, pharmacy, and personalized medication.
Disclosure of Invention
The invention aims to provide a method for constructing a haplotype ancestry database for various human races, in order to solve the problem in the prior art that database information is not comprehensive enough, so that the genetic information in the haplotypes of a sample under test cannot be classified accurately and its ancestry cannot be traced accurately.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The method for constructing a haplotype ancestry database for various human races comprises the following steps:
(1) Collecting genome data of various races, and taking a single haplotype sequence as the sample unit;
(2) Setting an extraction box that moves from one end of a haplotype sequence to the other, extracting the SNP information of the fragment inside the box at each position and labeling each fragment with its corresponding race information; within the same race, the fragments are temporarily stored in order of the SNP site of each fragment nearest the 5' end or the 3' end, until the SNP information of every haplotype sequence of every race has been extracted;
(3) Within the same race, comparing fragments that cover the same SNP sites, and merging fragments that have the same SNP sites and the same bases at those sites;
(4) Across races, comparing fragments that cover the same SNP sites, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all of the corresponding race information.
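Steps (3) and (4) can be illustrated with a minimal Python sketch. The data layout is an assumption made for illustration only and is not specified by the invention: each fragment is a hypothetical `(race, sites, bases)` tuple, where `sites` is a tuple of SNP identifiers and `bases` is the string of bases observed at those sites. The race names and SNP identifiers are placeholders.

```python
from collections import defaultdict

# Hypothetical toy input: fragments already extracted in step (2).
fragments = [
    ("race_A", ("rs1", "rs2", "rs3"), "ACG"),
    ("race_A", ("rs1", "rs2", "rs3"), "ACG"),  # duplicate within race_A
    ("race_A", ("rs1", "rs2", "rs3"), "ATG"),  # same sites, different bases
    ("race_B", ("rs1", "rs2", "rs3"), "ACG"),  # shared with race_A
]


def merge_within_race(fragments):
    """Step (3): keep one copy of each (sites, bases) fragment per race."""
    return set(fragments)


def label_across_races(merged):
    """Step (4): map each (sites, bases) fragment to every race carrying it."""
    labels = defaultdict(set)
    for race, sites, bases in merged:
        labels[(sites, bases)].add(race)
    return dict(labels)


merged = merge_within_race(fragments)
labels = label_across_races(merged)
```

In this toy input the fragment `ACG` ends up labeled with both races, while `ATG` is labeled with `race_A` only.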
Further, the extraction box moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction box moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end of the haplotype sequence.
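The movement of the extraction box can be sketched as a sliding window that advances one SNP per step. The function name `extract_fragments`, the representation of a haplotype as a list of `(snp_id, base)` pairs ordered 5' to 3', and the `direction` strings are assumptions made for this illustration. The box size of 3 in the usage lines is a toy value; the text sizes the box at 10 to 200 consecutive SNPs.

```python
def extract_fragments(haplotype, box_size, direction="5to3"):
    """Slide a box of `box_size` consecutive SNPs along one haplotype.

    `haplotype` is a list of (snp_id, base) pairs ordered 5' -> 3'.
    The box advances one SNP per step. Returns (anchor, sites, bases)
    tuples in the order visited, where `anchor` is the SNP site of the
    fragment nearest the end from which the box started.
    """
    if direction not in ("5to3", "3to5"):
        raise ValueError("direction must be '5to3' or '3to5'")
    starts = range(0, len(haplotype) - box_size + 1)
    if direction == "3to5":
        starts = reversed(starts)
    fragments = []
    for start in starts:
        window = haplotype[start:start + box_size]
        sites = tuple(snp for snp, _ in window)
        bases = "".join(base for _, base in window)
        anchor = sites[0] if direction == "5to3" else sites[-1]
        fragments.append((anchor, sites, bases))
    return fragments


# Toy haplotype with five SNPs (identifiers are placeholders).
hap = [("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "T"), ("rs5", "A")]
forward = extract_fragments(hap, 3, "5to3")
backward = extract_fragments(hap, 3, "3to5")
```

Both directions yield the same three fragments; only the visiting order and the anchor site used for temporary storage differ.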
Further, the extraction box is sized to be capable of extracting 10-200 consecutive SNPs.
Further, in step (2), two or more extraction boxes of different sizes are moved along the same haplotype sequence to extract its SNP information, until the SNP information of every haplotype sequence of every race has been completely extracted by each extraction box.
Further, the two or more extraction boxes of different sizes extract the SNP information of the same haplotype sequence either simultaneously or in batches.
Further, the two or more extraction boxes of different sizes are selected from: an extraction box capable of extracting 20 consecutive SNPs, an extraction box capable of extracting 21 consecutive SNPs, an extraction box capable of extracting 22 consecutive SNPs.
Further, the two or more extraction boxes of different sizes are selected from: an extraction box capable of extracting 20 consecutive SNPs, an extraction box capable of extracting 50 consecutive SNPs, an extraction box capable of extracting 80 consecutive SNPs, an extraction box capable of extracting 120 consecutive SNPs, an extraction box capable of extracting 160 consecutive SNPs, an extraction box capable of extracting 200 consecutive SNPs.
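To give a sense of how several box sizes multiply the stored fragments, the sketch below extracts one haplotype with each box size in turn. A box of w consecutive SNPs moved one SNP at a time over a haplotype of n SNPs yields n - w + 1 fragments. The function names, the representation of the haplotype as a plain string of bases, and the 1000-SNP synthetic haplotype are assumptions for illustration; the size check reflects the 10 to 200 range stated for the extraction box.

```python
def fragment_count(n_snps, box_size):
    """Fragments produced by one box moved one SNP at a time."""
    return max(0, n_snps - box_size + 1)


def extract_all_sizes(bases, box_sizes):
    """Extract one haplotype with every box size.

    `bases` is a string holding the base at each SNP site, ordered 5' -> 3'.
    Returns {box_size: [(start_index, fragment_bases), ...]}.
    """
    result = {}
    for size in box_sizes:
        if not 10 <= size <= 200:
            raise ValueError("box size must be between 10 and 200 SNPs")
        result[size] = [
            (start, bases[start:start + size])
            for start in range(len(bases) - size + 1)
        ]
    return result


# Synthetic haplotype of 1000 SNP sites (illustrative data only).
haplotype = "ACGT" * 250
by_size = extract_all_sizes(haplotype, (20, 50, 80, 120, 160, 200))
total = sum(len(frags) for frags in by_size.values())
```

For this 1000-SNP haplotype the six box sizes give 981 + 951 + 921 + 881 + 841 + 801 = 5376 fragments before any merging.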
Further, the method also comprises:
(5) Within the same race, fragments whose SNP site nearest the 5' end or the 3' end is the same are classified into the same group.
Further, the method also comprises:
(6) Within the same race, fragments with identical SNP sites are classified into the same group.
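The two groupings of steps (5) and (6) can be sketched as two dictionaries keyed per race. The `(race, sites, bases)` fragment layout, the function name `group_fragments`, and the `anchor_end` values `"5p"` and `"3p"` are assumptions introduced for this illustration.

```python
from collections import defaultdict


def group_fragments(fragments, anchor_end="5p"):
    """Group fragments within each race.

    Step (5): by the SNP site nearest the chosen end ("5p" or "3p").
    Step (6): by the full tuple of SNP sites.
    `fragments` is an iterable of (race, sites, bases) tuples.
    """
    if anchor_end not in ("5p", "3p"):
        raise ValueError("anchor_end must be '5p' or '3p'")
    by_anchor = defaultdict(list)
    by_sites = defaultdict(list)
    for race, sites, bases in fragments:
        anchor = sites[0] if anchor_end == "5p" else sites[-1]
        by_anchor[(race, anchor)].append((sites, bases))
        by_sites[(race, sites)].append(bases)
    return dict(by_anchor), dict(by_sites)


# Toy fragments (race names and SNP identifiers are placeholders).
fragments = [
    ("race_A", ("rs1", "rs2"), "AC"),
    ("race_A", ("rs1", "rs2"), "AT"),
    ("race_A", ("rs1", "rs2", "rs3"), "ACG"),
    ("race_B", ("rs1", "rs2"), "AC"),
]
by_anchor, by_sites = group_fragments(fragments)
```

Here all three `race_A` fragments share the 5'-most site `rs1` and fall into one step (5) group, while step (6) separates the two-site fragments from the three-site fragment.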
Further, in step (1), whole-genome data of each race are collected from the HapMap project, the international 1000 Genomes Project, and qi Yun Nuode.
The advantages of the invention include: in the constructed database, the genetic information of the same haplotype is stored as fragments covering different numbers of SNP sites, which facilitates comparison with a haplotype under test; fragments with the same SNP sites and the same bases at those sites are labeled with all of the corresponding race information, which avoids the loss of accuracy caused by incomplete ancestry information when searching for disease-associated genes and reduces the resulting limitation on the development of molecular diagnosis, precision medicine, pharmacy, and personalized medication.
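The text does not describe how a haplotype under test is compared against the database. Purely as an assumed illustration of why fragment-level storage with complete race labels helps, the sketch below slides a box over a query haplotype and reports every race label attached to each matching fragment. The function `lookup_races`, the dictionary layout of `database`, and all identifiers are hypothetical.

```python
def lookup_races(database, query, box_size):
    """Report the race labels of every database fragment matched by the query.

    `database` maps (sites, bases) -> set of race labels.
    `query` is a list of (snp_id, base) pairs ordered 5' -> 3'.
    Returns a list of (sites, sorted_race_labels) for matched fragments.
    """
    hits = []
    for start in range(len(query) - box_size + 1):
        window = query[start:start + box_size]
        sites = tuple(snp for snp, _ in window)
        bases = "".join(base for _, base in window)
        races = database.get((sites, bases))
        if races:
            hits.append((sites, sorted(races)))
    return hits


# Toy database and query (all values are placeholders).
database = {
    (("rs1", "rs2", "rs3"), "ACG"): {"race_A", "race_B"},
    (("rs2", "rs3", "rs4"), "CGT"): {"race_A"},
}
query = [("rs1", "A"), ("rs2", "C"), ("rs3", "G"), ("rs4", "T"), ("rs5", "A")]
hits = lookup_races(database, query, 3)
```

Because the first fragment carries both labels, a lookup does not silently drop `race_B`, which is the situation the multi-label marking is meant to avoid.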
Detailed Description
The present invention is described in detail below with reference to specific examples, which are given for the purpose of illustrating and explaining the invention and are not to be construed as limiting it.
Example 1
The method for constructing a haplotype ancestry database for various human races comprises the following steps:
(1) Collecting whole-genome data of various races from databases containing race genome data, such as the HapMap project, the international 1000 Genomes Project, qigong Yun Nuode and the like, and taking a single haplotype sequence as the sample unit;
(2) Setting an extraction box capable of extracting 20 consecutive SNPs, an extraction box capable of extracting 21 consecutive SNPs, an extraction box capable of extracting 22 consecutive SNPs, and an extraction box capable of extracting 200 consecutive SNPs; each extraction box moves from the 5' end to the 3' end of the haplotype sequence one SNP at a time, extracts the SNP information of the fragment inside the box, and labels each fragment with its corresponding race information; the extraction boxes may move and extract simultaneously or in batches; the fragments are temporarily stored in order of the SNP site of each fragment nearest the 5' end, until the SNP information of every haplotype sequence of every race has been completely extracted by each extraction box;
(3) Within the same race, comparing fragments that cover the same SNP sites, and merging fragments that have the same SNP sites and the same bases at those sites, so as to avoid the redundancy caused by repeated storage;
(4) Across races, comparing fragments that cover the same SNP sites, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all of the corresponding race information;
(5) Within the same race, fragments whose SNP site nearest the 5' end is the same are classified into the same group;
(6) Within the same race, fragments with identical SNP sites are classified into the same group.
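Steps (2) to (4) of this example can be condensed into a single loop, shown here as a rough sketch rather than the implementation used by the invention. The function `build_database`, the input layout `{race: [haplotype, ...]}`, and the toy box sizes 2 and 3 are assumptions chosen so the result can be checked by hand; the example itself uses boxes of 20, 21, 22, and 200 consecutive SNPs.

```python
from collections import defaultdict


def build_database(haplotypes_by_race, box_sizes):
    """Build {(sites, bases): set_of_races} from per-race haplotypes.

    Each haplotype is a list of (snp_id, base) pairs ordered 5' -> 3'.
    Adding to a set merges repeated fragments within a race (step 3)
    and accumulates every race that shares a fragment (step 4).
    """
    database = defaultdict(set)
    for race, haplotypes in haplotypes_by_race.items():
        for hap in haplotypes:
            for size in box_sizes:
                # The box moves 5' -> 3', one SNP per step (step 2).
                for start in range(len(hap) - size + 1):
                    window = hap[start:start + size]
                    sites = tuple(snp for snp, _ in window)
                    bases = "".join(base for _, base in window)
                    database[(sites, bases)].add(race)
    return dict(database)


def make_haplotype(bases):
    """Helper for the toy data: attach placeholder SNP ids rs1, rs2, ..."""
    return [(f"rs{i + 1}", base) for i, base in enumerate(bases)]


# Toy input: two haplotypes for race_A, one for race_B (placeholders).
haplotypes_by_race = {
    "race_A": [make_haplotype("ACGT"), make_haplotype("ACGA")],
    "race_B": [make_haplotype("ATGT")],
}
database = build_database(haplotypes_by_race, box_sizes=(2, 3))
```

In the toy result, the two-SNP fragment `GT` at sites `rs3` and `rs4` is labeled with both races, while the three-SNP fragment `ACG` at `rs1` to `rs3` is stored once and labeled with `race_A` only.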
Example 2
This example differs from Example 1 as follows: in step (2), the extraction box moves from the 3' end to the 5' end of the haplotype sequence, and within the same race the fragments are temporarily stored in order of the SNP site of each fragment nearest the 3' end; in step (5), fragments within the same race whose SNP site nearest the 3' end is the same are classified into the same group.
Example 3
This example differs from Example 1 in that the extraction boxes in step (2) are set as: an extraction box capable of extracting 20 consecutive SNPs, an extraction box capable of extracting 50 consecutive SNPs, an extraction box capable of extracting 80 consecutive SNPs, an extraction box capable of extracting 120 consecutive SNPs, an extraction box capable of extracting 160 consecutive SNPs, an extraction box capable of extracting 200 consecutive SNPs.
Example 4
This example differs from Example 2 in that the extraction boxes in step (2) are set as: an extraction box capable of extracting 20 consecutive SNPs, an extraction box capable of extracting 50 consecutive SNPs, an extraction box capable of extracting 80 consecutive SNPs, an extraction box capable of extracting 120 consecutive SNPs, an extraction box capable of extracting 160 consecutive SNPs, an extraction box capable of extracting 200 consecutive SNPs.
The database constructed by the invention contains comprehensive data and can provide a basis for accurately tracing the ancestry of admixed individuals. Because the genetic information of the same haplotype is stored as fragments covering different numbers of SNP sites, the database facilitates the analysis of traits and diseases and helps identify genes associated with, for example, severe disease progression in individuals. It avoids a gene analysis being steered in the wrong direction by an incorrectly determined ancestry, and actively promotes the development of molecular diagnosis, precision medicine, pharmacy, and personalized medication technology.
The technical solutions provided by the embodiments of the present invention have been described in detail above, and specific examples have been used to explain the principles and implementations of the embodiments; the above description of the embodiments is intended only to aid understanding of those principles. Meanwhile, a person skilled in the art may, on the basis of the embodiments of the present invention, make changes to the specific implementations and to the scope of application, and the content of this description should not be construed as limiting the present invention.