CN114783527A - Construction method of various human haplotype ancestor source databases - Google Patents

Info

Publication number
CN114783527A
Authority
CN
China
Prior art keywords
same
snp
extraction
information
fragments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210564500.2A
Other languages
Chinese (zh)
Other versions
CN114783527B (en)
Inventor
宋清
马丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Song Qing
Original Assignee
Guangzhou Hongxi Jianshan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-05-23
Filing date
2022-05-23
Publication date
2022-07-22
Application filed by Guangzhou Hongxi Jianshan Technology Co ltd
Priority to CN202210564500.2A
Publication of CN114783527A
Application granted
Publication of CN114783527B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing an ancestor source database of the haplotypes of various races, comprising the following steps: extracting information from the haplotype sequences of each race using extraction frames of different sizes, and labeling each extracted fragment with its race information; within each race, comparing fragments that cover the same SNP sites and merging those that also carry the same bases at those sites; and, across races, comparing fragments that cover the same SNP sites, identifying those that also carry the same bases at those sites, and labeling each such fragment with all of the races in which it occurs. The advantages of the invention are as follows: the genetic information of a single haplotype is stored as fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and every fragment is labeled with all the races that share its SNP sites and bases, which avoids the loss of accuracy that incomplete ancestry information would otherwise cause when searching for disease-associated genes.

Description

Construction method of various human haplotype ancestor source databases
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a technique for organizing ancestry data based on SNPs.
Background
At the level of the human genome, the majority of human genetic variations are single nucleotide polymorphisms (SNPs). On average there is one SNP site per roughly 1000 bp of the human genome, and SNPs occur widely in both non-coding and coding regions. Individuals of different races carry different SNPs. Because humans have undertaken long-distance migrations many times from ancient times to the present, and admixture among their descendants is common, the genome of a single individual may contain genetic information from several different races. It is therefore unscientific to determine an individual's race from appearance alone, such as skin color.
SNPs are associated not only with differences in traits such as height, skin color and body type, but also with the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and so on. Analyzing the genetic information of an individual or a specific group requires knowing which race the genes of the target individual or group originate from; only with this ancestry information can the probability of developing certain genetic diseases, the level of immune resistance to certain diseases, and the like be analyzed accurately. The genetic information in an ancestor source database must therefore be comprehensive and correctly classified. Existing ancestor source databases record a large amount of race-related ancestry information obtained in the course of biomedical research, but the same genetic information may be shared by several ancestral sources, and the existing classification is not necessarily accurate: a piece of genetic information may be labeled as present in some ancestral sources even though it is also present in others. The overlooked ancestral sources are then analyzed inadequately, which holds back progress in genetics and biomedicine. If SNP analysis can be used to improve ancestry classification, the accuracy of tracing the ancestral source of a target gene or of an individual's genetic information will improve, which can guide the analysis of associations between SNPs and genetic diseases as well as molecular diagnosis, precision medicine, pharmacy and personalized medication.
Disclosure of Invention
The invention aims to provide a method for constructing an ancestor source database of the haplotypes of various races, in order to solve the prior-art problems that, because database information is incomplete, the genetic information in the haplotype of a sample under test cannot be classified correctly and its ancestral source cannot be traced accurately.
To achieve this aim, the invention adopts the following technical solution:
The method for constructing an ancestor source database of the haplotypes of various races comprises the following steps:
(1) collecting the whole-genome data of each race, and taking a single haplotype sequence as the sample unit;
(2) setting an extraction frame, wherein the extraction frame moves from one end of a haplotype sequence to the other and extracts the SNP information of the fragment inside the frame; each fragment is labeled with its race information, and the fragments of the same race are temporarily stored in order of the SNP site closest to the 5' end or the 3' end of each fragment, until the SNP information of every haplotype sequence of every race has been extracted;
(3) within the same race, comparing fragments that have the same SNP sites, and merging fragments that have the same SNP sites and the same base information at those sites;
(4) across races, comparing fragments that have the same SNP sites, identifying fragments that have the same SNP sites and the same base information at those sites, and labeling each such fragment with all of the corresponding race information.
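Steps (2) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a haplotype is modeled as a list of `(site, base)` pairs ordered from the 5' end to the 3' end, and the function names `extract_fragments` and `build_database`, the race labels, and the toy frame size of 3 SNPs are all hypothetical.

```python
from collections import defaultdict


def extract_fragments(haplotype, frame_size):
    """Step (2): slide a frame of `frame_size` consecutive SNPs along one
    haplotype, one SNP at a time, and return the fragment seen at each stop."""
    return [tuple(haplotype[i:i + frame_size])
            for i in range(len(haplotype) - frame_size + 1)]


def build_database(haplotypes_by_race, frame_size):
    """Return a dict mapping each distinct fragment to the set of races
    in which it occurs."""
    database = defaultdict(set)
    for race, haplotypes in haplotypes_by_race.items():
        # Step (3): a set merges fragments of one race that have the same
        # SNP sites and the same bases at those sites.
        merged = {frag for hap in haplotypes
                  for frag in extract_fragments(hap, frame_size)}
        # Step (4): identical fragments found in several races end up under
        # one key that carries every race label.
        for frag in merged:
            database[frag].add(race)
    return dict(database)


# Toy data (hypothetical): site identifiers are positions, bases are letters.
haplotypes_by_race = {
    "race_A": [
        [(101, "A"), (205, "G"), (330, "T"), (412, "C")],
        [(101, "A"), (205, "G"), (330, "T"), (412, "T")],
    ],
    "race_B": [
        [(101, "A"), (205, "G"), (330, "T"), (412, "G")],
    ],
}

db = build_database(haplotypes_by_race, frame_size=3)
shared = ((101, "A"), (205, "G"), (330, "T"))
print(sorted(db[shared]))  # races that share this fragment
```

Keying the dictionary on the full tuple of `(site, base)` pairs means two fragments are treated as identical only when both the SNP sites and the bases at those sites agree, which is the matching condition stated in steps (3) and (4).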
Further, the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end.
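The SNP-by-SNP movement in either direction might look like the sketch below. The generator name `slide_frame` and the direction labels `"5to3"` and `"3to5"` are hypothetical; the haplotype is assumed to be a list of `(site, base)` pairs written from the 5' end to the 3' end.

```python
def slide_frame(haplotype, frame_size, direction="5to3"):
    """Yield the fragment inside the frame at each stop, advancing one SNP
    per step in the requested direction."""
    if direction not in ("5to3", "3to5"):
        raise ValueError("direction must be '5to3' or '3to5'")
    last_start = len(haplotype) - frame_size
    starts = range(0, last_start + 1)  # empty if the haplotype is too short
    if direction == "3to5":
        starts = reversed(starts)
    for start in starts:
        yield tuple(haplotype[start:start + frame_size])


# Hypothetical haplotype with four SNP sites.
haplotype = [(1, "A"), (2, "C"), (3, "G"), (4, "T")]

forward = list(slide_frame(haplotype, 3, "5to3"))
backward = list(slide_frame(haplotype, 3, "3to5"))
print(forward)
print(backward)
```

Both directions visit the same frame positions; only the order in which fragments are produced differs, which matters for the order of temporary storage in step (2).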
Further, the size of the extraction frame is such that 10 to 200 consecutive SNPs can be extracted.
Further, in step (2), more than 2 extraction frames of different sizes move along the same haplotype sequence to extract its SNP information, until the SNP information of every haplotype sequence of every race has been extracted by every extraction frame.
Further, the more than 2 extraction frames of different sizes move simultaneously or in batches to extract the SNP information of the same haplotype sequence.
Further, the extraction frames of different sizes are selected from: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 21 consecutive SNPs, an extraction frame capable of extracting 22 consecutive SNPs, and an extraction frame capable of extracting 200 consecutive SNPs.
Further, the extraction frames of different sizes are selected from: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 50 consecutive SNPs, an extraction frame capable of extracting 80 consecutive SNPs, an extraction frame capable of extracting 120 consecutive SNPs, an extraction frame capable of extracting 160 consecutive SNPs, and an extraction frame capable of extracting 200 consecutive SNPs.
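As an illustration of running several frame sizes over the same haplotype, the sketch below applies the sizes 20, 50, 80, 120, 160 and 200 to a synthetic haplotype. The function name `extract_all_sizes` and the 250-site haplotype are assumptions made for the example; the frames are processed one size after another, which corresponds to the "in batches" option.

```python
def extract_all_sizes(haplotype, frame_sizes):
    """Run one extraction frame per size over the same haplotype.

    Returns {frame_size: [fragment, ...]}. A haplotype with n SNPs yields
    n - w + 1 fragments for a frame of w SNPs (none if n < w)."""
    return {
        w: [tuple(haplotype[i:i + w]) for i in range(len(haplotype) - w + 1)]
        for w in frame_sizes
    }


# Hypothetical haplotype with 250 SNP sites and deterministic bases.
haplotype = [(site, "ACGT"[site % 4]) for site in range(250)]

by_size = extract_all_sizes(haplotype, (20, 50, 80, 120, 160, 200))
for w, fragments in by_size.items():
    print(w, len(fragments))
```

Each SNP site therefore appears in fragments of several lengths, so the same stretch of a haplotype is stored at more than one resolution.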
Further, the method also comprises the step of:
(5) within the same race, classifying fragments whose SNP site closest to the 5' end or the 3' end is the same into the same group.
Further, the method also comprises the step of:
(6) within the same race, classifying fragments that have the same SNP sites into the same group.
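The two grouping steps can be sketched as dictionary groupings. The function names `group_by_end_site` and `group_by_site_set` are hypothetical, and fragments are assumed to be tuples of `(site, base)` pairs written from the 5' end to the 3' end.

```python
from collections import defaultdict


def group_by_end_site(fragments, end="5"):
    """Step (5): group fragments by the SNP site closest to the chosen end."""
    if end not in ("5", "3"):
        raise ValueError("end must be '5' or '3'")
    groups = defaultdict(list)
    for frag in fragments:
        anchor = frag[0][0] if end == "5" else frag[-1][0]
        groups[anchor].append(frag)
    return dict(groups)


def group_by_site_set(fragments):
    """Step (6): group fragments that cover exactly the same SNP sites,
    whatever bases they carry at those sites."""
    groups = defaultdict(list)
    for frag in fragments:
        sites = tuple(site for site, _base in frag)
        groups[sites].append(frag)
    return dict(groups)


# Hypothetical merged fragments of one race.
fragments = [
    ((101, "A"), (205, "G")),
    ((101, "A"), (205, "C")),
    ((101, "A"), (205, "G"), (330, "T")),
    ((205, "G"), (330, "T")),
]

by_5_end = group_by_end_site(fragments, end="5")
by_3_end = group_by_end_site(fragments, end="3")
by_sites = group_by_site_set(fragments)
print({site: len(group) for site, group in by_5_end.items()})
print({sites: len(group) for sites, group in by_sites.items()})
```

Grouping by site set places alternative alleles of the same region side by side, while grouping by end site collects fragments of different lengths that start (or end) at the same SNP.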
Further, in step (1), the whole-genome data of each race is collected from the HapMap project, the international 1000 Genomes Project, and Qiyunnade.
The advantages of the invention include: in the constructed database, the genetic information of the same haplotype is stored as fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and each fragment is labeled with all the races that share its SNP sites and the bases at those sites, which avoids the loss of accuracy caused by incomplete ancestry information when searching for disease-associated genes and removes a constraint on the development of molecular diagnosis, precision medicine, pharmacy and personalized medication.
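To illustrate why storing fragments of several lengths helps when comparing a haplotype under test, the sketch below looks up exact matches in a toy database. The text does not describe the lookup procedure, so the function `candidate_races`, the exact-match rule, the database contents and the frame sizes of 2 and 3 SNPs are all assumptions made for illustration.

```python
def candidate_races(query_haplotype, database, frame_sizes):
    """Slide each frame over the query and collect the race labels attached
    to fragments that match a database entry exactly.

    Returns a list of (frame_size, 5'-most site of the fragment, races)."""
    hits = []
    for w in frame_sizes:
        for i in range(len(query_haplotype) - w + 1):
            frag = tuple(query_haplotype[i:i + w])
            races = database.get(frag)
            if races:
                hits.append((w, frag[0][0], sorted(races)))
    return hits


# Hypothetical database: fragment -> set of race labels.
database = {
    ((101, "A"), (205, "G")): {"race_A", "race_B"},
    ((205, "G"), (330, "T")): {"race_A"},
    ((101, "A"), (205, "G"), (330, "T")): {"race_A"},
}

# The query differs from the stored haplotype at site 330.
query = [(101, "A"), (205, "G"), (330, "C")]

hits = candidate_races(query, database, frame_sizes=(2, 3))
print(hits)
```

In this toy case the 3-SNP fragment fails to match because of the difference at site 330, but the shorter 2-SNP fragment still matches and reports every race that carries it.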
Detailed Description
The present invention will be described in detail with reference to specific embodiments, which are provided to illustrate the present invention but not to limit the present invention.
Example one
The method for constructing an ancestor source database of the haplotypes of various races comprises the following steps:
(1) collecting the whole-genome data of each race from databases that contain genome data of the races, such as the HapMap project, the international 1000 Genomes Project and Qiyunnade, and taking a single haplotype sequence as the sample unit;
(2) setting extraction frames capable of extracting 20 consecutive SNPs, 21 consecutive SNPs, 22 consecutive SNPs and 200 consecutive SNPs, respectively; each extraction frame moves from the 5' end to the 3' end of the haplotype sequence one SNP at a time, extracts the SNP information of the fragment inside the frame, and labels each fragment with its race information; the extraction frames may move and extract simultaneously or in batches; the fragments of the same race are temporarily stored in order of the SNP site closest to the 5' end of each fragment, until the SNP information of every haplotype sequence of every race has been extracted by every extraction frame;
(3) within the same race, comparing fragments that have the same SNP sites, and merging fragments that have the same SNP sites and the same base information at those sites, thereby avoiding the redundancy caused by repeated storage;
(4) across races, comparing fragments that have the same SNP sites, identifying fragments that have the same SNP sites and the same base information at those sites, and labeling each such fragment with all of the corresponding race information;
(5) within the same race, classifying fragments whose SNP site closest to the 5' end is the same into the same group;
(6) within the same race, classifying fragments that have the same SNP sites into the same group.
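The temporary storage order of step (2) and the redundancy removal of step (3) in this example can be sketched as follows. The function names `stage_fragments` and `merge_duplicates` are hypothetical, SNP sites are assumed to be integer positions so that they can be sorted, and the frame sizes of 2 and 3 SNPs are toy stand-ins for the frame sizes named in the example.

```python
def stage_fragments(haplotypes, frame_sizes):
    """Step (2): extract with every frame, then store the fragments of one
    race in order of their 5'-most SNP site (ties: shorter frame first)."""
    staged = [
        tuple(hap[i:i + w])
        for hap in haplotypes
        for w in frame_sizes
        for i in range(len(hap) - w + 1)
    ]
    staged.sort(key=lambda frag: (frag[0][0], len(frag)))
    return staged


def merge_duplicates(staged):
    """Step (3): keep one copy of each fragment, preserving the stored order."""
    return list(dict.fromkeys(staged))


# Two hypothetical haplotypes of one race that differ only at site 330.
race_haplotypes = [
    [(101, "A"), (205, "G"), (330, "T")],
    [(101, "A"), (205, "G"), (330, "C")],
]

staged = stage_fragments(race_haplotypes, frame_sizes=(2, 3))
merged = merge_duplicates(staged)
print(len(staged), len(merged))
for frag in merged:
    print(frag)
```

Because the fragments are already ordered by their 5'-most site, identical fragments sit next to each other before merging, and the grouping of step (5) follows directly from the stored order.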
Example two
The difference from Example one is that, in step (2), the extraction frame moves from the 3' end to the 5' end of the haplotype sequence, and the fragments of the same race are temporarily stored in order of the SNP site closest to the 3' end of each fragment; and in step (5), within the same race, fragments whose SNP site closest to the 3' end is the same are classified into the same group.
Example three
The difference from Example one is that the extraction frames in step (2) are set as: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 50 consecutive SNPs, an extraction frame capable of extracting 80 consecutive SNPs, an extraction frame capable of extracting 120 consecutive SNPs, an extraction frame capable of extracting 160 consecutive SNPs, and an extraction frame capable of extracting 200 consecutive SNPs.
Example four
The difference from Example two is that the extraction frames in step (2) are set as: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 50 consecutive SNPs, an extraction frame capable of extracting 80 consecutive SNPs, an extraction frame capable of extracting 120 consecutive SNPs, an extraction frame capable of extracting 160 consecutive SNPs, and an extraction frame capable of extracting 200 consecutive SNPs.
The database constructed by the invention contains comprehensive data and provides a basis for accurately tracing the ancestral sources of individuals of mixed ancestry. Because the genetic information of the same haplotype is stored as fragments containing different numbers of SNP sites, the database facilitates gene-association work such as trait analysis, disease analysis, and the effective protection of individuals against severe progression of certain diseases. It avoids the analysis of a gene being misdirected because its ancestral source was not correctly identified, and it actively promotes the development of molecular diagnosis, precision medicine, pharmacy and personalized medication.
The technical solutions provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the embodiments is intended only to aid understanding of those principles. A person skilled in the art may, following the embodiments of the invention, vary the specific implementations and the scope of application. In conclusion, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for constructing an ancestor source database of the haplotypes of various races, characterized in that
the method comprises the following steps:
(1) collecting the whole-genome data of each race, and taking a single haplotype sequence as the sample unit;
(2) setting an extraction frame, wherein the extraction frame moves from one end of a haplotype sequence to the other and extracts the SNP information of the fragment inside the frame; each fragment is labeled with its race information, and the fragments of the same race are temporarily stored in order of the SNP site closest to the 5' end or the 3' end of each fragment, until the SNP information of every haplotype sequence of every race has been extracted;
(3) within the same race, comparing fragments that have the same SNP sites, and merging fragments that have the same SNP sites and the same base information at those sites;
(4) across races, comparing fragments that have the same SNP sites, identifying fragments that have the same SNP sites and the same base information at those sites, and labeling each such fragment with all of the corresponding race information.
2. The method of claim 1, wherein:
the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end.
3. The method of claim 1, wherein:
the size of the extraction frame is such that 10 to 200 consecutive SNPs can be extracted.
4. The method of any one of claims 1-3, wherein:
in step (2), more than 2 extraction frames of different sizes move along the same haplotype sequence to extract its SNP information, until the SNP information of every haplotype sequence of every race has been extracted by every extraction frame.
5. The method of claim 4, wherein:
the more than 2 extraction frames of different sizes move simultaneously or in batches to extract the SNP information of the same haplotype sequence.
6. The method of claim 4, wherein:
the extraction frames of different sizes are selected from: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 21 consecutive SNPs, and an extraction frame capable of extracting 22 consecutive SNPs.
7. The method of claim 4, wherein:
the extraction frames of different sizes are selected from: an extraction frame capable of extracting 20 consecutive SNPs, an extraction frame capable of extracting 50 consecutive SNPs, an extraction frame capable of extracting 80 consecutive SNPs, an extraction frame capable of extracting 120 consecutive SNPs, an extraction frame capable of extracting 160 consecutive SNPs, and an extraction frame capable of extracting 200 consecutive SNPs.
8. The method of claim 1, 6 or 7, wherein
the method further comprises the step of:
(5) within the same race, classifying fragments whose SNP site closest to the 5' end or the 3' end is the same into the same group.
9. The method of claim 8, wherein
the method further comprises the step of:
(6) within the same race, classifying fragments that have the same SNP sites into the same group.
10. The method of claim 1, wherein:
in step (1), the whole-genome data of each race is collected from the HapMap project, the international 1000 Genomes Project, and Qiyunnade.
CN202210564500.2A 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people Active CN114783527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210564500.2A CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210564500.2A CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Publications (2)

Publication Number Publication Date
CN114783527A true CN114783527A (en) 2022-07-22
CN114783527B CN114783527B (en) 2024-05-03

Family

ID=82408743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210564500.2A Active CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Country Status (1)

Country Link
CN (1) CN114783527B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100037342A1 (en) * 2008-08-01 2010-02-11 Monsanto Technology Llc Methods and compositions for breeding plants with enhanced yield
KR20100067493A (en) * 2008-12-11 2010-06-21 한국생명공학연구원 The system to detect and analyze the genetic and disease information elements within families
CN101956006A (en) * 2010-08-27 2011-01-26 公安部物证鉴定中心 Method for obtaining race specific loci and race inference system and application thereof
KR20110133223A (en) * 2010-06-04 2011-12-12 대한민국 (식품의약품안전청장) Methods of prediction of dpd enzyme activity using haplotype in korean
US20140067280A1 (en) * 2012-08-28 2014-03-06 Inova Health System Ancestral-Specific Reference Genomes And Uses Thereof
CN109993305A (en) * 2018-01-03 2019-07-09 成都二十三魔方生物科技有限公司 Ancestral source polymorphism prediction technique based on big data intelligent algorithm
CN110491441A (en) * 2019-05-06 2019-11-22 西安交通大学 A kind of gene sequencing data simulation system and method for simulation crowd background information
CN111210874A (en) * 2020-01-07 2020-05-29 北京奇云诺德信息科技有限公司 Algorithm for performing ancestral source analysis prediction based on gene big data
CN112233724A (en) * 2020-10-16 2021-01-15 深圳市盛景基因生物科技有限公司 Ancestral polymorphism prediction method based on big data artificial intelligence algorithm
CN112885408A (en) * 2021-02-22 2021-06-01 中国农业大学 Method and device for detecting SNP marker locus based on low-depth sequencing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘振伟等: "《植物新品种保护法律制度》", vol. 1, 中国民主法治出版社, pages: 127 - 130 *
殷才湧等: "EA-YPredictor:基于Y-STR数据的家系特异性单倍群归属判别分析软件", 《刑事技术》, no. 02 *
陈峰等: "DNA微单倍型的研究现状、挑战与展望", 《南京医科大学学报(自然科学版)》, no. 08 *

Also Published As

Publication number Publication date
CN114783527B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
US9639659B2 (en) Ancestral-specific reference genomes and uses in identifying a candidate for a clinical trial
Leao et al. Comparative genomics uncovers the prolific and distinctive metabolic potential of the cyanobacterial genus Moorea
Krings et al. mtDNA analysis of Nile River Valley populations: A genetic corridor or a barrier to migration?
CN112446351B (en) Intelligent identification method for medical bills
Sneath Chapter II Computer Taxonomy
CN106156538A (en) The annotation method of a kind of full-length genome variation data and annotation system
CN107609347A (en) A kind of grand transcript profile data analysing method based on high throughput sequencing technologies
Dogan et al. A glimpse at the intricate mosaic of ethnicities from Mesopotamia: Paternal lineages of the Northern Iraqi Arabs, Kurds, Syriacs, Turkmens and Yazidis
Jordan et al. Native American admixture recapitulates population-specific migration and settlement of the continental United States
Huoponen et al. Mitochondrial DNA variation in an aboriginal Australian population: evidence for genetic isolation and regional differentiation
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
CN112397174A (en) Chronic disease medication guidance device and method
CN110970116A (en) Transcriptomics-based traditional Chinese medicine pharmacological mechanism analysis method
CN115631789A (en) Pangenome-based group joint variation detection method
CN112768080A (en) Medical keyword bank establishing method and system based on medical big data
Hauth et al. Beyond tandem repeats: complex pattern structures and distant regions of similarity
CN109993305A (en) Ancestral source polymorphism prediction technique based on big data intelligent algorithm
CN110111847A (en) Method and apparatus based on ITS2 plant identification species
CN114783527B (en) Construction method of haplotype progenitor source database of various people
CN114783528B (en) Application method of haplotype progenitor source database
Bonnen et al. European admixture on the Micronesian island of Kosrae: lessons from complete genetic information
CN111243661A (en) Gene physical examination system based on gene data
CN106529212A (en) Sequence-order dependent frequency matrix-based biological sequence evolution information extraction method and application thereof
CN111128297B (en) Preparation method of gene chip
CN114242171B (en) BCR classification method combining logistic regression and multi-example learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240409
Address after: Unit 4062, Building 6, No. 16 Beitaiping Road, Haidian District, Beijing, 100000
Applicant after: Song Qing
Country or region after: China
Address before: 510300 unit 301, floor 3, production area, No. 1, helix 4th Road, Huangpu District (Guangzhou International Biological Island), Guangzhou City, Guangdong Province
Applicant before: Guangzhou Hongxi Jianshan Technology Co.,Ltd.
Country or region before: China

GR01 Patent grant