CN114783527B - Construction method of haplotype progenitor source database of various people - Google Patents

Info

Publication number
CN114783527B
CN114783527B CN202210564500.2A CN202210564500A CN114783527B CN 114783527 B CN114783527 B CN 114783527B CN 202210564500 A CN202210564500 A CN 202210564500A CN 114783527 B CN114783527 B CN 114783527B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210564500.2A
Other languages
Chinese (zh)
Other versions
CN114783527A (en
Inventor
宋清
马丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Song Qing
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-05-23
Filing date
2022-05-23
Publication date
2024-05-03
Application filed by Individual
Priority to CN202210564500.2A
Publication of CN114783527A
Application granted
Publication of CN114783527B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a haplotype ancestry (progenitor source) database covering multiple races, comprising the following steps: extracting information from the haplotype sequences of each race using extraction frames of different sizes, and labeling each fragment with its race information; comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites; comparing fragments that cover the same SNP sites across races, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all corresponding race information. The advantages of the invention are as follows: genetic information from the same haplotype is stored in fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and fragments with the same SNP sites and the same bases at those sites are labeled with all corresponding race information, which avoids the loss of accuracy caused by incomplete ancestry information when searching for disease-associated genes.

Description

Construction method of haplotype progenitor source database of various people
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to an SNP-based technique for organizing ancestry data.
Background
At the level of the human genome, most human genetic variation consists of single nucleotide polymorphisms (SNPs). On average there is about one SNP site per 1,000 bp of the human genome, and SNPs are widely distributed in both non-coding and coding regions. Individuals of different ethnic groups carry different SNPs. Because long-distance human migration has occurred many times from ancient times to the present, and admixture among descendants is common, the genome of one individual may contain genetic information from several different ethnic groups. It is therefore unscientific to determine an individual's ethnic origin from appearance alone, such as skin color.
SNPs are associated not only with differences in height, skin color, body shape, and the like, but also with the probability of suffering from certain genetic diseases, the level of immunity against certain diseases, and so on. Analyzing the genetic information of an individual or a specific group requires knowing which race the genes of the target individual or group originate from; with ancestry information, the probability that the individual or group will suffer from certain genetic diseases, the level of immune resistance to certain diseases, and the like can be analyzed accurately. The genetic information in an ancestry database must be comprehensive and correctly classified. Existing ancestry databases record race-related ancestry information obtained during the development of many biomedicines, but the same genetic information may be shared by several ancestral sources, and the classification is not necessarily accurate: a piece of genetic information may be labeled with only some of its ancestral sources while the others are ignored. As a result, the genetic information of the ignored ancestral sources is not analyzed well, which restricts progress in genetic information analysis and biomedical development. If SNP analysis can be applied to improve ancestry classification, the accuracy of determining the ancestry of a target gene or of an individual's genetic information can be improved, which would help guide SNP and genetic disease association analysis, molecular diagnosis, precision medicine, pharmacy, and personalized medication.
Disclosure of Invention
The invention aims to provide a method for constructing a haplotype ancestry (progenitor source) database for multiple races, in order to solve the problem in the prior art that database information is not comprehensive enough, so that the genetic information in the haplotype of a sample under test cannot be classified with the greatest accuracy and its ancestry cannot be traced accurately.
To achieve the above purpose, the invention adopts the following technical solution:
The method for constructing the haplotype ancestry database for multiple races comprises the following steps:
(1) Collecting genome data of each race, and taking a single haplotype sequence as the sample unit;
(2) Setting an extraction frame that moves from one end of a haplotype sequence to the other, extracting the SNP information of the fragment inside the frame, labeling each fragment with its corresponding race information, and, within the same race, temporarily storing the fragments in order according to the SNP site of each fragment closest to its 5' end or 3' end, until the SNP information of every haplotype sequence of every race has been extracted;
(3) Comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites;
(4) Comparing fragments that cover the same SNP sites across races, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all corresponding race information.
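Steps (2) to (4) can be read as a sliding-window extraction followed by two rounds of deduplication. The Python sketch below is one possible reading, not the patented implementation. The data model (a haplotype as a list of (snp_site, base) pairs ordered 5' to 3'), the function names extract_fragments and build_database, the toy rs identifiers, and the small frame size are all assumptions made for illustration.

```python
from collections import defaultdict

def extract_fragments(haplotype, frame_size):
    """Step (2): slide a frame of `frame_size` consecutive SNPs, one SNP at a time.

    `haplotype` is assumed to be a list of (snp_site, base) pairs, 5' to 3'.
    """
    return [tuple(haplotype[i:i + frame_size])
            for i in range(len(haplotype) - frame_size + 1)]

def build_database(haplotypes_by_race, frame_size):
    # Steps (2)-(3): extract per race; using a set merges fragments that
    # have the same SNP sites and the same bases at those sites.
    per_race = {
        race: {frag
               for hap in haplotypes
               for frag in extract_fragments(hap, frame_size)}
        for race, haplotypes in haplotypes_by_race.items()
    }
    # Step (4): label every fragment with all races in which it occurs.
    labels = defaultdict(set)
    for race, fragments in per_race.items():
        for frag in fragments:
            labels[frag].add(race)
    return dict(labels)

# Toy input (hypothetical): two races, haplotypes differing only at rs4.
hap_a = [("rs1", "A"), ("rs2", "G"), ("rs3", "T"), ("rs4", "C")]
hap_b = [("rs1", "A"), ("rs2", "G"), ("rs3", "T"), ("rs4", "T")]
db = build_database({"race_A": [hap_a, hap_a], "race_B": [hap_b]}, frame_size=3)
```

In this toy run the fragment covering rs1 to rs3 is shared by both races and carries both labels, while the two fragments that differ at rs4 each keep a single label. The duplicate copy of hap_a contributes nothing new, which is the merging described in step (3).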
Further, the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end of the haplotype sequence.
Further, the extraction frame is sized to extract 10 to 200 consecutive SNPs.
Further, in step (2), two or more extraction frames of different sizes are each moved along the same haplotype sequence to extract its SNP information, until every extraction frame has been moved across every haplotype sequence of every race.
Further, the two or more extraction frames of different sizes extract the SNP information of the same haplotype sequence either simultaneously or in batches.
Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, and an extraction frame that extracts 22 consecutive SNPs.
Further, the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
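The choice between moving several frames simultaneously and moving them in batches can be illustrated with a short Python sketch. It assumes a haplotype is a list of (snp_site, base) pairs ordered 5' to 3'; the function names and the 25-SNP toy haplotype are hypothetical. Under this reading, the two schedules differ only in loop order and produce the same set of fragments.

```python
def extract_simultaneous(haplotype, frame_sizes):
    """One pass over the haplotype: at each start position, cut one fragment
    per frame size that still fits."""
    fragments = []
    for start in range(len(haplotype)):
        for size in frame_sizes:
            if start + size <= len(haplotype):
                fragments.append(tuple(haplotype[start:start + size]))
    return fragments

def extract_in_batches(haplotype, frame_sizes):
    """One full pass over the haplotype per frame size."""
    fragments = []
    for size in frame_sizes:
        for start in range(len(haplotype) - size + 1):
            fragments.append(tuple(haplotype[start:start + size]))
    return fragments

# Toy haplotype with 25 SNP sites (hypothetical identifiers and bases).
haplotype = [(f"snp{i}", "ACGT"[i % 4]) for i in range(25)]
sizes = (20, 21, 22)
simultaneous = extract_simultaneous(haplotype, sizes)
batched = extract_in_batches(haplotype, sizes)
```

Only the order of the resulting list differs between the two functions, so either schedule could feed the later merging and labeling steps.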
Further, the steps also include:
(5) Within the same race, grouping fragments whose SNP site closest to the 5' end or the 3' end is identical into the same group.
Further, the steps also include:
(6) Within the same race, grouping fragments that have identical SNP sites into the same group.
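The grouping in steps (5) and (6) can be sketched as a two-level index over the fragments of one race. This is a minimal illustration under stated assumptions: fragments are tuples of (snp_site, base) pairs ordered 5' to 3', the function name group_fragments and its anchor_end parameter are hypothetical, and treating step (6) as a sub-grouping inside step (5) is an interpretation rather than something the text states.

```python
from collections import defaultdict

def group_fragments(fragments, anchor_end="5'"):
    """Group one race's fragments on two levels.

    Level 1 (step 5): the SNP site nearest the chosen end of the fragment.
    Level 2 (step 6): the full ordered tuple of SNP sites in the fragment.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for fragment in fragments:
        sites = tuple(site for site, _base in fragment)
        anchor = sites[0] if anchor_end == "5'" else sites[-1]
        groups[anchor][sites].append(fragment)
    return groups

# Toy fragments (hypothetical): f1 and f2 share sites but differ in one base.
f1 = (("rs1", "A"), ("rs2", "G"))
f2 = (("rs1", "T"), ("rs2", "G"))
f3 = (("rs1", "A"), ("rs2", "G"), ("rs3", "C"))
f4 = (("rs2", "G"), ("rs3", "C"))
groups = group_fragments([f1, f2, f3, f4], anchor_end="5'")
```

With this layout, fragments that start at the same SNP site sit together, and within that group the fragments covering exactly the same sites can be compared base by base.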
Further, in step (1), whole-genome data for each race are collected from the HapMap Project, the 1000 Genomes Project, and Qiyun Nuode.
The advantages of the invention include: in the constructed database, genetic information from the same haplotype is stored in fragments containing different numbers of SNP sites, which facilitates comparison with a haplotype under test; and fragments with the same SNP sites and the same bases at those sites are labeled with all corresponding race information, which avoids the loss of accuracy caused by incomplete ancestry information when searching for disease-associated genes and reduces the constraints on the development of molecular diagnosis, precision medicine, pharmacy, and personalized medication.
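One way to see why multi-size fragments with complete race labels help when a haplotype under test is compared against the database is the lookup sketch below. It is an illustration only: this text does not specify a query procedure, and the dictionary layout (fragment mapped to a set of race labels), the function name lookup_races, and the toy data are assumptions.

```python
def lookup_races(test_haplotype, database, frame_sizes):
    """Slide each frame over the test haplotype and report labeled matches.

    Returns (start_index, frame_size, sorted race labels) for every window
    whose fragment is present in the database.
    """
    hits = []
    for size in frame_sizes:
        for start in range(len(test_haplotype) - size + 1):
            fragment = tuple(test_haplotype[start:start + size])
            if fragment in database:
                hits.append((start, size, sorted(database[fragment])))
    return hits

# Toy database (hypothetical content): fragment -> set of race labels.
database = {
    (("rs1", "A"), ("rs2", "G")): {"race_A", "race_B"},
    (("rs2", "G"), ("rs3", "T")): {"race_A"},
    (("rs1", "A"), ("rs2", "G"), ("rs3", "T")): {"race_A"},
}
test_haplotype = [("rs1", "A"), ("rs2", "G"), ("rs3", "T"), ("rs4", "C")]
hits = lookup_races(test_haplotype, database, frame_sizes=(2, 3))
```

In the toy data the short fragment at the start matches two races, while the longer fragment covering the same region matches only one, which shows how the different frame sizes carry different amounts of ancestry information.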
Detailed Description
The present invention is described in detail below with reference to specific examples, which are given to illustrate and explain the invention and are not to be construed as limiting it.
Example 1
The method for constructing the haplotype ancestry database for multiple races comprises the following steps:
(1) Collecting whole-genome data for each race from databases containing race genome data, such as the HapMap Project, the 1000 Genomes Project, and Qiyun Nuode, and taking a single haplotype sequence as the sample unit;
(2) Setting an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, an extraction frame that extracts 22 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs; each extraction frame moves from the 5' end to the 3' end of the haplotype sequence one SNP at a time, extracts the SNP information of the fragment inside the frame, and labels each fragment with its corresponding race information; the extraction frames may move and extract simultaneously or in batches; within the same race, the fragments are temporarily stored in order according to the SNP site of each fragment closest to the 5' end, until every extraction frame has been moved across every haplotype sequence of every race;
(3) Comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites, which avoids the redundancy caused by repeated storage;
(4) Comparing fragments that cover the same SNP sites across races, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all corresponding race information;
(5) Within the same race, grouping fragments whose SNP site closest to the 5' end is identical into the same group;
(6) Within the same race, grouping fragments that have identical SNP sites into the same group.
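As a rough sense of scale for step (2): a frame of k consecutive SNPs moved one SNP at a time over a haplotype of n SNPs visits n - k + 1 positions. The sketch below applies that count to the four frame sizes of this example. The haplotype length of 1,000 SNPs is a made-up figure, and the counts are before the merging in step (3).

```python
def fragment_count(n_snps, frame_size):
    """Number of positions a frame of `frame_size` SNPs visits when moved
    one SNP at a time over a haplotype of `n_snps` SNPs."""
    return max(n_snps - frame_size + 1, 0)

frame_sizes = (20, 21, 22, 200)  # frame sizes named in Example 1
n_snps = 1000                    # hypothetical haplotype length, in SNPs

per_frame = {size: fragment_count(n_snps, size) for size in frame_sizes}
total = sum(per_frame.values())
```

For these assumed numbers, a single haplotype yields 981, 980, 979 and 801 fragments for the four frames, 3,741 in total.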
Example 2
This example differs from Example 1 as follows: in step (2), the extraction frames move from the 3' end to the 5' end of the haplotype sequence, and within the same race the fragments are temporarily stored in order according to the SNP site of each fragment closest to the 3' end; in step (5), fragments within the same race whose SNP site closest to the 3' end is identical are grouped into the same group.
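The effect of reversing the direction can be sketched in Python. Under the assumptions used here (a haplotype as a list of (snp_site, base) pairs ordered 5' to 3', and fragments kept in 5' to 3' order internally regardless of direction), the frame visits the same windows either way; what changes is the staging order and which end supplies the anchor SNP site. The function name extract_ordered and the toy data are hypothetical.

```python
def extract_ordered(haplotype, frame_size, direction="5'->3'"):
    """Return (anchor_site, fragment) pairs in the order the frame visits them.

    The anchor is the SNP site nearest the end the frame started from.
    """
    starts = range(len(haplotype) - frame_size + 1)
    if direction == "3'->5'":
        starts = reversed(starts)
    staged = []
    for start in starts:
        fragment = tuple(haplotype[start:start + frame_size])
        anchor = fragment[0][0] if direction == "5'->3'" else fragment[-1][0]
        staged.append((anchor, fragment))
    return staged

haplotype = [("rs1", "A"), ("rs2", "G"), ("rs3", "T"), ("rs4", "C")]
forward = extract_ordered(haplotype, 3, "5'->3'")
backward = extract_ordered(haplotype, 3, "3'->5'")
```

Here the forward pass stages fragments under anchors rs1 then rs2, and the backward pass stages the same two fragments under anchors rs4 then rs3.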
Example 3
This example differs from Example 1 in that the extraction frames in step (2) are set as: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
Example 4
This example differs from Example 2 in that the extraction frames in step (2) are set as: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
The database constructed by the invention contains comprehensive data and provides a basis for accurately determining the ancestry of admixed individuals. Genetic information from the same haplotype is stored in fragments containing different numbers of SNP sites, which facilitates the analysis of traits and diseases, helps protect individuals from being wrongly associated with genes linked to, for example, serious disease progression, avoids gene analysis being misdirected by an incorrectly determined ancestry, and actively promotes the development of molecular diagnosis, precision medicine, pharmacy, and personalized medication technology.
The technical solutions provided by the embodiments of the invention have been described in detail above, and specific examples have been used to explain the principles and implementations of those embodiments; the description of the embodiments is intended only to help in understanding these principles. Those skilled in the art may, on the basis of the embodiments, vary the specific implementations and the scope of application, and this description should not be construed as limiting the invention.

Claims (7)

1. A method for constructing a haplotype ancestry (progenitor source) database for multiple races, characterized in that
the method comprises the following steps:
(1) Collecting genome data of each race, and taking a single haplotype sequence as the sample unit;
(2) Setting an extraction frame that moves from one end of a haplotype sequence to the other, extracting the SNP information of the fragment inside the frame, labeling each fragment with its corresponding race information, and, within the same race, temporarily storing the fragments in order according to the SNP site of each fragment closest to its 5' end or 3' end, until the SNP information of every haplotype sequence of every race has been extracted;
the extraction frame moves from one end of the haplotype sequence to the other one SNP at a time;
the extraction frame moves from the 5' end to the 3' end of the haplotype sequence, or from the 3' end to the 5' end of the haplotype sequence;
the extraction frame is sized to extract 10 to 200 consecutive SNPs;
in step (2), two or more extraction frames of different sizes are each moved along the same haplotype sequence to extract its SNP information, until every extraction frame has been moved across every haplotype sequence of every race;
(3) Comparing fragments that cover the same SNP sites within the same race, and merging fragments that have the same SNP sites and the same bases at those sites;
(4) Comparing fragments that cover the same SNP sites across races, finding fragments that have the same SNP sites and the same bases at those sites, and labeling them with all corresponding race information.
2. The method for constructing a haplotype ancestry database for multiple races according to claim 1, wherein:
the two or more extraction frames of different sizes extract the SNP information of the same haplotype sequence either simultaneously or in batches.
3. The method for constructing a haplotype ancestry database for multiple races according to claim 1, wherein:
the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 21 consecutive SNPs, and an extraction frame that extracts 22 consecutive SNPs.
4. The method for constructing a haplotype ancestry database for multiple races according to claim 1, wherein:
the two or more extraction frames of different sizes are selected from: an extraction frame that extracts 20 consecutive SNPs, an extraction frame that extracts 50 consecutive SNPs, an extraction frame that extracts 80 consecutive SNPs, an extraction frame that extracts 120 consecutive SNPs, an extraction frame that extracts 160 consecutive SNPs, and an extraction frame that extracts 200 consecutive SNPs.
5. The method for constructing a haplotype ancestry database for multiple races according to claim 1, 3 or 4, wherein:
the method further comprises the step of:
(5) Within the same race, grouping fragments whose SNP site closest to the 5' end or the 3' end is identical into the same group.
6. The method for constructing a haplotype ancestry database for multiple races according to claim 5, wherein:
(6) Within the same race, fragments that have identical SNP sites are grouped into the same group.
7. The method for constructing a haplotype ancestry database for multiple races according to claim 1, wherein:
in step (1), whole-genome data for each race are collected from the HapMap Project, the 1000 Genomes Project, and Qiyun Nuode.
CN202210564500.2A 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people Active CN114783527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210564500.2A CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210564500.2A CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Publications (2)

Publication Number Publication Date
CN114783527A (en) 2022-07-22
CN114783527B (en) 2024-05-03

Family

ID=82408743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210564500.2A Active CN114783527B (en) 2022-05-23 2022-05-23 Construction method of haplotype progenitor source database of various people

Country Status (1)

Country Link
CN (1) CN114783527B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100067493A (en) * 2008-12-11 2010-06-21 한국생명공학연구원 The system to detect and analyze the genetic and disease information elements within families
CN101956006A (en) * 2010-08-27 2011-01-26 公安部物证鉴定中心 Method for obtaining race specific loci and race inference system and application thereof
KR20110133223A (en) * 2010-06-04 2011-12-12 대한민국 (식품의약품안전청장) Methods of prediction of dpd enzyme activity using haplotype in korean
CN109993305A (en) * 2018-01-03 2019-07-09 成都二十三魔方生物科技有限公司 Ancestral source polymorphism prediction technique based on big data intelligent algorithm
CN110491441A (en) * 2019-05-06 2019-11-22 西安交通大学 A kind of gene sequencing data simulation system and method for simulation crowd background information
CN111210874A (en) * 2020-01-07 2020-05-29 北京奇云诺德信息科技有限公司 Algorithm for performing ancestral source analysis prediction based on gene big data
CN112233724A (en) * 2020-10-16 2021-01-15 深圳市盛景基因生物科技有限公司 Ancestral polymorphism prediction method based on big data artificial intelligence algorithm
CN112885408A (en) * 2021-02-22 2021-06-01 中国农业大学 Method and device for detecting SNP marker locus based on low-depth sequencing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100037342A1 (en) * 2008-08-01 2010-02-11 Monsanto Technology Llc Methods and compositions for breeding plants with enhanced yield
US9449143B2 (en) * 2012-08-28 2016-09-20 Inova Health System Ancestral-specific reference genomes and uses thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
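Research status, challenges and prospects of DNA microhaplotypes (DNA微单倍型的研究现状、挑战与展望); Chen Feng (陈峰) et al.; Journal of Nanjing Medical University (Natural Sciences) (《南京医科大学学报(自然科学版)》), No. 08; full text *
EA-YPredictor: discriminant analysis software for lineage-specific haplogroup assignment based on Y-STR data (EA-YPredictor:基于Y-STR数据的家系特异性单倍群归属判别分析软件); Yin Caiyong (殷才湧) et al.; Forensic Science and Technology (《刑事技术》), No. 02; full text *
Liu Zhenwei (刘振伟) et al. Legal System for the Protection of New Plant Varieties (《植物新品种保护法律制度》). China Democracy and Legal System Publishing House (中国民主法治出版社), 2022, 1st edition, 1st printing, 127-130. *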

Also Published As

Publication number Publication date
CN114783527A (en) 2022-07-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240409

Address after: Unit 4062, Building 6, No. 16 Beitaiping Road, Haidian District, Beijing, 100000

Applicant after: Song Qing

Country or region after: China

Address before: 510300 unit 301, floor 3, production area, No. 1, helix 4th Road, Huangpu District (Guangzhou International Biological Island), Guangzhou City, Guangdong Province

Applicant before: Guangzhou Hongxi Jianshan Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant