CN117789822A - Biological individual geographic source positioning method based on multi-modal genetic information - Google Patents

Info

Publication number
CN117789822A
Authority
CN
China
Prior art keywords
biological
information
total
data
ancestral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311850019.0A
Other languages
Chinese (zh)
Inventor
范虹
姚昊天
王春年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202311850019.0A priority Critical patent/CN117789822A/en
Publication of CN117789822A publication Critical patent/CN117789822A/en
Pending legal-status Critical Current

Abstract

The invention provides a method for locating the geographic origin of a biological individual based on multi-modal genetic information. The method first constructs a multi-modal genetic information data set, which exploits SNP typing results from multiple angles and achieves higher inference or prediction accuracy than single-modality information. The method also applies principal component analysis, ancestral component analysis and ancestral homologous (IBD) fragment length analysis, which reduce the dimensionality of high-density SNP data and avoid the practical problem that high-dimensional site information is difficult to fit in methods based on population model assumptions and parameter estimation. The method evaluates the importance of each feature, which provides a reference for exploring the factors behind population stratification in population genetics. The method applies to a wider range of regions and populations, can process high-density SNP genotype databases with large numbers of samples, makes comprehensive use of multi-modal genetic data, and does not require extensive prior knowledge of the populations involved.

Description

Biological individual geographic source positioning method based on multi-modal genetic information
Technical Field
The invention relates to biological geographic position prediction, in particular to a biological individual geographic source positioning method based on multi-mode genetic information.
Background
Locating the geographic origin of a biological individual covers two main tasks, biogeographical ancestry (Biogeographical Ancestry, BGA) inference and prediction of the individual's biogeographical position, and aims to determine the individual's ancestral population or geographic origin from its genetic information. In recent years, as the world has become more closely connected and people more mobile, analysis of geographic origin has become increasingly important. Knowing an individual's geographic origin helps to understand the history of human migration and population structure in depth, and assists in solving major criminal cases that span regions and countries. Typical approaches are traditional forensic inference methods and methods based on high-density genetic marker loci.
SNP stands for single nucleotide polymorphism (Single Nucleotide Polymorphism) and refers to a DNA sequence polymorphism at the genomic level caused by variation among the four bases A, T, C and G at a single nucleotide. Within a species or population, SNPs change the DNA sequence so that different bases occur at the same position on the genome.
SNP genotyping refers to determining the base pair present at a SNP site, with a total of 4*4 = 16 possible outcomes, excluding undetected cases. Differences in genotype may lead to differences in the phenotype of the samples. High-throughput SNP detection methods in particular have been widely used in bioinformatics analysis.
SNPs are abundant in the human genome and present in all populations. Because of their high genetic stability, low mutation rate, short amplified fragments required for site detection, uniform distribution and rich polymorphism, they are good genetic markers that are widely used in bioinformatics analysis.
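As a small illustration of the count given above, the following sketch enumerates the base pairs possible at a single SNP site. Treating the pair as ordered is an assumption made here only to reproduce the 4*4 = 16 figure, and the function name is hypothetical.

```python
from itertools import product

# The four bases named in the text.
BASES = ("A", "T", "C", "G")

def ordered_genotypes():
    """Return every ordered pair of bases at one SNP site (assumption: order matters)."""
    return ["".join(pair) for pair in product(BASES, repeat=2)]

genotypes = ordered_genotypes()
print(len(genotypes), genotypes)
```

An undetected call would be a seventeenth, separate state, which matches the "excluding undetected cases" wording.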
Geographically and historically, because geographically nearby individuals mate with one another and their offspring are genetically similar, most organisms are more closely related to nearby members of their species than to distant ones. On the other hand, relative isolation caused by geography, culture, history and other factors leads populations to differentiate. Such similarities and differences, which may be expressed geographically or morphologically, are also reflected in the genome through inheritance, mutation and gene exchange between individuals. This makes it possible to predict the origin of a sample whose genetic information is known but whose ancestral origin is not, by comparing it with a set of samples whose genetic information and geographic origin are both known.
Traditional forensic genetic inference methods are based on a small number of low-density genetic markers from part of the population, such as ancestry informative SNPs (AISNPs). Specific sites are screened from the typing information, a reference data set is built for the selected regions and populations, the SNP typing data at those sites are compared with the reference populations for similarity, and cluster ancestry components are calculated through random match probability to give an inferred population origin. However, existing commercial site panels contain few sites and are regionally limited; they generally do not distinguish populations at the intra-continental level, may fail to capture fine population substructure, and struggle to achieve finer-grained distinctions.
With advances in gene sequencing technology, next generation sequencing technology (next generation sequencing technology, NGS, also known as high throughput sequencing) is widely used, and the number of available high density datasets is increasing dramatically. For this reason, there is an urgent need to build more complex models using high-density genetic data for finer genetic localization. At present, two methods for positioning biological individual geographic sources based on high-density SNP genotype data are mainly available:
The first is based on probabilistic and statistical population model assumptions and parameter estimation. Starting from existing knowledge of population genetics, it makes prior, idealized and explicit model assumptions about the population (such as the geographic distribution spectrum of SNP genotypes), estimates the model parameters from existing biological data under those assumptions, and finally obtains a model that describes how individuals are distributed over time or space under a specific genetic variation scenario. However, as genetic localization is studied in more depth, the required localization accuracy increases, which calls for higher-dimensional biological features and more complex model assumptions. Moreover, when the input data have a large number of features, the "curse of dimensionality" arises and the model becomes difficult to fit.
The second is based on machine learning. It is oriented to algorithmic modeling and aims at optimizing the algorithm, can to some extent avoid idealized parametric data models, and focuses on extracting hidden, useful and understandable knowledge from large amounts of data. Machine learning tools can learn the regularities and phenomena accumulated in large volumes of biological data, and can let the data suggest the model rather than relying on prior knowledge. Machine learning methods also scale and adapt well to the steadily increasing dimensionality of modern genomic data.
Currently, machine learning-based methods mainly process SNP typing data with principal component analysis, STRUCTURE ancestral component analysis and similar methods, which are used to find unknown correlations between individuals. However, although each of these methods yields one modality of information, current models use only a single modality, and they retain only a few feature dimensions (such as 2 to 3 principal components), so data utilization and prediction accuracy still need to be improved.
With the development and maturity of second-generation sequencing and high-density chip detection technologies, rapid SNP typing of DNA samples is now generally achievable. However, the existing methods and models apply to a narrow range of regions and populations, adapt poorly to high-density SNP data, interpret SNP information from a single angle, and make little use of the biogeographic information that can be mined. How to reflect more dimensions of SNP information from multiple angles in ancestry inference or geographic prediction, how to use multi-modal biological information together in a practical inference or prediction model, and how to evaluate the importance of each kind of information for locating the geographic origin of a biological individual all remain to be studied.
Disclosure of Invention
The invention aims to provide a method for locating the geographic origin of a biological individual based on multi-modal genetic information, so as to solve the problems of poor adaptability to high-density SNP data, limited regional applicability and insufficient use of multi-modal biological information in the prior art.
The invention is realized by the following technical scheme:
a method for locating a geographic source of a biological individual based on multimodal genetic information, comprising:
obtaining SNP (single nucleotide polymorphism) typing data of a reference sample and corresponding biological ancestor origin or biological geographic position; obtaining SNP typing data of a sample to be tested; combining SNP typing data of a sample to be detected and SNP typing data of a reference sample, and marking sample types to obtain combined total SNP typing data;
performing principal component analysis on the combined total SNP typing data to obtain the information of all PC dimensions, selecting part of the PC dimensions according to explained variance, and obtaining the information on the selected PC dimensions; performing ancestral component analysis on the combined total SNP typing data to obtain ancestral component proportion information, and selecting part of the ancestral component proportion information according to the mean and standard deviation of the ancestral component proportions; calculating the ancestral homologous fragment length information of the combined total SNP typing data, and selecting part of the ancestral homologous fragment length information according to the mean and standard deviation of the ancestral homologous fragment lengths;
combining the three modalities, namely the information on the selected PC dimensions, the selected ancestral component proportion information and the selected ancestral homologous fragment length information, to obtain a combined multi-modal data set; standardizing and normalizing the combined multi-modal data set to obtain an available total data set;
dividing the reference samples in the available total data set into a training set and a test set, taking all dimensions in the available total data set as features, and performing feature importance evaluation using the training set and the test set; determining a feature set based on the feature importance evaluation; extracting from the available total data set the information on the dimensions in the feature set to obtain a selected available total data set;
locating the geographic origin of the biological individual based on the selected available total data set.
Preferably, the principal component analysis of the combined total SNP typing data is specifically: principal component analysis is performed on the combined total SNP typing data using the PCA function of the GCTA software.
Preferably, the ancestral component analysis of the combined total SNP typing data is specifically: ancestral component analysis is performed on the combined total SNP typing data using the ADMIXTURE software.
Preferably, the ancestral homologous fragment length information of the combined total SNP typing data is calculated as follows: the SNP typing data of the reference samples are phased (homologous chromosomes separated) using the EAGLE software to obtain typing results on each chromosome; identical DNA sequences between all pairs of reference samples are identified from the typing results on each chromosome using the GERMLINE software, and the lengths of the IBD fragments found are recorded; after the IBD fragments on all chromosomes are merged, the average IBD fragment length between each reference sample and the other reference samples is calculated with the ERSA software, grouped by the biological ancestry information of the reference samples. The SNP typing data of the sample to be tested are phased using the EAGLE software to obtain typing results on each chromosome; identical DNA sequences between the sample to be tested and each reference sample are identified from the typing results on each chromosome using the GERMLINE software, and the lengths of the IBD fragments found are recorded; after the IBD fragments on all chromosomes are merged, the average IBD fragment length between each sample to be tested and the reference samples is calculated with the ERSA software, grouped by biological ancestry information.
Preferably, selecting part of the PC dimensions according to explained variance and obtaining the information on the selected PC dimensions is specifically: calculating the explained variance of each PC dimension from the information of all PC dimensions, selecting the first several PC dimensions whose cumulative explained variance exceeds 60%, and obtaining the information on those PC dimensions.
Preferably, standardizing and normalizing the combined multi-modal data set is specifically: standardizing and then normalizing each of the three modalities in the combined multi-modal data set separately to obtain results for the three modalities, and then standardizing and normalizing the results of the three modalities together.
Further, standardizing and then normalizing each of the three modalities in the combined multi-modal data set separately is specifically: standardizing and then normalizing the information on the selected PC dimensions, the selected ancestral component proportion information and the selected ancestral homologous fragment length information separately, so that the mean is 0 and the variance is 1, and the values are then compressed to between 0 and 1.
Preferably, dividing the reference samples in the available total data set into a training set and a test set is specifically: the reference samples in the available total data set are divided into a training set and a test set by stratified sampling.
Preferably, performing feature importance evaluation using the training set and the test set, with all dimensions in the available total data set as features, is specifically: performing feature importance evaluation using the training set and the test set based on Lasso multinomial logistic regression, with all dimensions in the available total data set as features;
locating the geographic origin of the biological individual based on the selected available total data set is specifically: taking the samples to be tested in the selected available total data set as test samples and all reference samples in the selected available total data set as training samples, and using Lasso multinomial logistic regression to obtain the probability that each sample to be tested belongs to each ancestry category of the reference samples.
Preferably, performing feature importance evaluation using the training set and the test set, with all dimensions in the available total data set as features, is specifically: performing feature importance evaluation using the training set and the test set based on the LightGBM method, with all dimensions in the available total data set as features;
locating the geographic origin of the biological individual based on the selected available total data set is specifically: taking the samples to be tested in the selected available total data set as test samples and all reference samples in the selected available total data set as training samples, and applying the LightGBM method twice to obtain the longitude and the latitude of the biogeographical position coordinates of each sample to be tested.
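The stratified division of reference samples into a training set and a test set mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the 20% test fraction and the fixed seed are assumptions and are not specified in the disclosure.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each ancestry class keeps roughly the same share in both sets.

    labels: one ancestry label per reference sample.
    test_fraction and seed are illustrative assumptions.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical labels: 10 samples of class "A" followed by 5 of class "B".
labels = ["A"] * 10 + ["B"] * 5
train_idx, test_idx = stratified_split(labels)
print(train_idx, test_idx)
```

Sampling within each class separately keeps rare ancestry categories represented in both sets, which a plain random split does not guarantee.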
Compared with the prior art, the invention has the following beneficial effects:
First, a multi-modal genetic information data set (namely the available total data set) is constructed, which exploits SNP typing results from multiple angles, uses the multi-dimensional information contained in SNP typing more fully, and achieves higher inference or prediction accuracy than single-modality information. The method also applies principal component analysis, ancestral component analysis and ancestral homologous fragment length analysis, which reduce the dimensionality of high-density SNP data and avoid the practical problem that high-dimensional site information is difficult to fit in methods based on population model assumptions and parameter estimation. The method evaluates the importance of each feature, which provides a reference for exploring the factors behind population stratification in population genetics. The method applies to a wider range of regions and populations, can process high-density SNP genotype databases with large numbers of samples, makes comprehensive use of multi-modal genetic data, does not require extensive prior knowledge of the populations involved, and to some extent overcomes the regional and population-specific limitations of population inference methods in traditional forensic genetics. In the problem of locating the geographic origin of a biological individual, the invention thus addresses the narrow range of applicable regions and populations, the heavy reliance on prior knowledge of population genetics, the poor adaptability to high-dimensional data, and the one-sided view of high-density SNP genotype data given by single-modality genetic information.
Drawings
FIG. 1 is a schematic diagram of one embodiment of a method of constructing a multimodal genetic information dataset of the present invention;
FIG. 2 is a schematic diagram of an embodiment of the method for locating the geographic origin of a biological individual of the present invention.
Detailed Description
The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are evidently only some, not all, of the embodiments of the invention. The data types and the feature selection criteria for each modality in the embodiments generally described and illustrated in the figures herein may be configured in a variety of ways. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items. In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
With the development of high-throughput SNP microarray technology and second-generation sequencing (Next Generation Sequencing, NGS), SNP typing results for high-density genetic loci can now be obtained rapidly, accurately and at lower cost.
A microarray, also called an oligonucleotide array, is a type of biochip. Gene probes with known sequences are fixed on a solid surface; a large number of labeled nucleic acid sequences from the cells or tissues under test are hybridized with the probes, and genetic information is read out rapidly by detecting the hybridized probes at the corresponding positions. Commercial microarray technology is now mature and can accurately type millions of SNP loci at one time.
The core of second-generation sequencing is sequencing by synthesis: the DNA sequence is determined by capturing the labels at the newly synthesized ends. The second-generation sequencing technology in use today is low in cost, high in throughput, fast and convenient to operate, and is widely used in large genome studies. Typing whole-genome SNPs with second-generation sequencing detects SNPs across the whole genome while achieving high accuracy.
These underlying technologies make large numbers of accurate SNP typing results available.
The embodiment of the invention provides a method for constructing a multi-mode genetic information data set based on high-density SNP, which constructs a database of a large number of reference crowd samples based on SNP typing results of samples and conveniently acquires SNP typing data of samples to be detected so as to facilitate subsequent application.
As shown in fig. 1, the construction method includes:
S110, acquiring reference sample data covering a plurality of populations, wherein the reference sample data comprise the SNP (single nucleotide polymorphism) typing data of each reference sample and its biological ancestry or biogeographical position.
The SNP typing data of the reference samples can be obtained by genetic testing in one or more of the ways disclosed in this embodiment and then converted into PLINK format to obtain user.bed/bim/fam files. The biological ancestry of a reference sample can be obtained through questionnaires, household registration records, interviews and the like; the biogeographical position takes the form of the geographic coordinates of the center of the family, village or city to which the reference sample belongs. The two types of information are recorded in the ancestry.txt and location.txt files respectively. ancestry.txt contains only 1 column, with rows in the order of the reference samples in the user.fam file, and records the biological ancestry of each reference sample; location.txt contains 2 columns, with rows in the same order, where column 1 records the longitude and column 2 records the latitude of the biogeographical position.
The reference sample data can also be obtained directly from public sources such as public human genetic databases, which generally record the biological ancestry or sampling location of each sample; in that case the biogeographical position is taken as the geographic center of the lowest-level region recorded.
S120, acquiring data of a sample to be detected, wherein the data of the sample to be detected only comprise SNP typing data of the sample to be detected.
The SNP typing data of the sample to be tested can be obtained by genetic testing in one or more of the ways disclosed in this embodiment and then converted into PLINK format to obtain ref.bed/bim/fam files.
S130, combining SNP (single nucleotide polymorphism) genotyping data of a sample to be tested and SNP genotyping data of a reference sample, and marking the type of the sample to obtain combined total SNP genotyping data.
The SNP typing data of the sample to be tested and of the reference samples are merged using the merge function of the PLINK software to obtain the combined total SNP typing data (merge.bed/bim/fam files). PLINK is free, open-source whole-genome association analysis software developed at Harvard University, and can be used for basic file format conversion, quality control, sample merging and the like.
Following the sample order in merge.fam, the sample type is recorded line by line in a sample_type.txt file, with samples to be tested marked as 0 and reference samples marked as 1.
S140, calculating the results of the combined total data set under each modality (the PC information modality, the ancestral component proportion information modality and the ancestral homologous fragment length information modality).
S141, calculating the PC information of the combined total SNP typing data in principal component analysis (Principal Component Analysis, PCA), that is, the information of all PC dimensions $PC_1$ to $PC_q$, where $q$ is the smaller of the number of samples and the number of SNP typing sites.
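A merge of this kind can be sketched as a PLINK command line assembled in Python. The prefixes user, ref and merge follow the file names used in this embodiment; the choice of the `--bfile`, `--bmerge`, `--make-bed` and `--out` options is an assumption about how the merge might be invoked, since the disclosure does not give the exact command.

```python
import shlex

def plink_merge_command(reference_prefix="user", test_prefix="ref", out_prefix="merge"):
    """Build a PLINK command that merges two binary filesets (not executed here)."""
    return [
        "plink",
        "--bfile", reference_prefix,
        "--bmerge", f"{test_prefix}.bed", f"{test_prefix}.bim", f"{test_prefix}.fam",
        "--make-bed",
        "--out", out_prefix,
    ]

command = plink_merge_command()
print(shlex.join(command))
```

The list form can be passed to `subprocess.run(command, check=True)` on a machine where PLINK is installed.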
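The labelling step above can be sketched as follows, assuming the usual PLINK .fam layout in which the first two whitespace-separated fields are the family ID and the individual ID. The function name and the example IDs are hypothetical.

```python
def sample_type_labels(merged_fam_lines, reference_ids):
    """Return one label per merged .fam row: 1 for reference samples, 0 for samples to be tested.

    reference_ids: set of (family_id, individual_id) pairs taken from the reference .fam file.
    """
    labels = []
    for line in merged_fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        family_id, individual_id = fields[0], fields[1]
        labels.append(1 if (family_id, individual_id) in reference_ids else 0)
    return labels

# Hypothetical merged .fam content: two reference samples and one sample to be tested.
merged_fam = ["F1 r1 0 0 1 -9", "F2 q1 0 0 2 -9", "F3 r2 0 0 1 -9"]
reference = {("F1", "r1"), ("F3", "r2")}
labels = sample_type_labels(merged_fam, reference)
print("\n".join(str(label) for label in labels))
```

Writing the printed lines to sample_type.txt keeps the labels in the same row order as the merged .fam file.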
Principal component analysis is performed on the combined total SNP typing data using the PCA function of the GCTA (Genome-wide Complex Trait Analysis) software to obtain the information of all PC dimensions (merge.eigenvec/eigenval). GCTA is a genome-wide complex trait analysis tool developed by Yang et al.; it can perform principal component analysis, projecting high-dimensional SNP loci onto a small number of low-dimensional PC variables.
S142, calculating the proportion information of each ancestral component in the ancestral component analysis of the combined total SNP typing data.
Ancestral component analysis is performed on the combined total SNP typing data using the ADMIXTURE software to obtain the ancestral component proportion information (merge.Q file) at the optimal value of K. ADMIXTURE is open-source software, developed at UCLA, for estimating ancestry from SNP typing data. Its input is a binary PLINK file, and each line of the resulting file has the form $(y_1, y_2, \ldots, y_n)$, where $n$ is the optimal value of the parameter K and each element is the proportion of one ancestral component; for each sample, the proportions of the $n$ ancestral components sum to 1.
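The row structure of the ancestral component proportion file can be checked with a short sketch. It assumes a whitespace-separated text layout with one sample per line; the function names, the tolerance and the example values are hypothetical.

```python
def parse_q_rows(text):
    """Parse whitespace-separated proportion rows, one sample per line."""
    return [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]

def rows_sum_to_one(rows, tolerance=1e-4):
    """Check that each sample's ancestral component proportions sum to 1."""
    return all(abs(sum(row) - 1.0) <= tolerance for row in rows)

# Hypothetical proportions for two samples with K = 3.
example = "0.70 0.20 0.10\n0.05 0.90 0.05\n"
q_rows = parse_q_rows(example)
print(len(q_rows), len(q_rows[0]), rows_sum_to_one(q_rows))
```

A row that fails this check usually points to a truncated or misaligned file rather than a modelling issue.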
S143, calculating the length information of the combined total SNP typing data on the ancestral homology (Identical By Descent, IBD) fragments.
IBD means that a shared DNA region or shared allele in descendants originates from a common ancestor; the genetic distance between samples can be described by the length of the fragments they share by descent. The SNP typing data of the reference samples are phased (homologous chromosomes separated) using the EAGLE software to obtain typing results in VCF format on each chromosome (phased.chrX); the VCFtools software is then used to convert the VCF data into PLINK format to obtain phased.ped/map. From the typing results on each chromosome, the GERMLINE software identifies identical DNA sequences between all pairs of reference samples, and the lengths of the IBD fragments found are recorded. After the IBD fragments on all chromosomes are merged, the ERSA software calculates, grouped by the biological ancestry recorded in ancestry.txt, the average IBD fragment length between each reference sample and each population, and the results are recorded in idb_length.txt. For the SNP typing data of the sample to be tested, identical DNA sequences between the sample to be tested and each reference sample are obtained in the same way, and the lengths of the IBD fragments found are recorded and appended to idb_length.txt. The idb_length.txt file has p columns, where each column holds a sample's average IBD fragment length with respect to one ancestry type and p is the number of distinct biological ancestry categories among the reference samples.
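The per-ancestry averaging that produces one row of p values can be sketched as follows. This is not the ERSA computation itself: the sketch assumes that "average" means the mean length over detected segments shared with reference samples of each ancestry class, and that a class with no shared segment contributes 0.0. All names and values are hypothetical.

```python
from collections import defaultdict

def mean_ibd_by_ancestry(sample, segments, ancestry):
    """Return the mean IBD segment length between `sample` and each ancestry class.

    segments: iterable of (sample_a, sample_b, length) tuples.
    ancestry: dict mapping reference sample ID to ancestry class.
    The result has one value per class, in sorted class order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sample_a, sample_b, length in segments:
        if sample == sample_a:
            other = sample_b
        elif sample == sample_b:
            other = sample_a
        else:
            continue  # segment does not involve this sample
        if other in ancestry:
            totals[ancestry[other]] += length
            counts[ancestry[other]] += 1
    classes = sorted(set(ancestry.values()))
    return [totals[c] / counts[c] if counts[c] else 0.0 for c in classes]

# Hypothetical data: three reference samples in two classes and one query sample "q".
ancestry = {"r1": "north", "r2": "north", "r3": "south"}
segments = [("q", "r1", 4.0), ("q", "r2", 6.0), ("r3", "q", 2.0), ("r1", "r2", 9.0)]
row = mean_ibd_by_ancestry("q", segments, ancestry)
print(row)
```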
S150, analyzing the calculation results under each mode and taking a specific part or dimension.
From merge.eigenvec/eigenval, the explained variance of each PC dimension is calculated, the first several PC dimensions ($PC_1$ to $PC_m$) are taken, and the information on the selected PC dimensions is obtained. On each selected PC dimension $PC_i$ ($i = 1, \ldots, m$), the mean $\mu_i$ and standard deviation $\sigma_i$ are computed within the samples of each ancestry class; samples whose deviation $|x - \mu_i|$ exceeds $3\sigma_i$ are removed together with their corresponding other values, the related files are modified synchronously, and finally a merge_new.eigenvec file is generated.
From the merge.Q file (the proportion of each ancestral component) and the biological ancestry categories recorded in ancestry.txt, the mean $\mu_i$ and standard deviation $\sigma_i$ of the proportion of the $i$-th ancestral component are computed within each category ($i = 1, \ldots, n$, where $n$ is the optimal value of the parameter K in the ancestral component analysis); samples whose deviation $|y - \mu_i|$ exceeds $3\sigma_i$ are removed together with their corresponding other values, the ancestry.txt file is modified synchronously, and finally a new merge_new.Q file is generated.
From the idb_length.txt file (the average IBD fragment length of each sample with respect to each ancestry type) and the biological ancestry categories recorded in ancestry.txt, the mean $\mu_i$ and standard deviation $\sigma_i$ of the $i$-th IBD fragment length are computed within each category ($i = 1, \ldots, p$, where $p$ is the number of distinct biological ancestry categories among the reference samples); samples whose deviation $|z - \mu_i|$ exceeds $3\sigma_i$ are removed together with their corresponding other values, the ancestry.txt file is modified synchronously, and finally a new idb_length_new.txt file is generated.
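Choosing how many leading PC dimensions to keep can be sketched from the eigenvalues alone. The 60% threshold is the preferred value stated earlier in this disclosure; the function name and the example eigenvalues are hypothetical, and the sketch assumes the full list of eigenvalues is supplied so that the total variance is correct.

```python
def leading_pc_count(eigenvalues, threshold=0.60):
    """Return the smallest m such that PC_1..PC_m explain more than `threshold` of the variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(eigenvalues, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues, already sorted in descending order.
m = leading_pc_count([5.0, 3.0, 1.0, 1.0])
print(m)
```

Here the first component explains 50% and the first two explain 80%, so two dimensions are kept.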
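The within-class three-sigma screening applied to each modality can be sketched for a single dimension as follows. Using the population standard deviation is an assumption (the disclosure does not say which estimator is used), and the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def three_sigma_keep_mask(values, classes):
    """Return True for samples within 3 standard deviations of their class mean on one dimension."""
    groups = defaultdict(list)
    for value, label in zip(values, classes):
        groups[label].append(value)
    # Assumption: population standard deviation within each ancestry class.
    stats = {label: (mean(vs), pstdev(vs)) for label, vs in groups.items()}
    return [abs(value - stats[label][0]) <= 3 * stats[label][1]
            for value, label in zip(values, classes)]

# Hypothetical data: class "A" has 19 typical values and one extreme value; class "B" has none.
values = [0.0] * 19 + [10.0] + [1.0, 1.2, 0.8]
classes = ["A"] * 20 + ["B"] * 3
mask = three_sigma_keep_mask(values, classes)
print(mask.count(False))
```

For several dimensions, the masks can be combined with a logical AND so that a sample is dropped from every modality file at once, which keeps the files row-aligned.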
S160, merging the data of all modalities (namely the information on the PC dimensions, the ancestral component proportion information and the ancestral homologous fragment length information), labeling the meaning of each dimension, and obtaining the merged multi-modal data set.
The merge_new.eigenvec (information on the selected PC dimensions), merge_new.Q (proportion of each ancestral component) and idb_length_new.txt (average IBD fragment length of each sample for each ancestry type) files are joined column-wise, row by row, to obtain the merged feature data of all modalities (multimodal data file) and a multimodal features file that labels the meaning of each feature. Together, these two files form the merged multi-modal data set, which can be written as

$$[(x_{11}, x_{12}, \ldots, x_{1m}, y_{11}, y_{12}, \ldots, y_{1n}, z_{11}, z_{12}, \ldots, z_{1p}),\ (x_{21}, x_{22}, \ldots, x_{2m}, y_{21}, y_{22}, \ldots, y_{2n}, z_{21}, z_{22}, \ldots, z_{2p}),\ \ldots,\ (x_{k1}, x_{k2}, \ldots, x_{km}, y_{k1}, y_{k2}, \ldots, y_{kn}, z_{k1}, z_{k2}, \ldots, z_{kp})]$$

where x, y and z are the values of each sample on the dimensions of the respective modalities, m, n and p are the numbers of dimensions of the three modalities, and k is the serial number of the sample.
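The merge itself is a per-sample concatenation of the three feature vectors. A minimal sketch, assuming each modality has already been loaded into a dictionary keyed by sample ID; the function name `merge_modalities` and the feature labels `PC1`, `Q1`, `IBD1` and so on are hypothetical naming choices, not taken from the source.

```python
def merge_modalities(pc, q, ibd):
    """Concatenate three modalities sample by sample.

    pc, q, ibd -- dicts mapping sample ID -> list of values for one modality
    Returns (names, rows): the feature labels and, for every sample present
    in all three modalities, the row (x_1..x_m, y_1..y_n, z_1..z_p).
    """
    shared = sorted(set(pc) & set(q) & set(ibd))
    if not shared:
        raise ValueError("no sample is present in all three modalities")
    first = shared[0]
    names = (
        [f"PC{i}" for i in range(1, len(pc[first]) + 1)]
        + [f"Q{i}" for i in range(1, len(q[first]) + 1)]
        + [f"IBD{i}" for i in range(1, len(ibd[first]) + 1)]
    )
    rows = {s: list(pc[s]) + list(q[s]) + list(ibd[s]) for s in shared}
    return names, rows
```

Restricting the result to samples found in all three inputs matters here because the outlier filtering of the previous step may have removed different samples from different modalities.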
S170, standardizing and normalizing the merged multi-modal data set to obtain the total available data set.
Each modality (namely the information on the PC dimensions, the ancestral component proportions and the ancestral homologous fragment lengths) is first standardized and normalized separately, so that within each modality the data have a mean of 0 and a variance of 1 and the values are then compressed to between 0 and 1 (for example, for the PC information). All dimensions of all modalities are then standardized and normalized together to obtain the total available data set; the data layout remains unchanged, and the result is written to multimodal_new.
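The two scaling operations can be sketched as a z-score transform followed by min-max rescaling. This is an assumption about the intended formulas, since the source formula is not recoverable; the helper names are hypothetical, and the sketch scales one column at a time, whereas the text describes scaling per modality and then across all dimensions.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score: shift to mean 0 and scale to variance 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


def min_max(column):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def scale_column(column):
    """Standardize first, then compress the result into [0, 1]."""
    return min_max(standardize(column))
```

Note that the mean-0, variance-1 property holds only after the first transform; once the values are compressed to the range 0 to 1, only the relative spacing of the samples is preserved.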
As shown in Fig. 2, the multi-modal method for locating the geographic origin of a biological individual (including biological geographic ancestry inference and biological geographic position prediction for an individual) comprises the following steps:
S210, dividing the reference samples in the total available data set into a training set and a test set by stratified sampling, and labeling them respectively.
When the total available data set contains enough reference samples, they may be divided in a fixed proportion for training and testing the subsequent inference or prediction model. Specifically, the reference samples may be divided into a training set and a test set at a ratio of 0.7:0.3, used for training and testing respectively.
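A stratified 0.7:0.3 split keeps the proportion of each ancestry category roughly equal in both sets. The sketch below uses only the standard library; the function name `stratified_split`, the fixed seed and the rounding rule for the per-category cut are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(sample_ids, labels, train_fraction=0.7, seed=0):
    """Split samples into train and test lists, category by category.

    sample_ids -- list of sample identifiers
    labels     -- ancestry category of each sample, same order
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, label in zip(sample_ids, labels):
        by_label[label].append(sample_id)

    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)  # random assignment within the category
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

With 20 samples of one category and 10 of another, the training set receives 14 and 7 samples respectively, so both sets keep the 2:1 category ratio.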
S220, taking all dimensions in the total available data set as features, and performing preliminary feature screening and feature importance evaluation with the training set and the test set.
S221, for the biological geographic ancestry inference problem, taking all dimensions in the total available data set as features, and performing preliminary feature screening and feature importance evaluation with the training set and the test set based on the Lasso multinomial logistic regression method.
The Lasso multinomial logistic regression method is based on multinomial logistic regression. Given a set of independent variables (which may be real-valued, binary or categorical), multinomial logistic regression predicts the probability of each possible outcome of a categorical dependent variable. The Lasso variant adds the L1 regularization term $\alpha\lVert\beta\rVert_1 = \alpha\sum_i\lvert\beta_i\rvert$, which automatically shrinks the coefficients of unimportant variables to 0 and thereby achieves preliminary feature screening. It is implemented in a Python environment with the Lasso-related functions in the linear_model subpackage of the machine learning package scikit-learn, using LassoCV (a Lasso linear model with iterative fitting along a regularization path), which selects the hyperparameter by cross-validation so that a suitable α value is chosen automatically. Multi-class prediction of the biological geographic ancestry is performed with the training set and the test set, with the sample labels being the values of the single column in accesstry.txt, and the feature variables whose coefficients are not 0 (i.e., not screened out) are retained.
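One way to realise the L1-penalised multi-class logistic regression described above is sketched below. It is an assumption of this example, not the source's stated implementation, that scikit-learn's `LogisticRegression` with an L1 penalty and the `saga` solver is used in place of `LassoCV` (which fits a linear regression rather than a classifier); `LogisticRegressionCV` could be substituted to tune the penalty by cross-validation. The data are synthetic, and the value of `C` (the inverse of the regularization strength) is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def l1_multinomial_screen(X, y, C=0.5, seed=0):
    """Fit an L1-penalised multi-class logistic regression and return
    the model plus the indices of features with a non-zero coefficient."""
    model = LogisticRegression(
        penalty="l1", solver="saga", C=C, max_iter=5000, random_state=seed
    )
    model.fit(X, y)
    # coef_ has one row per class; a feature survives if any class uses it.
    kept = np.flatnonzero(np.any(model.coef_ != 0, axis=0))
    return model, kept


# Synthetic stand-in data: 3 ancestry categories, 3 informative features
# followed by 5 pure-noise features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)
informative = (np.eye(3) * 3.0)[y] + rng.normal(scale=0.5, size=(120, 3))
noise = rng.normal(size=(120, 5))
X = np.hstack([informative, noise])

model, kept = l1_multinomial_screen(X, y)
proba = model.predict_proba(X[:2])  # one probability per ancestry category
```

Each row of `proba` sums to 1, which is the same form as the per-category probability vector reported for a sample to be tested.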
S222, for the biological geographic position prediction problem, taking all dimensions in the total available data set as features, and performing feature importance evaluation with the training set and the test set based on the LightGBM method.
LightGBM (Light Gradient Boosting Machine) is a high-accuracy, high-speed gradient boosting framework released by Microsoft. In a Python environment with the lightgbm package, training and regression prediction are performed with the training set and the test set, with the objective parameter set to regression. Regression is run twice, once for longitude and once for latitude, the sample labels for the two runs being the longitude column and the latitude column in location. After the two models are trained, the feature importance for the longitude and latitude prediction problems can be obtained by calling the plot_importance function on each trained model, and the feature names and their importance scores are saved.
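The two-run structure (one regression for longitude, one for latitude, each yielding its own feature importances) can be sketched as follows. To keep the example dependent on scikit-learn only, `GradientBoostingRegressor` is used here as a stand-in for LightGBM; this is a substitution made for illustration, although `lightgbm.LGBMRegressor` offers a comparable `fit` method and `feature_importances_` attribute. The data and coordinate values are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def fit_coordinate_models(X, longitude, latitude, seed=0):
    """Train one gradient-boosted regressor per coordinate."""
    models = {}
    for name, target in (("longitude", longitude), ("latitude", latitude)):
        regressor = GradientBoostingRegressor(random_state=seed)
        regressor.fit(X, target)
        models[name] = regressor
    return models


# Synthetic stand-in data: feature 0 drives longitude, feature 1 latitude.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
longitude = 100.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
latitude = 30.0 + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = fit_coordinate_models(X, longitude, latitude)
importances = {name: m.feature_importances_ for name, m in models.items()}
predicted = (models["longitude"].predict(X[:1])[0],
             models["latitude"].predict(X[:1])[0])
```

Because the two models are trained independently, a feature may rank highly for one coordinate and not the other, which is why the importances are stored separately.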
S230, determining a final feature set based on feature importance evaluation.
The features retained by the preliminary screening of the Lasso multinomial logistic regression method are sorted by coefficient in descending order, and the top 50% of features in each modality are taken as the final feature set. For LightGBM, the features are sorted by feature importance score in descending order, and the top 50% of features in each modality are taken as the final feature set.
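The per-modality top-50% rule can be written as a small ranking helper. In this sketch the function name `top_half_per_modality` is hypothetical, and two details are assumptions: coefficients are ranked by absolute value, and an odd number of features is rounded up.

```python
import math
from collections import defaultdict


def top_half_per_modality(scores, modality_of):
    """Keep the highest-scoring half of the features in each modality.

    scores      -- dict mapping feature name -> coefficient or importance
    modality_of -- dict mapping feature name -> modality label
    """
    groups = defaultdict(list)
    for name, score in scores.items():
        groups[modality_of[name]].append((abs(score), name))

    selected = []
    for modality in sorted(groups):
        ranked = sorted(groups[modality], key=lambda item: item[0], reverse=True)
        n_keep = math.ceil(len(ranked) / 2)
        selected.extend(name for _, name in ranked[:n_keep])
    return selected
```

Selecting within each modality, rather than over the pooled feature list, guarantees that every modality contributes at least one feature to the final set.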
S240, using the determined feature set to generate a carefully selected total available data set;
The information on the dimensions in the determined feature set is extracted from the total available data set, yielding two carefully selected total available data sets, one for biological geographic ancestry inference and one for individual biological geographic position prediction.
S250, locating the geographic origin of the biological individual based on the carefully selected total available data sets;
S251, for the biological geographic ancestry inference problem, using the determined feature set: the samples to be tested in the carefully selected total available data set for ancestry inference are taken as test samples, and all reference samples in that data set are taken as training samples (sample types are distinguished by means of the sample type.txt file). Using a process similar to the Lasso multinomial logistic regression method in S221, the probabilities that the sample to be tested belongs to each ancestry category of the reference samples are obtained, in the form $(\mathrm{prob}_1, \mathrm{prob}_2, \ldots, \mathrm{prob}_p)$, where p is the number of distinct biological ancestry categories in the reference samples and the probabilities sum to 1.
S252, for the biological geographic position prediction problem, using the determined feature set: the samples to be tested in the carefully selected total available data set for individual biological geographic position prediction are taken as test samples, and all reference samples in that data set are taken as training samples (sample types are distinguished by means of the sample type.txt file). Using a process similar to the LightGBM method in S222, the longitude and the latitude of the biological geographic position of the sample to be tested are obtained in two successive runs, and the predicted biological geographic position coordinates are finally given in the form (x_longitude).
In summary, the method for locating the geographic origin of a biological individual provided by the embodiment of the invention takes as its data basis high-density SNP typing obtained by sequencing or chip technology, together with native-place information that has been surveyed or declared by the user. It calculates biological genetic information in multiple modalities, merges it, and, after standardization and normalization, establishes a unified multi-modal genetic information data set based on high-density SNPs. Because the biological geographic ancestry inference problem and the biological geographic position prediction problem have different requirements, feature importance analysis and feature screening are carried out on the data of the different modalities with Lasso multinomial logistic regression and with the LightGBM method respectively, so that the geographic origin of the sample to be tested is located and two alternative results are given.
The prediction method can be applied in many different fields, for example in forensic genetics to identify the origin of suspects in cross-regional crimes, and in population genetics to help explain population migration.
It will be understood that equivalents and modifications will occur to those skilled in the art in light of the present teachings and concepts, and all such modifications and substitutions are intended to be included within the scope of the present invention as defined in the accompanying claims.

Claims (10)

1. A method for locating a geographic source of a biological individual based on multimodal genetic information, comprising:
obtaining SNP (single nucleotide polymorphism) typing data of a reference sample and corresponding biological ancestor origin or biological geographic position; obtaining SNP typing data of a sample to be tested; combining SNP typing data of a sample to be detected and SNP typing data of a reference sample, and marking sample types to obtain combined total SNP typing data;
performing principal component analysis on the combined total SNP typing data to obtain information on all PC dimensions, selecting some of the PC dimensions according to explained variance, and obtaining the information on the selected PC dimensions; performing ancestral component analysis on the combined total SNP typing data to obtain ancestral component proportion information, and selecting part of the ancestral component proportion information according to the mean and standard deviation of the ancestral component proportions; calculating ancestral homologous fragment length information for the combined total SNP typing data, and selecting part of the ancestral homologous fragment length information according to the mean and standard deviation of the ancestral homologous fragment lengths;
combining the three modalities, namely the information on the selected PC dimensions, the selected ancestral component proportion information and the selected ancestral homologous fragment length information, to obtain a combined multi-modal data set; standardizing and normalizing the combined multi-modal data set to obtain a total available data set;
dividing the reference samples in the total available data set into a training set and a test set, taking all dimensions in the total available data set as features, and performing feature importance evaluation with the training set and the test set; determining a feature set based on the feature importance evaluation; extracting the information on the dimensions in the feature set from the total available data set to obtain a selected total available data set;
locating the geographic origin of the biological individual based on the selected total available data set.
2. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein the principal component analysis of the combined total SNP typing data is specifically: principal component analysis is performed on the combined total SNP typing data using the PCA function of the GCTA software.
3. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein the ancestral component analysis of the combined total SNP typing data is specifically: ancestral component analysis is performed on the combined total SNP typing data using the ADMIXTURE software.
4. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein the calculation of the ancestral homologous fragment length information of the combined total SNP typing data is specifically: the SNP typing data of the reference samples are phased (homologous chromosome separation) using the EAGLE software to obtain typing results on each chromosome; based on the typing results on each chromosome, identical DNA sequences between all pairs of reference samples are identified using the GERMLINE software, and the lengths of the detected IBD fragments are recorded; after the IBD fragments on all chromosomes are merged, the average IBD fragment length between each reference sample and the other reference samples is calculated by the ERSA software according to the biological ancestry categories of the reference samples; the SNP typing data of the sample to be tested are phased using the EAGLE software to obtain typing results on each chromosome; based on the typing results on each chromosome, identical DNA sequences between the sample to be tested and each reference sample are identified using the GERMLINE software, and the lengths of the detected IBD fragments are recorded; after the IBD fragments on all chromosomes are merged, the average IBD fragment length between each sample to be tested and the reference samples is calculated by the ERSA software according to the biological ancestry categories.
5. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein selecting some of the PC dimensions according to explained variance and obtaining the information on the selected PC dimensions comprises: calculating the explained variance of each PC dimension from the information on all PC dimensions, selecting the leading PC dimensions whose cumulative explained variance exceeds 60%, and obtaining the information on those leading PC dimensions.
6. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein the standardization and normalization of the combined multi-modal data set is specifically: standardizing and then normalizing each of the three modalities in the combined multi-modal data set separately to obtain results for the three modalities, and then standardizing and normalizing the results of the three modalities together.
7. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 6, wherein standardizing and then normalizing each of the three modalities in the combined multi-modal data set separately is specifically: standardizing and then normalizing the information on the selected PC dimensions, the selected ancestral component proportion information and the selected ancestral homologous fragment length information separately, so that the mean is 0 and the variance is 1, and the values are then compressed to between 0 and 1.
8. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein dividing the reference samples in the total available data set into a training set and a test set is specifically: the reference samples in the total available data set are divided into a training set and a test set by stratified sampling.
9. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein taking all dimensions in the total available data set as features and performing feature importance evaluation with the training set and the test set is specifically: taking all dimensions in the total available data set as features, and performing feature importance evaluation with the training set and the test set based on the Lasso multinomial logistic regression method;
and locating the geographic origin of the biological individual based on the selected total available data set is specifically: taking the samples to be tested in the selected total available data set as test samples and all reference samples in the selected total available data set as training samples, and obtaining, by the Lasso multinomial logistic regression method, the probabilities that the sample to be tested belongs to each ancestry category of the reference samples.
10. The method for locating the geographic origin of a biological individual based on multi-modal genetic information according to claim 1, wherein taking all dimensions in the total available data set as features and performing feature importance evaluation with the training set and the test set is specifically: taking all dimensions in the total available data set as features, and performing feature importance evaluation with the training set and the test set based on the LightGBM method;
and locating the geographic origin of the biological individual based on the selected total available data set is specifically: taking the samples to be tested in the selected total available data set as test samples and all reference samples in the selected total available data set as training samples, and obtaining, by the LightGBM method in two successive runs, the longitude and the latitude of the biological geographic position of the sample to be tested.
CN202311850019.0A 2023-12-28 2023-12-28 Biological individual geographic source positioning method based on multi-modal genetic information Pending CN117789822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311850019.0A CN117789822A (en) 2023-12-28 2023-12-28 Biological individual geographic source positioning method based on multi-modal genetic information


Publications (1)

Publication Number Publication Date
CN117789822A true CN117789822A (en) 2024-03-29

Family

ID=90381435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311850019.0A Pending CN117789822A (en) 2023-12-28 2023-12-28 Biological individual geographic source positioning method based on multi-modal genetic information

Country Status (1)

Country Link
CN (1) CN117789822A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination