CN115966259B

CN115966259B - Sample homology detection and verification method and system based on logistic regression modeling

Info

Publication number: CN115966259B
Application number: CN202211678658.9A
Authority: CN
Inventors: 朱燕萍; 谢剑邦; 郑晖; 林健; 曹野
Original assignee: Nanjing Puenrui Biotechnology Co ltd
Current assignee: Nanjing Puenrui Biotechnology Co ltd
Priority date: 2022-12-26
Filing date: 2022-12-26
Publication date: 2023-10-13
Anticipated expiration: 2042-12-26
Also published as: CN115966259A

Abstract

The invention discloses a sample homology detection and verification method and system based on logistic regression modeling, wherein the method comprises the following steps: obtaining two gene files in VCF format; filtering and screening SNPs in the two gene files according to preset filtering and screening conditions to obtain filtered and screened SNPs; calculating first association parameters of the two samples; calculating the transition-to-transversion ratio of the two gene files and the pre-fitting consistency index; calculating mutation frequencies of the filtered and screened SNPs; when a first condition is met, performing linear fitting on the mutation frequencies of the filtered and screened SNPs and determining a plurality of calculation parameters, including the inter-sample mutation stability coefficient, the post-fitting consistency index, the fitting slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient, the number of fitting iterations, the intra-group correlation coefficient and the proportion of SNPs found in the population SNP library; performing logistic regression modeling based on the plurality of calculation parameters; and predicting whether the samples are homologous based on the logistic regression modeling.

Description

Sample homology detection and verification method and system based on logistic regression modeling

Technical Field

The invention relates to the technical field of sequencing sample detection, in particular to a sample homology detection and verification method and system based on logistic regression modeling.

Background

Common methods for detecting sample homology in next-generation sequencing (NGS) data include the following two:

The first detection method judges sample homology by comparing the repeat numbers of specific STRs (short tandem repeats) in different samples. Specifically, the number of tandem repeats of each STR is calculated in the data of both samples; identical repeat numbers indicate that both samples are derived from the same individual. This method has several disadvantages, chiefly high cost and poor efficiency: the capture intervals of common commercial kits do not contain all known stable STR regions, so a separate design scheme is needed to calculate STR repeat numbers, which increases the design cost; batch and quality problems in NGS sequencing data can leave STR regions undetected, which biases the result and affects the judgment; the STR repeat-number analysis is independent of the standard NGS analysis flow and must be run separately each time, which lengthens the analysis period; and NGS data is prone to errors in continuously repeated regions and in regions of high GC (guanine and cytosine) content, so STRs located in these regions can lead to inaccurate results.

The second detection method determines sample homology by calculating the correlation of the mutation frequencies of specific SNPs (single nucleotide polymorphisms) in different samples. Specifically, the mutation frequencies of the specific SNPs are calculated in each of the two samples, and the correlation between these mutation frequencies is then calculated; the stronger the correlation, the higher the homology between the samples. This method has several disadvantages: the SNPs used are designated in advance and are not necessarily fully covered by the sequenced panel, which makes the calculation result inaccurate; and the application range is limited, since the method is accurate only on a fixed panel.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a sample homology detection and verification method and system based on logistic regression modeling. Provided only that the two samples were sequenced with the same method, or that a large number of SNPs overlap between the two samples, the method can directly use the VCF (Variant Call Format) files generated by the standard NGS analysis flow, automatically acquire dynamic SNP information from the different files, and determine the sample homology analysis result by combining parameter evaluation with logistic regression modeling.

The invention provides a sample homology detection and verification method based on logistic regression modeling, which is characterized by comprising the following steps of:

s1, acquiring two gene files, wherein the two gene files are in a VCF format;

s2, filtering and screening SNPs in the two gene files according to preset filtering and screening conditions to obtain filtered and screened SNPs;

S3, calculating a first parameter and a second parameter of the samples corresponding to the two gene files based on the filtered and screened SNPs, wherein the first parameter is the transition-to-transversion ratio, and the second parameter is the pre-fitting consistency index (primary c-index);

S4, calculating mutation frequencies of the filtered and screened SNPs; determining, based on the transition-to-transversion ratio and the pre-fitting consistency index (primary c-index) meeting a first condition, to perform linear fitting on the mutation frequencies of the filtered and screened SNPs, and determining a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index (fitting c-index), the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations (iterations), the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage;

s5, carrying out logistic regression modeling based on the plurality of calculation parameters;

s6, predicting whether the samples are homologous based on logistic regression modeling.

Preferably, the preset filtering and screening conditions in S2 include one or more of a first condition, a second condition, a third condition, and a fourth condition, wherein the first condition is deleting SNPs having a total sequencing depth of less than 10X; the second condition is deleting SNPs located on sex chromosomes; the third condition is retaining SNPs with heterozygous mutations; and the fourth condition is retaining SNPs supported by more than 4 reads.

Preferably, the step of calculating the transition-to-transversion ratio in S3 includes:

calculating the transition-to-transversion ratio of each of the two samples, thereby obtaining two ratios;

the step of calculating the pre-fitting consistency index (primary c-index) in S3 includes:

determining the number of useful pairs, which includes: if there are n observed individuals, the total number of pairs is the combination number $C_n^2 = n(n-1)/2$; two types of pairs are excluded based on an exclusion criterion, namely pairs in which an individual does not reach the observation endpoint because of insufficient observation time, and pairs in which neither individual reaches the observation endpoint; the remaining pairs are the useful pairs, and counting them gives the number of useful pairs;

determining, among the useful pairs, the number of pairs in which the predicted result is consistent with the actually observed result; wherein agreement of the predicted result with the actually observed result indicates that the actual observed time of the corresponding individual is greater than a first threshold, and disagreement indicates that the actual observed time of the corresponding individual is less than the first threshold;

calculating the pre-fitting consistency index as the quotient of the number of pairs in which the predicted result and the actually observed result agree and the number of useful pairs.

Preferably, determining to perform linear fitting on the mutation frequencies of the filtered and screened SNPs based on the transition-to-transversion ratio and the pre-fitting consistency index (primary c-index) meeting the first condition includes: if the absolute value of the difference between the transition-to-transversion ratios of the two samples is smaller than 0.1, performing linear fitting, otherwise not performing linear fitting; and if the pre-fitting consistency index (primary c-index) is greater than or equal to 0.7, performing linear fitting, otherwise not performing linear fitting.

The linear fitting of the mutation frequencies of the filtered and screened SNPs comprises:

S41, regarding the two gene files as a first sample and a second sample, extracting the data of the two samples and counting the mutation frequencies of the SNPs; if one sample has a certain SNP and the other sample does not, marking the mutation frequency of that SNP in the sample lacking it as 0;

S42, for each SNP, recording its mutation frequencies in the two samples as x and y, performing linear fitting using the least squares method, and obtaining after fitting the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, and the post-fitting Pearson coefficient fitting_pearson; when the fitting slope lies within [0.9, 1.1], the coefficient of determination R² of the fitting equation is greater than 0.9, and the post-fitting Pearson coefficient of the mutation frequencies of the same SNPs in the two samples is greater than 0.9, the fitting is successful; otherwise, the fitting fails;

S43, if the fitting is determined to be successful, outputting the SNPs at that moment, and calculating the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index (fitting c-index), the intra-group correlation coefficient fitting_ICC, the number of fitting iterations (iterations) and the population SNP library proportion common_snps_percentage;

S44, if the fitting is determined to have failed, defining the mutation frequency of an SNP in one sample as $Fa_n$ and the mutation frequency of the same SNP in the other sample as $Fb_n$, so that the difference between the mutation frequencies of the two samples for that SNP is $I = |Fa_n - Fb_n|$; an initial threshold k is given at the same time; when I > k, deleting the SNP and returning to steps S42 and S43;

S45, if the fitting is still determined to have failed, reducing the threshold k according to a first decreasing rule and repeating step S44 until a first iteration-count threshold is reached; if the fitting has not succeeded by then, the overall fitting is determined to have failed, the statistics are recorded as 0, and the sequencing samples are determined to come from different sources.

Preferably, the initial threshold k = 0.5; the first decreasing rule is k = k - 0.01; and the first iteration-count threshold ranges from 30 to 50.
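As a non-limiting illustration, the iterative procedure of steps S42 to S45 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the function names `linear_fit`, `fit_ok` and `iterative_fit`, the default of 40 iterations (chosen from the stated 30 to 50 range) and the requirement of at least three SNPs per fit are hypothetical choices made for the example.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (slope, r_squared, pearson)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0, 0.0  # degenerate input: no variance in one sample
    pearson = sxy / math.sqrt(sxx * syy)
    # For a simple linear regression, R^2 equals the squared Pearson coefficient.
    return sxy / sxx, pearson ** 2, pearson

def fit_ok(slope, r_squared, pearson):
    """Success criteria of step S42."""
    return 0.9 <= slope <= 1.1 and r_squared > 0.9 and pearson > 0.9

def iterative_fit(pairs, k=0.5, step=0.01, max_iterations=40):
    """pairs: list of (vaf_a, vaf_b). Returns (success, iterations, remaining_pairs)."""
    current = list(pairs)
    iterations = 0
    while True:
        # Assumption: a fit is only attempted when at least three SNPs remain.
        if len(current) >= 3 and fit_ok(*linear_fit(*zip(*current))):
            return True, iterations, current
        if iterations >= max_iterations:
            return False, iterations, []  # overall failure: statistics recorded as empty
        # S44: delete SNPs whose VAF difference I = |Fa_n - Fb_n| exceeds k.
        current = [(a, b) for a, b in current if abs(a - b) <= k]
        k -= step  # S45: first decreasing rule k = k - 0.01
        iterations += 1
```

In this sketch the returned iteration count plays the role of the eighth parameter; a failed run returns an empty SNP list, mirroring the rule that the statistics are recorded as 0.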

Preferably, calculating the population SNP library proportion common_snps_percentage in S43 includes: calculating the proportion of the fitted SNPs that are found in the population SNP library, this proportion being common_snps_percentage; the specific steps of constructing the population SNP library comprise:

acquiring gnomAD data, including downloading data in a genome library and an exome library, respectively;

forming a gene file based on the gnomAD data;

filtering SNPs loci in the gene files corresponding to the genome library based on the first data filtering standard and the second data filtering standard to obtain a first result file;

filtering SNP loci in the gene file corresponding to the exome library based on the first data filtering standard and the second data filtering standard to obtain a second result file;

acquiring an intersection of the first result file and the second result file as the crowd SNPs library;

wherein the first data filtering standard is that the frequency AF across all populations is greater than or equal to 0.01, and the second data filtering standard is that the frequency AF_eas in the East Asian population is greater than or equal to 0.01.
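By way of illustration only, the construction of the population SNP library and the calculation of common_snps_percentage can be sketched as below. The in-memory representation (a dict keyed by a `(chrom, pos, ref, alt)` tuple with `"AF"` and `"AF_eas"` entries) and the function names are assumptions made for the example; reading the actual gnomAD files is outside the sketch.

```python
def filter_sites(sites, min_af=0.01, min_af_eas=0.01):
    """Keep sites meeting both frequency standards.

    sites: dict mapping (chrom, pos, ref, alt) -> {"AF": float, "AF_eas": float}
    (hypothetical layout; a missing frequency is treated as 0).
    """
    return {
        key
        for key, info in sites.items()
        if info.get("AF", 0.0) >= min_af and info.get("AF_eas", 0.0) >= min_af_eas
    }

def build_population_library(genome_sites, exome_sites):
    """Intersection of the filtered genome and exome result sets."""
    return filter_sites(genome_sites) & filter_sites(exome_sites)

def common_snps_percentage(fitted_snps, library):
    """Proportion of the fitted SNPs that are present in the population library."""
    fitted = set(fitted_snps)
    if not fitted:
        return 0.0
    return len(fitted & library) / len(fitted)
```

For example, if two SNPs remain after fitting and one of them is in the library, `common_snps_percentage` returns 0.5.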

Preferably, the step S5 of performing logistic regression modeling based on the plurality of calculation parameters includes: performing logistic regression modeling based on the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index (fitting c-index), the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations (iterations), the intra-group correlation coefficient fitting_ICC and the population SNP library proportion common_snps_percentage, and increasing the weight of the difference parameters, wherein the logistic regression modeling comprises the following steps:

S51, based on the inter-sample mutation stability coefficient mut_c, the fitting slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations (iterations), the intra-group correlation coefficient fitting_ICC and the population SNP library proportion common_snps_percentage, dividing the samples into a modeling data set and an independent sample set according to a first proportion; randomly sampling the modeling data set N times, each time dividing it into training samples and test samples according to a second proportion; the test samples form a test set;

S52, obtaining logistic regression models by performing logistic regression modeling M times, and using each logistic regression model to predict the corresponding test set and the independent sample set to obtain predicted values of the test set and the independent sample set;

S53, first-round model screening, which comprises comparing the predicted values of the test set and the independent sample set with the real values, calculating the post-fitting consistency index (fitting c-index) and the accuracy based on the comparison results, and performing the first-round model screening based on the post-fitting consistency index and the accuracy;

S54, second-round model screening, which comprises performing cluster analysis on the predicted values of the models remaining after the first round, and selecting the models whose predicted values for the homologous sample group cluster near 0.9 and whose predicted values for the non-homologous sample group cluster near 0.1;

S55, third-round model screening, which comprises counting the non-zero coefficients of each model remaining after the second round, and selecting a plurality of groups of models such that the non-zero coefficients of the models cover all coefficients and the training sets of the models cover all training samples;

S56, combining the plurality of groups of models obtained by the third-round model screening as a final model, wherein the final model is used for sample homology detection.
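As a non-limiting illustration of steps S51 and S52 only, the following sketch splits labeled feature rows and trains a plain logistic regression by batch gradient descent. It is an assumption-laden toy: the helper names, the learning rate, the epoch count and the two-feature example are hypothetical, and the weighting of difference parameters and the three screening rounds of S53 to S55 are not implemented.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    """Predicted probability of homology; the last weight is the bias."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns weights plus bias."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def split(rows, fraction, rng):
    """Shuffle a copy of rows and cut it at the given fraction (S51-style split)."""
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]

# Toy rows: ((R^2, fitting_pearson), label), label 1 = homologous.
rows = [((0.98, 0.99), 1), ((0.95, 0.97), 1), ((0.99, 0.99), 1),
        ((0.20, 0.40), 0), ((0.10, 0.30), 0), ((0.30, 0.50), 0)]
modeling, independent = split(rows, 0.8, random.Random(0))
weights = train_logistic([x for x, _ in rows], [label for _, label in rows])
```

A predicted value above 0.5 is read here as "homologous"; that cut-off is a choice of the sketch, not a value stated in this document.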

In a second aspect of the present invention, there is provided a sample homology detection system based on logistic regression modeling, comprising:

The gene acquisition module is used for acquiring two gene files;

the filtering and screening module is used for respectively filtering and screening SNPs in the two gene files according to preset filtering and screening conditions to obtain filtered and screened SNPs;

the correlation parameter module is used for calculating a first parameter and a second parameter of the samples corresponding to the two gene files based on the filtered and screened SNPs, wherein the first parameter is the transition-to-transversion ratio, and the second parameter is the pre-fitting consistency index (primary c-index);

the linear fitting module is used for calculating mutation frequencies of the filtered and screened SNPs; determining, based on the transition-to-transversion ratio and the pre-fitting consistency index (primary c-index) meeting a first condition, to perform linear fitting on the mutation frequencies of the filtered and screened SNPs, and determining a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index (fitting c-index), the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations (iterations), the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage;

the logistic regression modeling module is used for performing logistic regression modeling based on the plurality of calculation parameters;

and the homology judging module is used for predicting whether the samples are homologous or not based on logistic regression modeling.

A third aspect of the invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being for reading the instructions and performing the method according to the first aspect.

A fourth aspect of the invention provides a computer readable storage medium storing a plurality of instructions readable by a processor and for performing the method of the first aspect.

The sample homology detection method and system based on logistic regression modeling provided by the invention have the following beneficial effects:

Provided only that the two samples were sequenced with the same method, or that a large number of SNPs overlap between them, the VCF files generated by the standard NGS analysis flow can be used directly, dynamic SNPs can be obtained automatically from the different files, and sample homology analysis is carried out by combining parameter evaluation with logistic regression modeling. The logistic regression modeling increases the weight of the difference parameters and reduces model bias. The method reduces detection cost, shortens the analysis period, greatly improves efficiency and reduces the statistical error of NGS data; its judgment results are accurate and its application range is wide, since it is not limited to a specific panel and can be applied more easily to commercial kits.

Drawings

Fig. 1 is a schematic flow chart of a sample homology detection method based on logistic regression modeling.

Fig. 2 is a data flow diagram of a sample homology detection method based on logistic regression modeling for two samples provided by the invention.

Fig. 3 is a schematic diagram of a sample homology detection system based on logistic regression modeling.

Fig. 4 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.

Detailed Description

In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: processor, memory and display screen. Wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.

The processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by invoking data stored in the memory.

The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets, or instruction sets.

In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.

Example 1

Referring to fig. 1, in one aspect, the present invention provides a method for detecting sample homology based on logistic regression modeling, including:

S1, acquiring two gene files, wherein the two gene files are in VCF format. VCF is a text file format for describing SNP (single-base variation), INDEL (insertion/deletion) and SV (structural variation) results. It is best supported by the GATK software, and files in VCF format can also be obtained through SAMtools. A VCF file is divided into two parts: an annotation part whose lines begin with "#", and a body part whose lines do not. Each row in the body part represents the information of one variant, which includes:

CHROM, the contig on which the variant was called, corresponding to chr1, … chr22 in the case of the human whole genome;

POS, the position of the variant site relative to the reference genome, which for an INDEL is the position of its first base;

ID, the ID of the variant: if the called SNP is present in the dbSNP database, the corresponding rs number in dbSNP; if not, "." is used;

REF and ALT, the base at the variant site in the reference genome and the corresponding base in the target genome (the variant), respectively;

QUAL, a Phred-scaled quality value, which can be understood as the quality of the variant site: the larger the value, the more likely it is that the variant really exists;

FILTER, the filtering result for the site: "PASS" when the variant passes the filters, and "." when no filtering has been applied.
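For illustration, the fields listed above can be read from the body part of a VCF file with a few lines of code. This is a minimal sketch that only splits the tab-separated columns; the function names `parse_vcf_line` and `read_vcf` are hypothetical, and a production pipeline would normally rely on an established VCF parser instead.

```python
def parse_vcf_line(line):
    """Parse the first seven fixed columns of one VCF body line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_status = fields[:7]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,                      # "." when the variant has no ID
        "REF": ref,
        "ALT": alt.split(","),                 # several ALT alleles are comma-separated
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filter_status,               # "PASS", "." or failed filter names
    }

def read_vcf(lines):
    """Skip annotation lines (starting with '#') and parse every body line."""
    return [
        parse_vcf_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
```

The function accepts any iterable of text lines, so it can be fed an open file handle as well as a list of strings.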

In this embodiment, two gene files are taken as an example, namely gene file 1 and gene file 2 shown in fig. 2. Those skilled in the art will understand that more gene files can be used; however, given the requirements of the logistic regression modeling that the invention must also satisfy, it is most suitable to test the homology of no more than four gene files at the same time.

S2, filtering and screening SNPs in the plurality of gene files according to preset filtering and screening conditions to obtain the filtered and screened SNPs.

In this embodiment, each of the screening results is initially represented by a vector, and the final multiple screening results are obtained after preprocessing the initial screening result represented by the vector to screen out the low-frequency false points.

In this embodiment, single nucleotide polymorphisms (SNPs) refer to DNA sequence diversity at the genomic level caused by variation of a single nucleotide. SNPs are known, heritable and detectable genetic markers; because genetic polymorphisms are correlated with disease, they can be used for the positioning, cloning and identification of disease genes and for studying the influence of the SNPs themselves on the body. The invention focuses on their use in sample homology detection.

As a preferred embodiment, the preset filtering and screening conditions include one or more of a first condition, a second condition, a third condition, and a fourth condition, wherein the first condition is deleting SNPs having a total sequencing depth of less than 10X; the second condition is deleting SNPs located on sex chromosomes; the third condition is retaining SNPs with heterozygous mutations; and the fourth condition is retaining SNPs supported by more than 4 reads.
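The four conditions can be illustrated with a short sketch. The record layout (a dict with the keys `chrom`, `depth`, `genotype` and `alt_reads`) and the function names are assumptions made for the example, and the sketch applies all four conditions together although the embodiment allows any one or more of them.

```python
SEX_CHROMOSOMES = {"X", "Y", "chrX", "chrY"}

def is_heterozygous(genotype):
    """True for diploid genotypes with two different called alleles, e.g. '0/1'."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]

def passes_filters(snp):
    """Apply the four filtering and screening conditions to one SNP record."""
    return (
        snp["depth"] >= 10                        # condition 1: drop total depth < 10X
        and snp["chrom"] not in SEX_CHROMOSOMES   # condition 2: drop sex-chromosome SNPs
        and is_heterozygous(snp["genotype"])      # condition 3: keep heterozygous SNPs
        and snp["alt_reads"] > 4                  # condition 4: more than 4 supporting reads
    )

def filter_snps(snps):
    """Return only the SNP records that pass all four conditions."""
    return [snp for snp in snps if passes_filters(snp)]
```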

S3, calculating a first parameter and a second parameter of the samples corresponding to the two gene files based on the filtered and screened SNPs, wherein the first parameter is the transition-to-transversion ratio, and the second parameter is the pre-fitting consistency index (primary c-index).

The step of calculating the first parameter, the transition-to-transversion ratio, comprises the following:

The bases of nucleotides are divided into two classes according to their ring structure: purines, including adenine A and guanine G (two rings), and pyrimidines, including cytosine C and thymine T (one ring). If the number of rings remains the same after a DNA base substitution, the substitution is called a transition, such as substitution of adenine A for guanine G or cytosine C for thymine T, i.e., purine for purine or pyrimidine for pyrimidine; if the number of rings changes, it is called a transversion, such as substitution of adenine A for cytosine C, or thymine T for guanine G, i.e., purine for pyrimidine or pyrimidine for purine. A transition does not change the base class, whereas a transversion does. During evolution, transitions occur at a much higher frequency than transversions. Across the genome, the ratio of transition frequency to transversion frequency is about 2. In protein coding regions this ratio can exceed 3, since a transition is less likely than a transversion to change the amino acid encoded by a codon; therefore, the invention uses the ratio of transition frequency to transversion frequency to identify the protein coding region and then carries out the homology determination.

In this embodiment, all mutation types in the VCF gene file are counted, and the base transitions and transversions are tallied separately, where ti denotes transitions and tv denotes transversions.
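A minimal sketch of the ti/tv count is given below. It assumes that each SNP is available as a `(ref, alt)` pair of single bases and skips anything else; the function names and the choice to return `None` when no transversion is present are assumptions of the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def is_transition(ref, alt):
    """Purine <-> purine or pyrimidine <-> pyrimidine substitution."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)

def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs. Returns ti / tv, or None if tv == 0."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip INDELs, symbolic alleles and non-variant rows
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

For the four substitutions A>G, C>T, A>C and G>T the sketch counts two transitions and two transversions and returns 1.0.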

The second parameter is the pre-fit consistency index primary c-index of the sample.

The second parameter, the C-index (concordance index, consistency index), is essentially an estimate of the probability that the predicted result is consistent with the actually observed result. The pre-fitting consistency index (primary c-index) lies between 0.5 and 1 (for a random pairing, the probabilities of consistency and inconsistency are both exactly 0.5). A value of 0.5 indicates that agreement between the predicted and observed results is no better than chance, i.e., the model has no predictive effect on homology; a value of 1 indicates that the predicted result of the model for homology is completely consistent with the actually observed result.

In this embodiment, following common practice, a primary c-index of 0.50-0.70 indicates low accuracy, 0.71-0.90 indicates moderate accuracy, and above 0.90 indicates high accuracy.

The step of calculating the second parameter, the pre-fitting consistency index (primary c-index), comprises:

determining the number of useful pairs, which includes: if there are n observed individuals, the total number of pairs is the combination number $C_n^2 = n(n-1)/2$; two types of pairs are excluded based on an exclusion criterion, namely pairs in which an individual does not reach the observation endpoint because of insufficient observation time, and pairs in which neither individual reaches the observation endpoint; the remaining pairs are the useful pairs, and counting them gives the number of useful pairs;

pre-fitting consistency index (primary c-index) = number of pairs in which the predicted result and the actually observed result agree / number of useful pairs.
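The pairwise counting described above can be sketched as follows. The sketch follows the conventional concordance-index construction and rests on several assumptions that this document does not spell out: the first exclusion rule is read as "the individual with the shorter observation time did not reach the endpoint", pairs with tied observation times are skipped, a tied prediction counts as half a concordant pair, and a pair is concordant when the predicted ordering matches the observed ordering. How observation times and endpoints map onto SNP data is not specified here, so the inputs are generic.

```python
from itertools import combinations

def c_index(times, reached_endpoint, predicted):
    """Concordance index over useful pairs; returns None if there are no useful pairs.

    times: observed times; reached_endpoint: booleans; predicted: predicted times.
    """
    useful = 0
    concordant = 0.0
    for a, b in combinations(range(len(times)), 2):
        if times[a] == times[b]:
            continue  # assumption: tied observation times are not usable
        short, long_ = (a, b) if times[a] < times[b] else (b, a)
        # Excludes both pair types: shorter-time individual (and hence possibly
        # both individuals) did not reach the observation endpoint.
        if not reached_endpoint[short]:
            continue
        useful += 1
        if predicted[short] < predicted[long_]:
            concordant += 1.0
        elif predicted[short] == predicted[long_]:
            concordant += 0.5
    return concordant / useful if useful else None
```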

S4, calculating mutation frequencies of the filtered and screened SNPs; determining, based on the transition-to-transversion ratio and the pre-fitting consistency index (primary c-index) meeting a first condition, to perform linear fitting on the mutation frequencies of the filtered and screened SNPs, and determining a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index (fitting c-index), the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations (iterations), the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage.

Calculating the mutation frequencies (variant allele frequency, VAF) of the filtered and screened SNPs, i.e., the proportion of the mutant allele among the alleles, and performing linear fitting on these mutation frequencies. In this embodiment, after the two samples are filtered, the VAFs of the paired SNPs are obtained, and successive linear fitting is performed on the two groups of corresponding VAF values.
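As an illustration, the VAF of one SNP and the pairing of VAFs across two samples can be sketched as below, using the rule that an SNP absent from one sample is given a mutation frequency of 0 there. Computing the VAF as supporting reads divided by total depth, the dict-based sample layout and the function names are assumptions of the sketch.

```python
def vaf(alt_reads, total_depth):
    """Variant allele frequency, taken here as alt-supporting reads / total depth."""
    return alt_reads / total_depth if total_depth > 0 else 0.0

def pair_vafs(sample_a, sample_b):
    """Pair VAFs of the same SNP in two samples.

    sample_a, sample_b: dicts mapping an SNP key, e.g. (chrom, pos, ref, alt), to its VAF.
    An SNP missing from one sample is assigned VAF 0 in that sample.
    """
    keys = sorted(set(sample_a) | set(sample_b))
    return [(key, sample_a.get(key, 0.0), sample_b.get(key, 0.0)) for key in keys]
```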

As a preferred embodiment, determining in S4 to perform linear fitting on the mutation frequencies of the filtered and screened SNPs, based on the transition-to-transversion ratio and the pre-fitting consistency index (primary c-index) meeting the first condition, includes: if the absolute value of the difference between the transition-to-transversion ratios of the two samples is smaller than 0.1, performing linear fitting, otherwise not performing linear fitting; and if the pre-fitting consistency index (primary c-index) is greater than or equal to 0.7, performing linear fitting, otherwise not performing linear fitting.
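The first condition can be expressed as a small gate function. This sketch uses the absolute-difference form of the ratio test (the difference between the two samples' transition-to-transversion ratios) and treats the two tests as jointly required; the function and parameter names are hypothetical.

```python
def should_fit(titv_a, titv_b, primary_c_index, max_titv_diff=0.1, min_c_index=0.7):
    """Return True when linear fitting should be performed.

    titv_a, titv_b: transition-to-transversion ratios of the two samples.
    primary_c_index: pre-fitting consistency index.
    """
    return abs(titv_a - titv_b) < max_titv_diff and primary_c_index >= min_c_index
```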

S41, regarding the two gene files as a first sample and a second sample, extracting the data of the two samples and counting the mutation frequency VAF of each SNP; if one sample has a certain SNP and the other sample does not, marking the VAF of that SNP in the sample lacking it as 0;

S42, for each SNP, recording its mutation frequencies in the two samples as x and y, performing linear fitting using the least squares method, and obtaining after fitting the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, and the post-fitting Pearson coefficient fitting_pearson; when the fitting slope lies within [0.9, 1.1], the coefficient of determination R² of the fitting equation is greater than 0.9, and the post-fitting Pearson coefficient of the mutation frequencies of the same SNPs in the two samples is greater than 0.9, the fitting is successful; otherwise, the fitting fails.

In this embodiment, the least squares method is used to solve the linear regression; it finds the best function match for the data by minimizing the sum of squared errors. The purpose of the least squares method is to find the functional relationship y = f(x) between the independent variable x and the dependent variable y, where x and y represent the mutation frequency VAF values of the same SNPs in the two samples. This functional relationship determines a straight line, which is the fitted line; the objective of the least squares method is to minimize the sum of squared errors.

In this embodiment, the Pearson correlation coefficient is a coefficient used to measure whether two data sets lie on a single line, that is, to measure the linear correlation between interval variables. It is defined as follows: if (x, y) is a two-dimensional random variable, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.

If the Pearson correlation coefficient equals 0, there is no linear correlation between x and y, although it cannot be said that there is no correlation at all. The larger the absolute value of the Pearson correlation coefficient, the stronger the correlation: the closer the coefficient is to 1 or -1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. A Pearson correlation coefficient in (0.8, 1.0] indicates extremely strong correlation; in (0.6, 0.8], strong correlation; in (0.4, 0.6], moderate correlation; in (0.2, 0.4], weak correlation; and in [0.0, 0.2], extremely weak correlation or no correlation.
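Steps S41 and S42 can be sketched in plain Python as follows. This is an illustrative sketch only: the SNP keys, the function names, and the choice of a fitted line with an intercept are assumptions, since the text does not say whether the line is forced through the origin.

```python
import math

def pair_vafs(sample_a: dict, sample_b: dict):
    """S41: align VAFs by SNP key; a SNP absent from one sample gets VAF 0."""
    keys = sorted(set(sample_a) | set(sample_b))
    x = [sample_a.get(k, 0.0) for k in keys]
    y = [sample_b.get(k, 0.0) for k in keys]
    return keys, x, y

def linear_fit(x, y):
    """S42: ordinary least squares y = slope * x + intercept.

    Returns (slope, intercept, r2, pearson).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("VAFs are constant; the fit is undefined")
    slope = sxy / sxx
    intercept = my - slope * mx
    # Pearson r: covariance divided by the product of standard deviations.
    pearson = sxy / math.sqrt(sxx * syy)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / syy
    return slope, intercept, r2, pearson

def fit_succeeded(slope, r2, pearson):
    """Success criteria from S42: slope in [0.9, 1.1], R^2 > 0.9, Pearson > 0.9."""
    return 0.9 <= slope <= 1.1 and r2 > 0.9 and pearson > 0.9
```

The three thresholds in `fit_succeeded` are the ones stated in S42; everything else is scaffolding for the illustration.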

S43, if the fitting is determined to be successful, outputting the SNPs at that point, and calculating the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index fitting c-index, the intra-group correlation coefficient fitting_ICC, the number of fitting iterations iterations, and the population SNP library proportion common_snps_percentage.

The inter-sample mutation stability coefficient mut_c is calculated to measure the mutation difference between samples, and is expressed as a coefficient in the interval (0, 1). The inter-sample mutation stability coefficient mut_c is calculated as follows:

mut_c = -1 / lg|Diff_iv|;

wherein Diff_iv is the difference between the transition-to-transversion ratios of the two samples.
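As a quick numerical illustration of the formula above, the sketch below assumes that lg denotes the base-10 logarithm and that |Diff_iv| lies strictly between 0 and 1, which is what keeps the result positive; the function name is hypothetical.

```python
import math

def mut_c(diff_iv: float) -> float:
    """Inter-sample mutation stability coefficient: -1 / lg|Diff_iv|.

    Assumes lg is log base 10. For 0 < |Diff_iv| < 0.1 (the range allowed by
    the first condition) the result falls in the open interval (0, 1).
    """
    d = abs(diff_iv)
    if not 0.0 < d < 1.0:
        raise ValueError("|Diff_iv| must lie strictly between 0 and 1")
    return -1.0 / math.log10(d)
```

For example, a difference of 0.01 gives lg(0.01) = -2 and therefore mut_c = 0.5; smaller differences give values closer to 0.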

ICC stands for intraclass correlation coefficient, that is, the intra-group correlation coefficient. It is one of the reliability coefficients used to assess inter-observer reliability and test-retest reliability. The ICC value equals the individual variability divided by the total variability, so it lies between 0 and 1, where 0 represents no reliability and 1 represents full reliability. It is generally considered that a reliability coefficient below 0.4 indicates poor reliability and one above 0.75 indicates good reliability; higher ICC values are often required for quantitative data.
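The text does not state which ICC form is used. As one possible reading, the sketch below computes the one-way random-effects ICC(1,1), treating each SNP as a subject and its two VAF values as two measurements; this choice and the function name are assumptions. Note that this sample estimator can be negative when the two measurements disagree more than chance would predict.

```python
def icc_oneway(x, y):
    """One-way random-effects ICC(1,1) for paired measurements x[i], y[i].

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW) with k = 2 measurements per SNP.
    """
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired measurements")
    k = 2
    means = [(a + b) / 2 for a, b in zip(x, y)]
    grand = sum(means) / n
    # Between-subject mean square (variability between SNPs).
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    # Within-subject mean square (variability between the two samples).
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, means)) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("all measurements are identical; ICC is undefined")
    return (msb - msw) / denom
```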

S44, if the fitting is determined to have failed, defining the mutation frequency VAF of an SNP in one sample as Fa_n and the mutation frequency VAF of the same SNP in the other sample as Fb_n, so that the difference between the mutation frequencies of that SNP in the two samples is I = |Fa_n - Fb_n|; an initial threshold k is given at the same time; when I > k, the SNP is deleted, and the procedure returns to steps S42 and S43.

S45, if the fitting is still determined to have failed, reducing the threshold k according to a first decreasing rule and continuing to perform step S44 until a first number-of-times threshold is reached, at which point the overall fitting is determined to have failed, the statistics are recorded as 0, and the sequencing samples are determined to be non-homologous.

As a preferred embodiment, the initial threshold k=0.5.

In a preferred embodiment, the first decreasing rule is k = k - 0.01.

In a preferred embodiment, the first number-of-times threshold is in the range of 30-50, preferably 40.
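The loop formed by S42, S44 and S45 can be sketched as follows, using the preferred values k = 0.5, a decrement of 0.01 and a limit of 40 rounds. This is a hedged illustration: the function names, the returned dictionary, and the requirement of at least three remaining SNPs before a fit is attempted are assumptions that the text does not specify.

```python
import math

def fit_stats(pairs):
    """Least-squares slope, R^2 and Pearson r for (x, y) pairs; None if undefined."""
    n = len(pairs)
    if n < 3:  # assumption: require at least 3 SNPs for a meaningful fit
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    pearson = sxy / math.sqrt(sxx * syy)
    # For a straight line with intercept, R^2 equals the squared Pearson r.
    return sxy / sxx, pearson ** 2, pearson

def iterative_fit(x, y, k0=0.5, step=0.01, max_rounds=40):
    """Refit after dropping SNPs with |Fa_n - Fb_n| > k, shrinking k each round."""
    pairs = list(zip(x, y))
    for iteration in range(max_rounds + 1):
        stats = fit_stats(pairs)
        if stats and 0.9 <= stats[0] <= 1.1 and stats[1] > 0.9 and stats[2] > 0.9:
            return {"success": True, "iterations": iteration, "slope": stats[0],
                    "r2": stats[1], "pearson": stats[2], "kept": len(pairs)}
        k = k0 - step * iteration  # first decreasing rule: k = k - 0.01
        pairs = [(a, b) for a, b in pairs if abs(a - b) <= k]  # S44: drop I > k
    # S45: overall fitting failure, statistics recorded as 0.
    return {"success": False, "iterations": 0, "slope": 0.0,
            "r2": 0.0, "pearson": 0.0, "kept": 0}
```

Computing k from the round number, rather than subtracting 0.01 repeatedly, avoids accumulating floating-point error across rounds.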

As a preferred embodiment, calculating the population SNP library proportion common_snps_percentage in S43 includes: calculating the proportion of the fitted SNPs that are present in the population SNP library, this proportion being common_snps_percentage.

In this example, the population SNP library is built with reference to data from the Genome Aggregation Database (gnomAD), a database that collects and harmonizes exome and genome sequencing data from a variety of large-scale sequencing projects.

The specific steps of constructing the population SNP library comprise:

acquiring gnomAD data: downloading the data of the genome library and the exome library (v2.1.1, based on GRCh37) respectively from the gnomAD website http://www.gnomad-sg.org/;

Forming a gene file based on the gnomAD data;

filtering SNPs loci in the gene files corresponding to the genome library respectively based on the first data filtering standard and the second data filtering standard to obtain a first result file;

acquiring the intersection of the first result file and a second result file as the population SNP library, wherein the first data filtering standard is ref (population frequency) greater than or equal to 0.01, and the second data filtering standard is AF_eas (East Asian population frequency) greater than or equal to 0.01.
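The library construction and the resulting proportion can be sketched with plain sets. This is a rough illustration under several assumptions: the second result file is taken to come from the exome library (the text does not describe how it is produced), each locus is represented as a hypothetical key mapped to a pair (population frequency, East Asian frequency), and the proportion is computed over the fitted SNPs.

```python
def build_population_library(genome_records: dict, exome_records: dict,
                             min_freq: float = 0.01) -> set:
    """Keep loci passing both frequency filters in each source, then intersect.

    Each records dict maps a locus key to (population_freq, east_asian_freq).
    """
    def passing(records):
        return {key for key, (freq, freq_eas) in records.items()
                if freq >= min_freq and freq_eas >= min_freq}
    return passing(genome_records) & passing(exome_records)

def common_snps_percentage(fitted_snps, library: set) -> float:
    """Fraction of the fitted SNPs that are present in the population library."""
    fitted = set(fitted_snps)
    if not fitted:
        return 0.0
    return len(fitted & library) / len(fitted)
```

In practice the records would be parsed from the gnomAD VCF files; the parsing step is omitted here.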

S5, carrying out logistic regression modeling based on the plurality of calculation parameters.

In this embodiment, performing logistic regression modeling based on the plurality of calculation parameters in S5 includes: performing logistic regression modeling based on the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index fitting c-index, the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations iterations, the intra-group correlation coefficient fitting_ICC, and the population SNP library proportion common_snps_percentage, and increasing the weight of the difference parameters, wherein the logistic regression modeling comprises the following steps:

S51, based on the inter-sample mutation stability coefficient mut_c, the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations iterations, the intra-group correlation coefficient fitting_ICC, and the population SNP library proportion common_snps_percentage, dividing the samples into a modeling data set and an independent sample set according to a first ratio; the modeling data set is randomly sampled N times and each time divided into training samples and test samples according to a second ratio, the test samples constituting a test set. In this embodiment, the first ratio is 8:2; N is 100,000 random samplings; and the second ratio is 7:3;

S52, obtaining logistic regression models after modeling M times based on logistic regression, and using each logistic regression model to predict the corresponding test set and the independent sample set, so as to obtain predicted values for the test set and the independent sample set. In this embodiment, M is 100,000;

S53, first-round model screening: comparing the predicted values of the test set and the independent sample set with the true values, and calculating the post-fitting consistency index fitting c-index and the accuracy based on the comparison results; and performing the first-round model screening based on the post-fitting consistency index fitting c-index and the accuracy. The area under the curve of the model is the AUC (Area Under Curve). In this embodiment, the AUC is defined as the area enclosed by the ROC curve and the coordinate axes beneath it; this area cannot exceed 1, and because the ROC curve generally lies above the line y = x, the AUC ranges between 0.5 and 1. The closer the AUC is to 1, the higher the authenticity of the detection method; when the AUC equals 0.5, the authenticity is lowest and the method has no application value. The ROC curve, whose full name is the receiver operating characteristic curve, is plotted with the true positive rate (sensitivity) on the ordinate and the false positive rate (1 - specificity) on the abscissa, according to a series of different binary classification cut-offs (demarcation values or decision thresholds). The AUC is obtained by summing the areas of the parts under the ROC curve.

S55, third-round model screening: counting the non-zero coefficients of each model remaining after the second-round model screening, and screening out several groups of models such that the non-zero coefficients of the models together cover all coefficients and the training sets of the models together cover all training samples; specifically, classifying each model remaining after the second-round model screening according to its non-zero coefficients, and selecting from them, as the third-round screening result, the models that cover more than 50% of the sample points while together covering all training sample sets, thereby reducing the bias of the models;

and S56, combining the several groups of models obtained by the third-round model screening into a final model, wherein the final model is used for detecting the sample homology in S6.
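Two pieces of the steps above lend themselves to a short sketch: the random partitioning in S51 (8:2, then 7:3) and the AUC described in S53, computed by summing trapezoids under the ROC curve. The function names and the seeded random generator are illustrative assumptions, not part of the described method.

```python
import random

def split(items, first_fraction, rng):
    """Shuffle a copy of items and cut it at the given fraction."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

def roc_auc(labels, scores):
    """Area under the ROC curve, summing trapezoids between successive thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative labels")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    prev_tpr = prev_fpr = auc = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Samples sharing a score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc

# Example partition: 8:2 into modeling / independent, then 7:3 into train / test.
rng = random.Random(0)
modeling, independent = split(range(100), 0.8, rng)
train, test = split(modeling, 0.7, rng)
```

In the described method the 7:3 split would be repeated N times on the modeling data set, producing one candidate model per repetition.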

Logistic regression is an algorithm proposed to solve the binary classification problem; it gives a conditional probability distribution on the assumption that the data obey the Bernoulli distribution, and solves for the optimal parameters by maximum likelihood estimation.

Assume that a linear relationship exists between a latent response variable $y_k^*$ and the independent variable $x_k$, namely:

$$y_k^* = \alpha + \beta x_k + \varepsilon_k$$

where $\alpha$ and $\beta$ are two constant coefficients.

A critical point is assumed to exist:

when $y_k^* > 0$, then $y_k = 1$;

when $y_k^* < 0$, then $y_k = 0$.

Namely:

$$P(y_k = 1 \mid x_k) = P(\alpha + \beta x_k + \varepsilon_k > 0) = P\big(\varepsilon_k > -(\alpha + \beta x_k)\big) = F(\alpha + \beta x_k)$$

wherein $F(\cdot)$ is the cumulative distribution function of the error term $\varepsilon_k$, and the last equality uses the symmetry of the distribution about 0. The error term $\varepsilon_k$ is assumed to obey the Logistic distribution or the standard normal distribution. When $\varepsilon_k$ obeys the Logistic distribution, the Logistic regression model is obtained; when $\varepsilon_k$ obeys the standard normal distribution, the Probit model is obtained. In the Logistic regression model, the variance of the error term is chosen so that the cumulative distribution function yields a simpler formula.

Thus the Logistic regression model is:

$$p_k = P(y_k = 1 \mid x_k) = \frac{1}{1 + e^{-(\alpha + \beta x_k)}}$$

By logarithmic transformation, the above model becomes:

$$\ln\left(\frac{p_k}{1 - p_k}\right) = \alpha + \beta x_k$$

wherein $p_k$ is the probability that the event occurs for the k-th case, which is a nonlinear function of the explanatory variable $x_k$. When there are M independent variables, the Logistic regression model for a binary dependent variable is:

$$p_k = \frac{e^{\alpha + \sum_{m=1}^{M} \beta_m x_{mk}}}{1 + e^{\alpha + \sum_{m=1}^{M} \beta_m x_{mk}}}$$

By logarithmic transformation, the above model becomes:

$$\ln\left(\frac{p_k}{1 - p_k}\right) = \alpha + \sum_{m=1}^{M} \beta_m x_{mk}$$

where k = 1, 2, ..., K; m = 1, 2, ..., M;
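The model with M independent variables can be illustrated with a small maximum likelihood fit. The sketch below uses plain gradient ascent on the mean log-likelihood; the optimizer, learning rate, epoch count and function names are assumptions made for illustration, and the described method does not state how the coefficients are actually estimated.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(alpha, betas, row):
    """p_k = sigmoid(alpha + sum_m beta_m * x_mk)."""
    return sigmoid(alpha + sum(b * v for b, v in zip(betas, row)))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Estimate (alpha, betas) by gradient ascent on the mean log-likelihood."""
    n, m = len(X), len(X[0])
    alpha, betas = 0.0, [0.0] * m
    for _ in range(epochs):
        grad_alpha, grad_betas = 0.0, [0.0] * m
        for row, label in zip(X, y):
            err = label - predict_proba(alpha, betas, row)  # y_k - p_k
            grad_alpha += err
            for j, v in enumerate(row):
                grad_betas[j] += err * v
        alpha += lr * grad_alpha / n
        betas = [b + lr * g / n for b, g in zip(betas, grad_betas)]
    return alpha, betas
```

Each row of `X` would hold the calculation parameters for one sample pair (for example fitting_slope and fitting_pearson), and `y` would hold 1 for homologous and 0 for non-homologous pairs.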

After the model estimation is completed, it is necessary to evaluate whether the model effectively describes the response variable and how well the model matches the observed data. When the predicted values of the model are highly consistent with the corresponding observed values, the model is considered to fit the data; otherwise the model cannot be accepted and must be re-specified. Let $\hat{L}_s$ be the maximum likelihood value estimated by the set model; it summarizes how well the sample data are fitted by this model. Let $\hat{L}_f$ be the maximum likelihood value of the saturated model; a reference model must be used as the standard for comparing the goodness of fit of the set model on the same set of data, and that reference is the saturated model. $\hat{L}_s / \hat{L}_f$ is called the likelihood ratio. Multiplying the natural logarithm of the likelihood ratio by -2 forms a statistic that, when the sample is large enough, follows a χ² distribution whose degrees of freedom equal the number of covariate types minus the number of coefficients in the set model. This statistic is called the deviance, generally denoted by D:

$$D = -2\ln\left(\frac{\hat{L}_s}{\hat{L}_f}\right) = -2\left(\ln\hat{L}_s - \ln\hat{L}_f\right)$$

When $\hat{L}_s$ is small relative to $\hat{L}_f$, a larger D value is obtained, indicating that the model fits poorly; conversely, when $\hat{L}_s$ is approximately equal to $\hat{L}_f$, the D value is small and the set model fits well.
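The deviance can be computed directly from predicted probabilities. The sketch below assumes ungrouped binary data, for which the saturated model reproduces every observation exactly and its log-likelihood is 0; the function names are hypothetical.

```python
import math

def log_likelihood(labels, probs):
    """Bernoulli log-likelihood of binary labels given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(labels, probs))

def deviance(labels, probs, saturated_log_likelihood=0.0):
    """D = -2 * (ln L_set - ln L_saturated).

    For ungrouped binary data the saturated log-likelihood is 0 (an assumption
    made explicit through the default argument).
    """
    return -2.0 * (log_likelihood(labels, probs) - saturated_log_likelihood)
```

A model whose predicted probabilities sit closer to the observed labels yields a smaller D, matching the interpretation given above.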

Example two

Referring to fig. 3, there is provided a sample homology detection system based on logistic regression modeling, comprising: a gene acquisition module 101, used for acquiring two gene files, wherein the two gene files are in VCF format; a filtering and screening module 102, used for respectively filtering and screening the SNPs in the two gene files according to preset filtering and screening conditions to obtain filtered and screened SNPs; a correlation parameter module 103, configured to calculate a first parameter and a second parameter of the samples corresponding to the two gene files based on the SNPs after filtering and screening, where the first parameter is the transition-to-transversion ratio and the second parameter is the pre-fitting consistency index primary c-index; a linear fitting module 104, configured to calculate the mutation frequencies of the SNPs after filtering and screening, to determine, based on the transition-to-transversion ratio and the pre-fitting consistency index primary c-index satisfying a first condition, to perform linear fitting on the mutation frequencies of the SNPs after filtering and screening, and to determine a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index fitting c-index, the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations iterations, the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage; a logistic regression modeling module 105, for performing logistic regression modeling based on the plurality of calculation parameters; and a homology determination module 106, for predicting whether the sequencing samples are homologous based on the logistic regression modeling.

The system may implement the detection method provided in the first embodiment, and the specific detection method may be referred to the description in the first embodiment, which is not repeated here.

The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.

As shown in fig. 4, the present invention further provides an electronic device, including a processor 301 and a memory 302 connected to the processor 301, where the memory 302 stores a plurality of instructions, and the instructions may be loaded and executed by the processor, so that the processor can execute the method according to the first embodiment.

Embodiment and verification example under specific application scenario:

As shown in Tables 1 and 2, the test was performed using 291 sample pairs (88 pairs of homologous sample data, as shown in Table 1, and 203 pairs of non-homologous sample data, as shown in Table 2). The test comprises: calculating the transition-to-transversion ratios of the two gene files and the pre-fitting consistency index primary c-index; and, after linear fitting, determining a plurality of calculation parameters, including the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index fitting c-index, the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations iterations, the intra-group correlation coefficient fitting_ICC, and the population SNP library proportion common_snps_percentage.

TABLE 1 sample homology sets

TABLE 2 sample non-homologous groups

Tables 3 and 4 were obtained by separately compiling statistics on the homologous samples and the non-homologous samples using the method of the present invention.

TABLE 3 Homologous sample statistics

Total number of homologous groups	Predicted homologous groups	Predicted uncertain groups	Predicted non-homologous groups
88	88	0	0

TABLE 4 Non-homologous sample statistics

Total number of non-homologous groups	Predicted homologous groups	Predicted uncertain groups	Predicted non-homologous groups
203	0	0	203

As calculated from Table 3 and Table 4, the accuracy of the method reaches 100% for the homologous groups tested and 100% for the non-homologous groups tested.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A sample homology detection and verification method based on logistic regression modeling is characterized by comprising the following steps:

S1, acquiring two gene files, wherein the two gene files are in VCF format;

S4, calculating the mutation frequencies of the SNPs after filtering and screening; determining, based on the transition-to-transversion ratio and the pre-fitting consistency index primary c-index satisfying a first condition, to perform linear fitting on the mutation frequencies of the SNPs after filtering and screening, and determining a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index fitting c-index, the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations iterations, the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage;

S6, predicting whether the samples are homologous based on the logistic regression modeling;

wherein the step S5 of performing logistic regression modeling based on the plurality of calculation parameters includes: performing logistic regression modeling based on the inter-sample mutation stability coefficient mut_c, the post-fitting consistency index fitting c-index, the fitting slope fitting_slope, the coefficient of determination R² of the fitting equation, the post-fitting Pearson coefficient fitting_pearson, the number of fitting iterations iterations, the intra-group correlation coefficient fitting_ICC, and the population SNP library proportion common_snps_percentage, and increasing the weight of the difference parameters, wherein the logistic regression modeling comprises the following steps:

2. The method for detecting and checking sample homology based on logistic regression modeling according to claim 1, wherein the preset filtering and screening conditions in S2 include one or more of a first condition, a second condition, a third condition, and a fourth condition, wherein the first condition is deleting SNPs having a total sequencing depth of less than 10X; the second condition is deleting SNPs with mutations on sex chromosomes; the third condition is retaining SNPs with heterozygous mutations; and the fourth condition is retaining SNPs supported by more than 4 reads.

3. The method for detecting and checking sample homology based on logistic regression modeling according to claim 1, wherein the step of calculating the first parameter, namely the transition-to-transversion ratio, in S3 comprises:

the step of calculating the pre-fitting consistency index primary c-index in S3 includes:

4. The method for detecting and checking sample homology based on logistic regression modeling according to claim 1, wherein determining, based on the transition-to-transversion ratio and the pre-fitting consistency index primary c-index satisfying the first condition, to perform linear fitting on the mutation frequencies of the SNPs after filtering and screening comprises: if the absolute value of the difference between the transition-to-transversion ratios is smaller than 0.1, performing linear fitting, otherwise not performing linear fitting; and if the pre-fitting consistency index primary c-index is greater than or equal to 0.7, performing linear fitting, otherwise not performing linear fitting;

S45, if the fitting is still determined to have failed, reducing the threshold k according to a first decreasing rule and continuing to perform step S44 until a first number-of-times threshold is reached, at which point the overall fitting is determined to have failed, the statistics are recorded as 0, and the sequencing samples are determined to be non-homologous.

5. The method for detecting and checking sample homology based on logistic regression modeling according to claim 4, wherein the initial threshold k = 0.5; the first decreasing rule is k = k - 0.01; and the first number-of-times threshold ranges from 30 to 50.

6. The method for detecting and checking sample homology based on logistic regression modeling according to claim 5, wherein calculating the population SNP library proportion common_snps_percentage in S43 includes: calculating the proportion of the fitted SNPs that are present in the population SNP library, this proportion being common_snps_percentage; the specific steps of constructing the population SNP library comprise:

forming a gene file based on the gnomAD data;

7. A sample homology detection and verification system based on logistic regression modeling, for implementing the detection and verification method according to any one of claims 1 to 6, comprising:

The gene acquisition module (101) is used for acquiring two gene files, wherein the two gene files are in a VCF format;

the filtering and screening module (102) is used for respectively filtering and screening SNPs in the two gene files according to preset filtering and screening conditions to obtain filtered and screened SNPs;

the correlation parameter module (103) is used for calculating a first parameter and a second parameter of the samples corresponding to the two gene files based on the SNPs after filtering and screening, wherein the first parameter is the transition-to-transversion ratio, and the second parameter is the pre-fitting consistency index primary c-index;

a linear fitting module (104) for calculating the mutation frequencies of the SNPs after filtering and screening; determining, based on the transition-to-transversion ratio and the pre-fitting consistency index primary c-index satisfying a first condition, to perform linear fitting on the mutation frequencies of the SNPs after filtering and screening, and determining a plurality of calculation parameters after the linear fitting; the plurality of calculation parameters includes a third parameter, a fourth parameter, a fifth parameter, a sixth parameter, a seventh parameter, an eighth parameter, a ninth parameter, and a tenth parameter; the third parameter is the inter-sample mutation stability coefficient mut_c, the fourth parameter is the post-fitting consistency index fitting c-index, the fifth parameter is the fitting slope fitting_slope, the sixth parameter is the coefficient of determination R² of the fitting equation, the seventh parameter is the post-fitting Pearson coefficient fitting_pearson, the eighth parameter is the number of fitting iterations iterations, the ninth parameter is the intra-group correlation coefficient fitting_ICC, and the tenth parameter is the population SNP library proportion common_snps_percentage;

a logistic regression modeling module (105) for logistic regression modeling based on the plurality of calculation parameters;

a homology determination module (106) for predicting whether the samples are homologous based on logistic regression modeling.

8. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor configured to read the instructions and perform the detection and verification method of any one of claims 1-6.

9. A computer readable storage medium storing a plurality of instructions readable by a processor and for performing the detection verification method according to any one of claims 1-6.