WO2021098615A1 - Method, device and server for filling missing genotype data - Google Patents

Method, device and server for filling missing genotype data

Info

Publication number
WO2021098615A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
sample
dimension
missing
value
Application number
PCT/CN2020/128853
Inventor
殷力
殷鹏
Original Assignee
中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021098615A1

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The present invention relates to the technical field of gene prediction, and in particular to a method, device and server for filling missing genotype data.
  • Missing genetic data caused by SNP (single nucleotide polymorphism) chip sequencing poses a major challenge to genome-wide association analysis.
  • Missing genotype data is divided into genetic loss and detectable loss.
  • When analyzing missing genotypes, technical missingness rather than artificial missingness is generally discussed. It mainly arises from the following causes: whole-genome resequencing, simplified (reduced-representation) genome sequencing, exome sequencing, target region capture sequencing, SNP chips, and so on.
  • The embodiments of the present invention provide a method, device and server for filling missing genotype data, to solve the problems of low filling efficiency and a high error rate in the predicted gene filling values.
  • the first aspect of the embodiments of the present invention provides a method for filling missing genotype data, including:
  • each of the gene samples includes several gene values that are randomly covered;
  • a padding value that pre-fills the gene missing position is generated; the padding value carries the parameters corresponding to the dynamic linkage relationship;
  • Each pre-filled gene sample is input into the missing gene prediction model, the gene value prediction is performed on the gene missing position according to the pre-filled filling value, and a complete gene sample filled with the predicted gene value is output.
  • generating a padding value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located includes:
  • a filling value carrying the parameter is generated to pre-fill the gene missing position.
  • inputting the pre-filled gene sample into the missing gene prediction model, performing gene value prediction on the gene missing position according to the pre-filled padding value, and outputting the complete gene sample filled with the predicted gene value includes:
  • the N decoders perform gene value prediction and alignment output on the missing position of the gene where each filling value is located, to obtain the complete gene sample filled with the predicted gene value.
  • the obtaining genetic data of several different individuals from the gene bank to generate several genetic samples includes:
  • after the gene value prediction is performed on the gene missing position according to the pre-filled padding value and a complete gene sample filled with the predicted gene value is output, the method further includes:
  • a gradient is calculated backwards according to the complete gene sample and the original value of the corresponding gene sample, and the parameters of the missing gene prediction model are updated through the gradient.
  • determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position includes:
  • the jump at any sample dimension in the physical sequence dimension t of the gene dimension and the jump at any physical sequence dimension of the gene in the sample dimension d are both defined in terms of $s_t$ and $s_d$, where $s_t$ is the position corresponding to the physical sequence dimension t of the gene dimension in the preset scale matrix, and $s_d$ is the position corresponding to the sample dimension d in the preset scale matrix.
  • calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample where the gene missing position is located, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, includes:
  • $\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
  • $\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
  • $\gamma_t$ is the physical sequence dimension parameter of the gene dimension
  • $\gamma_d$ is the sample dimension parameter
  • $W_\gamma$ is the preset parameter
  • $b_\gamma$ is the offset.
  • after the gene value prediction is performed on the gene missing position according to the pre-filled padding value and a complete gene sample filled with the predicted gene value is output, the method further includes:
  • a second aspect of the embodiments of the present invention provides a filling device for missing genotype data, including:
  • the gene sample generation module is used to obtain the gene data of several different individuals from the gene bank to generate several gene samples; each of the gene samples includes several gene values that are randomly covered;
  • the padding value generating module is used to generate padding values for pre-filling the missing positions of the genes according to the dynamic linkage relationship between the missing positions of the genes and the gene samples for each missing position of the gene in the gene sample;
  • the padding value carries the parameters corresponding to the dynamic linkage relationship;
  • the gene prediction filling module is used to input each pre-filled gene sample into the missing gene prediction model, perform gene value prediction on the gene missing position according to the pre-filled padding value, and output a complete gene sample filled with the predicted gene value.
  • A third aspect of the embodiments of the present invention provides a server, including: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the method for filling missing genotype data of the first aspect is implemented.
  • In the method, device and server for filling missing genotype data provided by the embodiments of the present invention, several gene samples are generated by obtaining gene data of several different individuals from a gene bank, and each gene sample includes several gene values that are randomly covered.
  • For each gene missing position in a gene sample, a padding value that pre-fills the gene missing position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample where it is located; the padding value carries the parameters corresponding to the dynamic linkage relationship. Each pre-filled gene sample is input into the missing gene prediction model, gene value prediction is performed on the gene missing position according to the pre-filled padding value, and a complete gene sample filled with the predicted gene value is output.
  • Training data is generated by randomly covering the gene value in each gene sample generated from valid gene data.
  • the model can predict the gene value according to the dynamic linkage relationship corresponding to the parameter carried by the padding value, output the complete gene sample filled with the predicted gene value, and complete the training of the model.
  • the missing gene prediction model generated by training is used to predict and fill missing gene values to improve the efficiency of gene filling.
  • the model combines the dynamic linkage relationship carried by the filling value at the missing gene position to predict and fill the missing gene value to improve the accuracy of predicting the gene value.
  • FIG. 1 is a schematic flowchart of a method for filling missing genotype data provided by Embodiment 1 of the present invention.
  • FIG. 2 is a schematic structural diagram of a missing gene prediction model provided by Embodiment 1 of the present invention.
  • FIG. 3 is a schematic flowchart of a method for filling in missing genotype data provided in the second embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a filling device for missing genotype data provided in the third embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the structure of a server provided in Embodiment 4 of the present invention.
  • FIG. 1 is a schematic flowchart of a method for filling missing genotype data provided by Embodiment 1 of the present invention.
  • This embodiment can be applied to scenarios in which missing gene values in genetic data are predicted and filled.
  • The method can be executed by a processor in a filling device for missing genotype data.
  • The device can be a server, a smart terminal, a tablet, a PC, and so on; in the embodiments of this application, the filling device for missing genotype data is described as the execution subject.
  • The method specifically includes the following steps:
  • The prior art fits parameters to a gene sequence with missing values, learns the overall characteristics of the missing data, and then fills in the missing values according to those characteristics.
  • This approach requires that the missing values have a relatively small impact on the overall distribution of the data, but the number of genetic samples that have been studied and tested so far is not enough to support such a large amount of data.
  • Although genetic data is missing at random, genetic data is not expressed independently; it follows a certain rule, namely linkage disequilibrium.
  • the missing gene prediction model can be constructed by combining the data distribution in the reference genome sequence and deep learning of the genome data to realize the prediction and filling of the genotype data missing.
  • The reference genome sequence is a human genome data set collected by sampling people from various regions of the world, and the data is statistically significant for the analysis of the human genome.
  • For example, the 1000 Genomes Project selected populations from 26 countries and regions, and a total of more than 2,500 samples make up the genome data.
  • A sufficient number of gene sequences of different individuals can be obtained from a known gene bank to generate several gene samples, and several gene values are randomly covered in each gene sample, so that each gene sample has missing gene data and can serve as training data.
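  • The random covering step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 0/1/2 genotype encoding, the `MISSING` marker, the `mask_rate` value and the function name `mask_sample` are all assumptions made for the example.

```python
import random

MISSING = None  # hypothetical marker for a covered genotype value


def mask_sample(sample, mask_rate, rng):
    """Randomly cover values in one gene sample.

    Returns (masked_sample, mask), where mask[i] is 1 if position i is
    still observed and 0 if it was covered.
    """
    masked, mask = [], []
    for value in sample:
        if rng.random() < mask_rate:
            masked.append(MISSING)
            mask.append(0)
        else:
            masked.append(value)
            mask.append(1)
    return masked, mask


rng = random.Random(0)  # fixed seed so the covering is reproducible
original = [0, 1, 2, 1, 0, 0, 2, 1, 1, 0]  # assumed 0/1/2 genotype encoding
masked, mask = mask_sample(original, mask_rate=0.3, rng=rng)
```

  • Keeping `original` alongside `masked` matters for training, because the covered values are later needed as the targets against which predictions are compared.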
  • The process of obtaining the gene data of several different individuals from the gene bank and generating several gene samples may be: obtaining the gene data of several different individuals from the gene bank, and cutting the gene data into several gene samples of the same length through sliding window processing. Because a gene sequence is very long, computing over the full sequence during model training would make the calculation inefficient. Therefore, the gene sequences can optionally be split into several pieces of sample data that meet a preset standard; specifically, the gene sequences can be cut by sliding window processing to obtain several gene samples of the same length.
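  • A minimal sketch of the sliding window step, assuming a genotype sequence stored as a Python list. The function name `sliding_windows`, the `window` and `stride` values, and the choice to drop a trailing remainder shorter than the window are illustrative assumptions, not details given in the source.

```python
def sliding_windows(sequence, window, stride):
    """Cut a sequence into equal-length windows.

    A trailing remainder shorter than `window` is dropped so that every
    returned sample has exactly the same length.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, stride)]


genotypes = [0, 1, 2, 1, 0, 0, 2, 1, 1, 0]
samples = sliding_windows(genotypes, window=4, stride=3)
print(samples)  # [[0, 1, 2, 1], [1, 0, 0, 2], [2, 1, 1, 0]]
```

  • With a stride smaller than the window, consecutive samples overlap, which yields more training samples from the same sequence at the cost of redundancy.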
  • In order to perform gene prediction filling in combination with the dynamic linkage relationship between each gene missing position in a gene sample and that gene sample, before the generated gene samples are input into the missing gene prediction model for model training, a padding value that pre-fills the gene missing position is generated for each gene missing position in each gene sample, based on the dynamic linkage relationship between the gene missing position and the gene sample where it is located. As a result, every gene missing position of every gene sample holds a padding value carrying the parameters corresponding to the dynamic linkage relationship.
  • the missing gene prediction model can predict the missing gene at the missing gene position based on the dynamic linkage relationship carried by the padding value, and improve the prediction accuracy of the missing gene.
  • The specific process of generating the padding value that pre-fills the gene missing position may be: determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position; calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample where the gene missing position is located, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter; and generating, according to these two parameters, a padding value carrying the parameters to pre-fill the gene missing position.
  • The padding value is generated from the gene data distribution of the gene sample in the physical sequence dimension of the gene dimension and in the sample dimension.
  • The values of the physical sequence dimension of the gene dimension and of the sample dimension are determined according to each gene missing position, so that different sites in the gene sample where the gene missing position is located can be sampled according to the selected physical sequence dimension of the gene dimension and the selected sample dimension.
  • A cover value can be determined according to whether the gene missing position is dominant or recessive, so that the values of the physical sequence dimension of the gene dimension and of the sample dimension for the multiple sites to be selected can be determined through the cover value.
  • The process of determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position includes: for each gene missing position, determining the cover (mask) value according to whether the gene missing position is dominant or recessive. If the gene missing position is dominant, the mask value is set to 1; if it is not dominant, the mask value is set to 0. Here t is the physical sequence dimension of the gene dimension and d is the vector dimension of the sample dimension. The jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d are determined according to the set mask value, so that the filling device can sample the corresponding sites in the gene sample where the gene missing position is located according to the determined jumps $\delta_t$ and $\delta_d$.
  • The rules for determining the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d include: for each gene
  • the jump at any sample dimension in the physical sequence dimension t of the gene dimension and the jump at any physical sequence dimension of the gene in the sample dimension d are both defined in terms of $s_t$ and $s_d$, where $s_t$ is the position of the physical sequence dimension t of the gene dimension in the preset scale matrix, and $s_d$ is the position of the sample dimension d in the preset scale matrix.
  • the position at time t-2 of the preset scale matrix, and the sample dimension correspondingly samples the position of dimension d-2 of the preset scale matrix; when the mask value corresponding to the gene missing position is set to 1, the physical sequence dimension of the gene dimension correspondingly samples the position at time t-1 of the preset scale matrix, and the sample dimension correspondingly samples the position of dimension d-1 of the preset scale matrix.
  • In the preset scale matrix, the values of the elements in each row increase one by one from left to right, and the values of the elements in each column increase one by one from top to bottom.
  • The preset scale matrix can be:
  • After the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d are determined according to the set mask value, the corresponding sites in the gene sample where the gene missing position is located can be sampled according to $\delta_t$ and $\delta_d$. The data relationship of that gene sample in the determined physical sequence dimension of the gene dimension and its data relationship in the determined sample dimension are then calculated to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter.
  • The physical sequence dimension parameter of the gene dimension and the sample dimension parameter, which describe the data relationship of the gene sample where the gene missing position is located in the physical sequence dimension of the gene dimension and in the sample dimension, can be calculated by the following formulas:
  • $\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
  • $\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
  • $\gamma_t$ is the physical sequence dimension parameter of the gene dimension
  • $\gamma_d$ is the sample dimension parameter
  • $W_\gamma$ is the preset parameter
  • $b_\gamma$ is the offset.
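  • The exponential form above, a parameter equal to exp(-max(0, W·jump + b)), can be evaluated as in the following sketch. It assumes scalar values for the preset parameter and the offset purely for illustration; the names `decay`, `w` and `b` and the numeric values are hypothetical and not taken from the source.

```python
import math


def decay(delta, w, b):
    """Scalar form of exp(-max(0, w * delta + b)).

    `delta` is the jump, `w` the preset parameter and `b` the offset.
    The result always lies in the interval (0, 1].
    """
    return math.exp(-max(0.0, w * delta + b))


# With a positive w, a larger jump gives a smaller parameter,
# i.e. a more distant site contributes less.
for delta in (0, 1, 2, 5):
    print(delta, round(decay(delta, w=0.5, b=0.0), 4))
```

  • The `max(0, ...)` term keeps the exponent non-positive, so the parameter never exceeds 1 regardless of the sign of `w` or `b`.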
  • The physical sequence dimension parameter of the gene dimension and the sample dimension parameter can be used to control the values of different bases in the physical sequence dimension of the gene dimension and in the sample dimension, respectively.
  • A parameterized weighted sum is taken over the sites obtained by sampling the gene sample where the gene missing position is located according to the determined jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d, which gives the padding value carrying the parameters used to pre-fill the gene missing position.
  • The calculation formula of the padding value carrying the parameters may be:
  • S130: Input each pre-filled gene sample into the missing gene prediction model, perform gene value prediction on the gene missing position according to the pre-filled padding value, and output a complete gene sample filled with the predicted gene value.
  • Each gene sample pre-filled with padding values is input into the missing gene prediction model for model training.
  • The missing gene prediction model uses the dynamic linkage relationship, carried by the padding value, between the gene missing position and the gene sample to predict the gene value at the gene missing position where the padding value is located, thereby outputting a complete gene sample filled with the predicted gene value and completing the training of the missing gene prediction model.
  • The process of inputting each pre-filled gene sample into the missing gene prediction model, performing gene value prediction on the gene missing position according to the padding value, and outputting a complete gene sample filled with the predicted gene value includes: N encoders perform feature extraction according to the parameters carried by each padding value in the filled gene sample and output a context vector; according to the context vector, N decoders perform gene value prediction and alignment output on the gene missing position where each padding value is located, to obtain the complete gene sample filled with the predicted gene value; N ≥ 3.
  • Figure 2 shows a schematic diagram of the structure of the missing gene prediction model.
  • the missing gene prediction model may adopt a Transformer model, including N encoders 21 and N decoders 23.
  • N can be set to 3.
  • The N encoders 21 in the missing gene prediction model can be parameterized in advance according to preset multi-scale information, so that the N encoders 21 can perform multi-scale feature extraction based on the parameters carried by each padding value in the filled gene sample.
  • Each encoder includes a multi-head attention layer. After the filled gene samples are input into the N encoders, the N encoders 21 convert the gene samples into vectors and multiply them by the vectors carried by the padding values.
  • Before the context vectors output by the N encoders 21 are input to the corresponding N decoders 23, they can be dimensionally compressed by the fully connected neural network 22 to reduce the amount of data.
  • The fully connected neural network 22 may be an FNN fully connected network.
  • The compressed context vector is input to the N corresponding decoders 23, which perform gene value prediction and alignment output on it. Since the predicted gene value vector is floating-point data while the real gene value vector consists of integers, the floating-point gene value vector needs to be converted into an integer vector to obtain the complete gene sample filled with the predicted gene value.
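  • The float-to-integer conversion could look like the sketch below. The source only says that the floating-point vector is converted into an integer vector; rounding to the nearest integer and clipping to an assumed 0/1/2 genotype range are illustrative choices, and `to_genotype_vector` is a hypothetical name.

```python
def to_genotype_vector(predicted, min_value=0, max_value=2):
    """Round each float prediction to the nearest integer, then clip it
    to the valid genotype range [min_value, max_value]."""
    return [min(max_value, max(min_value, int(round(p)))) for p in predicted]


predicted = [0.08, 1.93, 1.2, -0.3, 2.6]
print(to_genotype_vector(predicted))  # [0, 2, 1, 0, 2]
```

  • Clipping guards against predictions that fall slightly outside the valid range, which would otherwise round to a value that is not a legal genotype.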
  • For a gene sample to be tested, the padding value of each gene missing position in the sample needs to be calculated first.
  • A padding value pre-filled at each gene missing position is generated, and each padding value carries the parameters corresponding to the dynamic linkage relationship.
  • The pre-filled gene sample to be tested is input into the trained missing gene prediction model, which predicts the gene value at the gene missing position where each padding value is located and outputs a complete gene sample filled with the predicted gene values.
  • In this way, the filling of missing genotype data can be completed quickly and the filling rate is increased.
  • An embodiment of the present invention provides a method for filling missing genotype data. Several gene samples are generated by obtaining gene data of several different individuals from a gene bank, and each gene sample includes several gene values that are randomly covered. For each gene missing position in a gene sample, a padding value that pre-fills the gene missing position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample where it is located; the padding value carries the parameters corresponding to the dynamic linkage relationship. Each pre-filled gene sample is input into the missing gene prediction model, gene value prediction is performed on the gene missing position according to the pre-filled padding value, and a complete gene sample filled with the predicted gene value is output. Training data is generated by randomly covering the gene value in each gene sample generated from valid gene data.
  • the model can predict the gene value according to the dynamic linkage relationship corresponding to the parameter carried by the padding value, output the complete gene sample filled with the predicted gene value, and complete the training of the model.
  • the missing gene prediction model generated by training is used to predict and fill missing gene values to improve the efficiency of gene filling.
  • the model combines the dynamic linkage relationship carried by the filling value at the missing gene position to predict and fill the missing gene value to improve the accuracy of predicting the gene value.
  • Figure 3 is a schematic flow chart of the method for filling in genotype data missing provided in the second embodiment of the present invention.
  • This embodiment also provides a process of optimizing the parameters carried by the padding values and the parameters of the missing gene prediction model in the method for filling missing genotype data, so as to further improve the accuracy of the predicted gene values.
  • the method specifically includes:
  • S210: Calculate a gradient by backpropagation according to the complete gene sample and the original values of the corresponding gene sample, and update the parameters of the missing gene prediction model through the gradient;
  • the process of predicting and filling the test gene samples with missing gene data includes two parts.
  • The first part is: for each gene missing position in the gene sample to be tested, a padding value that pre-fills the gene missing position is generated based on the dynamic linkage relationship between the gene missing position and the gene sample to be tested, and the generated padding value carries the parameters corresponding to the dynamic linkage relationship.
  • The second part is: the pre-filled gene sample to be tested is input into the trained missing gene prediction model, which predicts the gene value at each gene missing position according to the padding values in the gene sample to be tested and outputs a complete gene sample whose gene missing positions are filled with the predicted gene values, completing the filling of the gene sample to be tested.
  • iterative training can also be used to optimize the parameters in the missing gene prediction model.
  • a sufficient number of gene sequences of different individuals are obtained from a known gene bank to generate several gene samples.
  • Each gene sample is randomly covered with several gene values, so that each gene sample has gene data missing as training data.
  • The missing gene prediction model predicts a complete gene sample from each gene sample with missing genes.
  • A gradient is then calculated by backpropagation from the complete gene sample output by the model and the original values of the corresponding gene sample, so that the parameters in the missing gene prediction model are updated through the gradient until the missing gene prediction model converges.
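  • As a rough illustration of comparing the model output with the stored original values, the sketch below computes a mean squared error over the covered positions only, together with its gradient with respect to the predictions. The choice of mean squared error, the restriction to covered positions, and the name `masked_mse_and_grad` are assumptions; the source does not state which loss is used.

```python
def masked_mse_and_grad(predicted, original, covered):
    """Mean squared error over covered positions and its gradient.

    `covered[i]` is 1 where the original value was covered before the
    sample was fed to the model, and 0 elsewhere. Positions that were
    never covered contribute neither to the loss nor to the gradient.
    """
    n = sum(covered)
    if n == 0:
        return 0.0, [0.0] * len(predicted)
    loss = sum(c * (p - t) ** 2 for p, t, c in zip(predicted, original, covered)) / n
    grad = [2.0 * c * (p - t) / n for p, t, c in zip(predicted, original, covered)]
    return loss, grad


loss, grad = masked_mse_and_grad(
    predicted=[1.0, 0.5, 2.0],
    original=[1, 1, 2],
    covered=[0, 1, 1],
)
print(loss, grad)  # 0.125 [0.0, -0.5, 0.0]
```

  • In a full training loop this gradient would be propagated back through the model by an automatic differentiation framework; the hand-written version here only shows what quantity is being minimized.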
  • the original value of each gene sample can be correspondingly stored after a number of gene values are randomly covered for each gene sample.
  • S220: Pick out all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample;
  • Before a gene sample with missing genes is input into the missing gene prediction model, a padding value that pre-fills each gene missing position in the gene sample to be tested needs to be generated according to the dynamic linkage relationship between the gene missing position and the gene sample to be tested, and the generated padding value carries the parameters corresponding to the dynamic linkage relationship.
  • The parameters carried by the padding values pre-filled at the gene missing positions can also be optimized to improve the accuracy of missing gene value prediction and filling.
  • When the missing gene prediction model performs gene value prediction on any gene sample with padding values in the input training data and outputs a complete gene sample, all predicted gene values in the complete gene sample need to be picked out according to the originally covered gene sample corresponding to the output complete gene sample. All predicted gene values are then compared with the original gene values at the covered gene missing positions of the gene sample, and the weighted average of all predicted gene values and the corresponding original gene values is calculated to update the parameters carried by the padding values.
  • Both the physical sequence dimension parameter $\gamma_t$ of the gene dimension and the sample dimension parameter $\gamma_d$ include the preset parameter $W_\gamma$. All predicted gene values are compared with the original gene values at the covered gene missing positions of the gene sample, and the weighted average of all predicted gene values and the corresponding original gene values is calculated to update the preset parameter $W_\gamma$.
  • In the prediction stage, the gene sample with truly missing values takes the place of the actively covered gene vector that was input to the model in the training stage, and the final predicted gene vector is obtained through the missing gene prediction model. Because the predicted gene vector is floating-point data while the true vector consists of integers, the floating-point predicted gene vector needs to be converted into an integer vector. The predicted gene values at the truly missing positions of the gene sample are then picked out from the predicted gene values and combined with the true gene data that is not missing to obtain the final vector.
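  • The final combination step can be sketched as follows: observed values are kept unchanged, and a prediction is used only where the input was truly missing. The `None` marker for missing positions, rounding to the nearest integer, and the name `merge_predictions` are assumptions made for this example.

```python
def merge_predictions(observed, predicted, missing=None):
    """Keep every observed value; use the rounded prediction only at
    positions where the observed value is missing."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return [int(round(p)) if o is missing else o
            for o, p in zip(observed, predicted)]


observed = [0, None, 2, None, 1]         # None marks a truly missing value
predicted = [0.4, 1.1, 1.7, 1.9, 0.2]    # model output for every position
final = merge_predictions(observed, predicted)
print(final)  # [0, 1, 2, 2, 1]
```

  • Note that the last position keeps its observed value 1 even though the model predicted 0.2: predictions never overwrite data that was actually measured.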
  • An embodiment of the present invention also provides a device 4 for filling missing genotype data, which includes:
  • the gene sample generating module 401 is used to obtain gene data of several different individuals from the gene bank to generate several gene samples; each of the gene samples includes several gene values that are randomly covered;
  • When obtaining gene data of several different individuals from the gene bank to generate several gene samples, each of which includes several gene values that are randomly covered, the gene sample generation module 401 includes:
  • the interception unit is used to obtain genetic data of several different individuals from the gene bank, and intercept the genetic data into several genetic samples with the same length through sliding window processing.
  • the padding value generating module 402 is configured to generate padding values for pre-filling the missing positions of the genes according to the dynamic linkage relationship between the missing positions of the genes and the gene samples for each missing position of the gene in the gene sample;
  • the filling value carries the parameter corresponding to the dynamic linkage relationship;
  • When generating, for each gene missing position in the gene sample, a padding value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample where it is located, the padding value generating module 402 includes:
  • the value determining unit is configured to determine the values of the physical sequence dimension of the gene dimension and the sample dimension according to the position of the gene deletion;
  • the value determining unit when determining the values of the physical sequence dimension of the gene dimension and the sample dimension according to the gene deletion position, includes:
  • the jump in any one of the sample dimensions in the physical sequence dimension t of the gene dimension and the jump in the physical sequence dimension of any gene in the sample dimension d are both s t is a position corresponding to the physical sequence dimension t of the gene dimension in the preset scale matrix; s d is a position corresponding to the sample dimension d in the preset scale matrix.
  • the parameter generating unit for the physical sequence dimension parameter of the gene dimension and the sample dimension parameter is configured to calculate the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, obtaining the physical sequence dimension parameter of the gene dimension and the sample dimension parameter;
  • in one implementation example, the parameter generating unit computes these parameters as:
  • $\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
  • $\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
  • where $\gamma_t$ is the physical sequence dimension parameter of the gene dimension, $\gamma_d$ is the sample dimension parameter, $W_\gamma$ is the preset parameter, and $b_\gamma$ is the offset.
  • the fill value generating unit is configured to generate a parameter-carrying fill value according to the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, so as to pre-fill the gene missing position.
  • the gene prediction filling module 403 is configured to input each pre-filled gene sample into the missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
  • in one implementation example, for this prediction and output, the gene prediction filling module 403 includes:
  • the encoding unit is configured to perform feature extraction through N encoders according to the parameter carried by each fill value in the pre-filled gene sample, and to output a context vector, where N ≥ 3;
  • the decoding unit is configured to perform, through N decoders and according to the context vector, gene value prediction and aligned output for the gene missing position where each fill value is located, obtaining the complete gene sample filled with the predicted gene values.
  • the filling device further includes:
  • the model update module is configured to compute a gradient by back-propagation from the complete gene sample and the corresponding original values of the gene sample, and to update the parameters of the missing gene prediction model with that gradient;
  • the predicted gene extraction module is used to select all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample;
  • the parameter update module is used to compare all the predicted gene values with the corresponding original gene values and update the preset parameters.
  • An embodiment of the present invention provides a device for filling missing genotype data. Gene data of several different individuals is obtained from a gene bank to generate several gene samples, each including several randomly covered gene values; for each gene missing position in the gene sample, a fill value for pre-filling that position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying the parameter corresponding to the dynamic linkage relationship; each pre-filled gene sample is input into the missing gene prediction model, the gene value at the gene missing position is predicted according to the pre-filled fill value, and a complete gene sample filled with the predicted gene values is output. Training data is generated by randomly covering gene values in each gene sample generated from valid gene data.
  • fill values carrying the parameters of the dynamic linkage relationship between the gene missing position and the gene sample are generated and used to fill the training data, so that after the training data including the fill values is input into the missing gene prediction model, the model can predict gene values according to the dynamic linkage relationship corresponding to the parameters carried by the fill values, output complete gene samples filled with the predicted gene values, and complete the training of the model.
  • using the trained missing gene prediction model to predict and fill missing gene values improves the efficiency of gene filling.
  • the model predicts and fills missing gene values in combination with the dynamic linkage relationship carried by the fill value at each gene missing position, which improves the accuracy of the predicted gene values.
  • Fig. 5 is a schematic diagram of the structure of a server provided in the fourth embodiment of the present invention.
  • the server includes: a processor 1, a memory 2, and a computer program 3 stored in the memory 2 and executable on the processor 1, such as a program implementing the method for filling missing genotype data.
  • when the processor 1 executes the computer program 3, the steps in the above embodiments of the method for filling missing genotype data are implemented, such as steps S110 to S130 shown in FIG. 1.
  • for example, the computer program 3 may be divided into one or more modules, and the one or more modules are stored in the memory 2 and executed by the processor 1 to complete the present application.
  • the one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 3 in the server.
  • for example, the computer program 3 can be divided into a gene sample generation module, a fill value generation module, and a gene prediction filling module.
  • the specific functions of each module are as follows:
  • the gene sample generation module is used to obtain the gene data of several different individuals from the gene bank to generate several gene samples; each of the gene samples includes several gene values that are randomly covered;
  • the fill value generating module is configured to generate, for each gene missing position in the gene sample, a fill value for pre-filling that position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located;
  • the fill value carries the parameter corresponding to the dynamic linkage relationship;
  • the gene prediction filling module is configured to input each pre-filled gene sample into the missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
  • the server may include, but is not limited to, a processor 1, a memory 2, and a computer program 3 stored in the memory 2.
  • those skilled in the art will understand that FIG. 5 is only an example of a server and does not constitute a limitation on the server, which may include more or fewer components than shown, combine certain components, or use different components; for example, the server may also include input and output devices, network access devices, buses, and so on.
  • the processor 1 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the memory 2 may be an internal storage unit of the server, such as a hard disk or internal memory of the server.
  • the memory 2 may also be an external storage device, such as a plug-in hard disk fitted to the server, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. Further, the memory 2 may include both an internal storage unit of the server and an external storage device.
  • the memory 2 is used to store the computer program and other programs and data required by the method for filling the missing genotype data.
  • the memory 2 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are merely illustrative.
  • for example, the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components can be combined or integrated into another system, or some features can be omitted or not executed.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc.
  • the content contained in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

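A method, device and server for filling missing genotype data, belonging to the technical field of gene prediction. The method includes: obtaining gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values (S110); for each gene missing position in the gene sample, generating a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship (S120); and inputting each pre-filled gene sample into a missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting a complete gene sample filled with the predicted gene values (S130). The method addresses the problems of low filling efficiency and a high error rate in predicted gene fill values.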

Description

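Method, device and server for filling missing genotype data
Technical Field
The present invention relates to the technical field of gene prediction, and in particular to a method, device and server for filling missing genotype data.
Background Art
The loss of gene data during SNP (single nucleotide polymorphism marker) chip sequencing poses a major challenge to genome-wide association studies. Loss of genotype data is divided into hereditary loss and detection loss. In analyzing missing genotypes, what is generally discussed is technical missingness rather than man-made missingness, which mainly arises from the following causes: missingness caused by whole-genome resequencing, by reduced-representation sequencing, by exome sequencing and target-region capture sequencing, and by SNP chips.
In the prior art, a parameter is generally fitted to gene sequences containing missing values to learn the overall characteristics of the missing data, and the missing values are then filled according to those characteristics. This approach requires that the missing values have only a relatively small effect on the overall distribution of the data, but the number of gene samples currently available is not sufficient to support such a large data volume. As a result, filling efficiency is low and the predicted gene fill values have a high error rate.
Summary of the Invention
In view of this, embodiments of the present invention provide a method, device and server for filling missing genotype data, to solve the problems of low filling efficiency and a high error rate in predicted gene fill values.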
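A first aspect of the embodiments of the present invention provides a method for filling missing genotype data, including:
obtaining gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
for each gene missing position in the gene sample, generating a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
inputting each pre-filled gene sample into a missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting a complete gene sample filled with the predicted gene values.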
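In one implementation example, generating, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located includes:
determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position;
calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter;
generating a parameter-carrying fill value according to the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, so as to pre-fill the gene missing position.
In one implementation example, inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting a complete gene sample filled with the predicted gene values includes:
performing feature extraction through N encoders according to the parameter carried by each fill value in the pre-filled gene sample, and outputting a context vector, where N ≥ 3;
performing, through N decoders and according to the context vector, gene value prediction and aligned output for the gene missing position where each fill value is located, to obtain the complete gene sample filled with the predicted gene values.
In one implementation example, obtaining gene data of several different individuals from a gene bank to generate several gene samples includes:
obtaining gene data of several different individuals from the gene bank, and truncating the gene data into several gene samples of equal length through sliding-window processing.
In one implementation example, after inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting the complete gene sample filled with the predicted gene values, the method further includes:
computing a gradient by back-propagation from the complete gene sample and the corresponding original values of the gene sample, and updating the parameters of the missing gene prediction model with the gradient.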
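In one implementation example, determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position includes:
if the gene missing position
Figure PCTCN2020128853-appb-000001
is dominant, setting the mask value
Figure PCTCN2020128853-appb-000002
to 1; if the gene missing position
Figure PCTCN2020128853-appb-000003
is not dominant, setting the mask value
Figure PCTCN2020128853-appb-000004
to 0; where t is the physical sequence dimension of the gene dimension and d is the vector dimension of the sample dimension;
determining, according to the set mask value, the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d:
Figure PCTCN2020128853-appb-000005
Figure PCTCN2020128853-appb-000006
where the jump of any sample dimension within the physical sequence dimension t of the gene dimension, and the jump of any physical sequence dimension of the gene dimension within the sample dimension d, are both
Figure PCTCN2020128853-appb-000007
$s_t$ is a site corresponding to the physical sequence dimension t of the gene dimension in the preset scale matrix; $s_d$ is a site corresponding to the sample dimension d in the preset scale matrix.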
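In one implementation example, calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, includes:
$\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
$\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
where $\gamma_t$ is the physical sequence dimension parameter of the gene dimension, $\gamma_d$ is the sample dimension parameter, $W_\gamma$ is the preset parameter, and $b_\gamma$ is the offset.
In one implementation example, after inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting the complete gene sample filled with the predicted gene values, the method further includes:
picking out all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample;
comparing all the predicted gene values with the corresponding original gene values, and updating the preset parameter.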
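A second aspect of the embodiments of the present invention provides a device for filling missing genotype data, including:
a gene sample generating module, configured to obtain gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
a fill value generating module, configured to generate, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
a gene prediction filling module, configured to input each pre-filled gene sample into a missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
A third aspect of the embodiments of the present invention provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for filling missing genotype data of the first aspect.
In the method, device and server for filling missing genotype data provided by the embodiments of the present invention, gene data of several different individuals is obtained from a gene bank to generate several gene samples, each including several randomly covered gene values; for each gene missing position in the gene sample, a fill value for pre-filling that position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship; each pre-filled gene sample is input into the missing gene prediction model, the gene value at the gene missing position is predicted according to the pre-filled fill value, and a complete gene sample filled with the predicted gene values is output. Training data is generated by randomly covering gene values in each gene sample generated from valid gene data. Fill values carrying the parameters of the dynamic linkage relationship between the gene missing position and the gene sample are generated and used to fill the training data. Thus, after the training data including the fill values is input into the missing gene prediction model, the model can predict gene values according to the dynamic linkage relationship corresponding to the parameters carried by the fill values and output complete gene samples filled with the predicted gene values, completing the training of the model. Using the trained missing gene prediction model to predict and fill missing gene values improves gene filling efficiency. Moreover, the model predicts and fills missing gene values in combination with the dynamic linkage relationship carried by the fill value at each gene missing position, which improves the accuracy of the predicted gene values.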
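Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the method for filling missing genotype data provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of the missing gene prediction model provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic flowchart of the method for filling missing genotype data provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the device for filling missing genotype data provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention.
Detailed Description
To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
The term "include" and any variants of it in the specification, claims and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device. In addition, the terms "first", "second", "third" and so on are used to distinguish different objects, not to describe a particular order.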
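Embodiment 1
Fig. 1 is a schematic flowchart of the method for filling missing genotype data provided by Embodiment 1 of the present invention. This embodiment is applicable to scenarios in which missing gene values in gene data are predicted and filled. The method may be executed by a processor in a device for filling missing genotype data, and the device may be a server, a smart terminal, a tablet, a PC or the like. In the embodiments of the present application, the device for filling missing genotype data is described as the executing entity. The method specifically includes the following steps:
S110. Obtain gene data of several different individuals from a gene bank to generate several gene samples; each gene sample includes several randomly covered gene values.
When gene samples are actually sequenced, processing them by various technical means causes some gene data in the samples to be lost, and using gene samples with missing gene data for genome association analysis makes the research more difficult. In the prior art, a parameter is fitted to gene sequences containing missing values to learn the overall characteristics of the missing data, and the missing values are then filled according to those characteristics. This approach requires that the missing values have only a relatively small effect on the overall distribution of the data, but the number of gene samples studied and tested so far is not sufficient to support such a large data volume. Although the missing gene data is random, gene data is not expressed independently but follows a certain regularity, namely dynamic linkage disequilibrium. A missing gene prediction model can therefore be built by combining the distribution of data in the reference genome sequence with deep learning on genome data, so as to predict and fill missing genotype data.
The reference genome sequence is a human genome data set sampled from people in various regions of the world, and the data is statistically meaningful for analyzing the human genome. For example, the 1000 Genomes Project selected populations from 26 countries and regions, with more than 2,500 samples in total, to compose its genome data. Specifically, in the training stage of the missing gene prediction model, gene sequences of a sufficiently large number of different individuals can be obtained from a known gene bank to generate several gene samples. Several gene values are then randomly covered in each gene sample, so that every gene sample has missing gene data and can serve as training data.
In one implementation example, the process of obtaining gene data of several different individuals from the gene bank to generate several gene samples may be: obtaining gene data of several different individuals from the gene bank, and truncating the gene data into several gene samples of equal length through sliding-window processing. Because gene sequences are very long, continually computing over overly long gene sequences during model training would make computation inefficient. The gene sequences can therefore be split, by cutting the data, into several pieces of sample data that meet a preset standard. Optionally, the gene sequences can be truncated through sliding-window processing to obtain several gene samples of equal length.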
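The sliding-window truncation and random covering described in step S110 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the 0/1/2 genotype coding, the window length, the stride, the masking rate and the function names `sliding_windows` and `random_mask` are all assumptions introduced for the example, and `None` is used here to stand for a covered value.

```python
import random
from typing import List, Optional, Tuple


def sliding_windows(sequence: List[int], window: int, stride: int) -> List[List[int]]:
    """Cut one individual's genotype sequence into equal-length samples."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, stride)]


def random_mask(sample: List[int], rate: float,
                rng: random.Random) -> Tuple[List[Optional[int]], List[int]]:
    """Cover a random subset of values in one sample.

    Returns (masked sample, mask); in this sketch mask[i] == 1 means
    "observed" and mask[i] == 0 means "covered".
    """
    n_hidden = max(1, round(rate * len(sample)))
    hidden = set(rng.sample(range(len(sample)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(sample)]
    mask = [0 if i in hidden else 1 for i in range(len(sample))]
    return masked, mask


# Hypothetical genotype sequence coded as 0/1/2 (an assumption of this sketch).
genotypes = [0, 1, 2, 0, 0, 1, 2, 2, 1, 0]
samples = sliding_windows(genotypes, window=4, stride=2)
masked, mask = random_mask(samples[0], rate=0.25, rng=random.Random(0))
```

Keeping the mask next to the masked sample makes it straightforward to recover which positions were covered when predictions are later compared with the original values.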
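S120. For each gene missing position in the gene sample, generate a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located; the fill value carries a parameter corresponding to the dynamic linkage relationship.
To perform gene prediction filling in combination with the dynamic linkage relationship between each gene missing position and the gene sample in which it is located, before the generated gene samples are input into the missing gene prediction model for training, a fill value for pre-filling each gene missing position in each gene sample is generated according to the dynamic linkage relationship between that position and its gene sample. Each gene missing position of every gene sample thus holds a fill value carrying a parameter corresponding to the dynamic linkage relationship. When each gene sample with fill values is input into the missing gene prediction model for training, the model can predict the missing gene at the gene missing position according to the dynamic linkage relationship carried by the fill value, which improves the prediction accuracy for missing genes.
In one implementation example, the specific process of generating, for each gene missing position in the gene sample, a fill value for that position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located may be: determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position; calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter; and generating a parameter-carrying fill value according to these two parameters, so as to pre-fill the gene missing position.
The fill value is generated from the distribution of the gene sample's gene data in the physical sequence dimension of the gene dimension and in the sample dimension. Specifically, the values of the physical sequence dimension of the gene dimension and of the sample dimension are determined according to each gene missing position, so that different gene values are sampled from the gene sample containing that position according to the selected physical sequence dimension and the selected sample dimension. Optionally, a mask value can be determined from whether the gene missing position is dominant or recessive, and the mask value is then used to determine the values of the physical sequence dimension of the gene dimension and of the sample dimension for the sites to be selected.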
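In one implementation example, the process of determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position includes: for each gene missing position, determining a mask value according to whether the position is dominant or recessive. If the gene missing position
Figure PCTCN2020128853-appb-000008
is dominant, the mask value
Figure PCTCN2020128853-appb-000009
is set to 1; if the gene missing position
Figure PCTCN2020128853-appb-000010
is not dominant, the mask value
Figure PCTCN2020128853-appb-000011
is set to 0; where t is the physical sequence dimension of the gene dimension and d is the vector dimension of the sample dimension. The jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d are determined according to the set mask value, so that the filling device can sample the corresponding sites in the gene sample containing the gene missing position according to the determined $\delta_t$ and $\delta_d$. Optionally, the rules for determining $\delta_t$ and $\delta_d$ according to the set mask value are as follows:
Figure PCTCN2020128853-appb-000012
Figure PCTCN2020128853-appb-000013
where the jump of any sample dimension within the physical sequence dimension t of the gene dimension, and the jump of any physical sequence dimension of the gene dimension within the sample dimension d, are both
Figure PCTCN2020128853-appb-000014
$s_t$ is a site of the physical sequence dimension t of the gene dimension in the preset scale matrix; $s_d$ is a site of the sample dimension d in the preset scale matrix. The jumps $\delta_t$ and $\delta_d$ are determined according to the above formulas: when the mask value corresponding to the gene missing position is 0, the physical sequence dimension of the gene dimension samples the site at time t-2 of the preset scale matrix, and the sample dimension samples the site at dimension d-2 of the preset scale matrix; when the mask value corresponding to the gene missing position is 1, the physical sequence dimension of the gene dimension samples the site at time t-1 of the preset scale matrix, and the sample dimension samples the site at dimension d-1 of the preset scale matrix. Specifically, the preset scale matrix
Figure PCTCN2020128853-appb-000015
is a matrix in which the values of the elements in each row increase one by one from left to right and the values of the elements in each column increase one by one from top to bottom. Optionally, the preset scale matrix
Figure PCTCN2020128853-appb-000016
may be:
Figure PCTCN2020128853-appb-000017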
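After the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d are determined according to the set mask value, the corresponding sites can be sampled from the gene sample containing the gene missing position according to $\delta_t$ and $\delta_d$, and the data relationship of that gene sample in the determined physical sequence dimension of the gene dimension and in the determined sample dimension can be calculated, giving the physical sequence dimension parameter of the gene dimension and the sample dimension parameter. In one implementation example, these two parameters can be computed by the following formulas:
$\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
$\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
where $\gamma_t$ is the physical sequence dimension parameter of the gene dimension, $\gamma_d$ is the sample dimension parameter, $W_\gamma$ is the preset parameter, and $b_\gamma$ is the offset.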
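The two formulas above share one form, which can be sketched as a single function. This is a scalar illustration only: the source does not state the shapes of $W_\gamma$ and $b_\gamma$, so treating them as scalars, the function name `decay` and the numeric values below are assumptions of this sketch.

```python
import math


def decay(delta: float, w_gamma: float, b_gamma: float) -> float:
    """gamma = exp(-max(0, W_gamma * delta + b_gamma)); the result lies in (0, 1]."""
    return math.exp(-max(0.0, w_gamma * delta + b_gamma))


# Hypothetical jumps and parameters, chosen only to show the behaviour.
gamma_t = decay(delta=2.0, w_gamma=0.5, b_gamma=0.0)  # physical sequence dimension
gamma_d = decay(delta=0.0, w_gamma=0.5, b_gamma=0.0)  # sample dimension
```

Because of the inner `max(0, ...)`, the exponent is never positive, so the parameter equals 1 when $W_\gamma \delta + b_\gamma \le 0$ and shrinks toward 0 as the jump grows (for positive $W_\gamma$).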
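After the physical sequence dimension parameter of the gene dimension and the sample dimension parameter are computed, they can be used to control, respectively, the parameterized weighted sum of the values of different bases at the gene missing position in the physical sequence dimension of the gene dimension and in the sample dimension. This weighted sum is added to the sites sampled from the gene sample containing the gene missing position according to the determined jumps $\delta_t$ and $\delta_d$, giving the parameter-carrying fill value used to pre-fill the gene missing position. In one implementation example, the formula for the parameter-carrying fill value may be:
Figure PCTCN2020128853-appb-000018
where
Figure PCTCN2020128853-appb-000019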
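S130. Input each pre-filled gene sample into the missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
After the fill value for pre-filling each gene missing position is generated according to the dynamic linkage relationship between that position and the gene sample in which it is located, each gene sample pre-filled with fill values is input into the missing gene prediction model for training. Specifically, for the fill value pre-filled at each gene missing position, the missing gene prediction model predicts the gene value at that position in combination with the dynamic linkage relationship, carried by the fill value, between the gene missing position and its gene sample, and thereby outputs a complete gene sample filled with the predicted gene values, completing the training of the missing gene prediction model.
In one implementation example, the process of inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the fill value, and outputting a complete gene sample filled with the predicted gene values includes: performing feature extraction through N encoders according to the parameter carried by each fill value in the pre-filled gene sample, and outputting a context vector; and performing, through N decoders and according to the context vector, gene value prediction and aligned output for the gene missing position where each fill value is located, to obtain the complete gene sample filled with the predicted gene values; N ≥ 3.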
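Fig. 2 is a schematic structural diagram of the missing gene prediction model. Specifically, the missing gene prediction model may be a Transformer model including N encoders 21 and N decoders 23. Optionally, N may be set to 3. The N encoders 21 can be configured in advance according to preset multi-scale information, so that they can perform multi-scale feature extraction according to the parameter carried by each fill value in the pre-filled gene sample. Each encoder includes a multi-head attention layer. After the pre-filled gene sample is input into the N encoders, the N encoders 21 convert the gene sample into vectors and multiply them by parameter matrices whose different weights are set according to the parameters carried by the fill values, obtaining the combination of K and V vectors that represents the key-value pairs of the input vectors. From this combination, the current key-value pair is evaluated against the key-value pairs of the other vectors to characterize the relationship between the current vector and the other vectors, yielding the context vector.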
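The attention step inside each encoder can be illustrated with standard scaled dot-product attention, the building block of a multi-head attention layer in a Transformer. This is a generic single-head sketch in plain Python, not the patent's exact computation: the text names only the K and V vectors, so the query matrix `q`, the omission of learned projection matrices, and the toy input `x` are assumptions of this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_i in q:
        # Similarity of this position's query to every key.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(scores)
        # Context vector: attention-weighted sum of the value vectors.
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Toy embedded sample with three positions and two features (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(x, x, x)  # self-attention: queries, keys and values all from x
```

Each output row is a convex combination of the value vectors, so every position's context vector mixes in information from all other positions of the sample.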
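Before the context vectors output by the N encoders 21 are input into the corresponding N decoders 23, a fully connected neural network 22 can compress the dimensionality of the context vectors to reduce the data volume. Optionally, the fully connected neural network 22 may be an FNN fully connected network. The compressed context vectors are input into the N corresponding decoders 23 for gene value prediction and aligned output. Because the predicted gene value vector is floating-point data whereas the true gene value vector is an integer vector, the floating-point gene value vector must be converted into an integer vector to obtain the complete gene sample filled with the predicted gene values.
In one implementation example, when gene values must be predicted and filled for a gene sample under test that is truly missing gene values, a fill value is computed for each gene missing position in that sample. The fill value for pre-filling each gene missing position is generated according to the dynamic linkage relationship between that position and the gene sample under test, and each fill value carries the parameter corresponding to the dynamic linkage relationship. The pre-filled gene sample under test is then input into the trained missing gene prediction model, which predicts the gene value at the gene missing position where each fill value is located according to that fill value and outputs a complete gene sample filled with the predicted gene values, quickly completing the filling of missing genotype data and improving the filling rate.
In the method for filling missing genotype data provided by this embodiment of the present invention, gene data of several different individuals is obtained from a gene bank to generate several gene samples, each including several randomly covered gene values; for each gene missing position in the gene sample, a fill value for pre-filling that position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship; each pre-filled gene sample is input into the missing gene prediction model, the gene value at the gene missing position is predicted according to the pre-filled fill value, and a complete gene sample filled with the predicted gene values is output. Training data is generated by randomly covering gene values in each gene sample generated from valid gene data. Fill values carrying the parameters of the dynamic linkage relationship between the gene missing position and the gene sample are generated and used to fill the training data. Thus, after the training data including the fill values is input into the missing gene prediction model, the model can predict gene values according to the dynamic linkage relationship corresponding to the parameters carried by the fill values and output complete gene samples filled with the predicted gene values, completing the training of the model. Using the trained missing gene prediction model to predict and fill missing gene values improves gene filling efficiency. Moreover, the model predicts and fills missing gene values in combination with the dynamic linkage relationship carried by the fill value at each gene missing position, which improves the accuracy of the predicted gene values.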
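Embodiment 2
Fig. 3 is a schematic flowchart of the method for filling missing genotype data provided by Embodiment 2 of the present invention. On the basis of Embodiment 1, this embodiment further provides a process for optimizing the parameters carried by the fill values and the parameters of the missing gene prediction model in the method, further improving the accuracy of the predicted gene values. The method specifically includes:
S210. Compute a gradient by back-propagation from the complete gene sample and the corresponding original values of the gene sample, and update the parameters of the missing gene prediction model with the gradient.
The process by which the method predicts and fills a gene sample under test with missing gene data has two parts. In the first part, for each gene missing position in the gene sample under test, a fill value for pre-filling that position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample under test, and the generated fill value carries the parameter corresponding to the dynamic linkage relationship. In the second part, the pre-filled gene sample under test is input into the trained missing gene prediction model, which predicts the gene value at each gene missing position according to the fill values in the sample and outputs a complete gene sample whose gene missing positions are filled with predicted gene values, completing the filling of the sample. To improve the prediction accuracy of the missing gene prediction model, the parameters in the model can also be optimized through iterative training.
Specifically, gene sequences of a sufficiently large number of different individuals are obtained from a known gene bank to generate several gene samples, and several gene values are randomly covered in each gene sample so that every gene sample has missing gene data and can serve as training data. In each round of iterative training of the missing gene prediction model, after the model predicts gene values for any pre-filled gene sample in the input training data and outputs a complete gene sample, a gradient can be computed by back-propagation from the complete gene sample output by the model and the corresponding original values of the gene sample, and the parameters in the model are updated with the computed gradient so that the model converges. Optionally, the original values of each gene sample can be stored after several gene values in it are randomly covered.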
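The idea of updating model parameters from the gradient of the error between predicted and original values can be shown on a toy problem. The source does not specify the loss function or the optimizer, so the sketch below assumes a mean-squared-error loss and plain gradient descent, and it replaces the Transformer with a hypothetical one-parameter model `w * x`; the function name `mse_grad_step` and all numbers are illustrative.

```python
from typing import List


def mse_grad_step(w: float, xs: List[float], ys: List[float], lr: float) -> float:
    """One gradient-descent step on the loss mean((w*x - y)^2) for a toy model w*x."""
    n = len(xs)
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - lr * grad


xs = [1.0, 2.0, 3.0]   # stand-in model inputs
ys = [2.0, 4.0, 6.0]   # stand-in original (uncovered) values
w = 0.0
for _ in range(200):
    w = mse_grad_step(w, xs, ys, lr=0.05)
```

In a real training loop an automatic-differentiation framework would compute the gradients for all model parameters; the update rule per parameter has the same shape as the single step shown here.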
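S220. Pick out all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample.
Before a gene sample with missing genes is input into the missing gene prediction model, a fill value for pre-filling each gene missing position in the sample must be generated according to the dynamic linkage relationship between that position and the gene sample, and the generated fill value carries the parameter corresponding to the dynamic linkage relationship. After each round of iterative training of the missing gene prediction model, the accuracy of predicting and filling missing gene values can therefore also be improved by optimizing the parameters carried by the fill values used to pre-fill the gene missing positions.
S230. Compare all the predicted gene values with the corresponding original gene values, and update the preset parameter.
Specifically, after the missing gene prediction model predicts gene values for any gene sample including fill values in the input training data and outputs a complete gene sample, all the predicted gene values in the complete gene sample are picked out according to the originally covered gene sample corresponding to it. All the predicted gene values are then compared with the original gene values at the covered gene missing positions of that gene sample, and the weighted average of all predicted gene values and the corresponding original gene values is calculated to update the parameters carried by the fill values.
Moreover, because the parameter-carrying fill value is generated from the physical sequence dimension parameter $\gamma_t$ of the gene dimension and the sample dimension parameter $\gamma_d$, and both $\gamma_t$ and $\gamma_d$ include the preset parameter $W_\gamma$, comparing all the predicted gene values with the original gene values at the covered gene missing positions and calculating the weighted average of all predicted gene values and the corresponding original gene values updates the preset parameter $W_\gamma$.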
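Steps S220 and S230 start by picking out the predictions at the covered positions and comparing them with the stored original values. The sketch below shows only that selection and comparison; how the comparison is turned into an update of $W_\gamma$ is not detailed enough in the source to reproduce, so it is left out. The mask convention (0 means covered), the function name `masked_accuracy` and the sample values are assumptions of this example.

```python
from typing import List


def masked_accuracy(original: List[int], predicted: List[int], mask: List[int]) -> float:
    """Fraction of covered positions (mask == 0) whose prediction equals the original."""
    pairs = [(o, p) for o, p, m in zip(original, predicted, mask) if m == 0]
    if not pairs:
        raise ValueError("no covered positions to compare")
    return sum(o == p for o, p in pairs) / len(pairs)


original = [0, 1, 2, 1, 0]    # stored values before covering (hypothetical)
predicted = [0, 1, 2, 2, 0]   # integer predictions for the complete sample
mask = [1, 0, 1, 0, 0]        # 0 marks the positions that were covered
accuracy = masked_accuracy(original, predicted, mask)  # 2 of 3 covered positions match
```

Restricting the comparison to covered positions keeps observed values, which the model never had to predict, from inflating the score.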
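In one implementation example, because the known gene sequences in the gene bank are all gene data obtained by fixed-point sequencing, and so that the missing gene prediction model can learn from other, untested gene data in the continuous gene sequence, in the prediction stage of the missing gene prediction model gene samples with truly missing values also replace the gene vector values deliberately covered at the model input stage, and the final predicted gene vector is then obtained through the missing gene prediction model. Because the predicted gene vector is floating-point data whereas the true vector is an integer vector, the floating-point predicted gene vector must be converted into an integer vector. The predicted gene values at the truly missing positions are then selected and combined with the uncovered true gene data to obtain the final vector.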
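The final post-processing step, converting floating-point predictions to integers and merging them with the observed values, can be sketched as follows. The rounding rule (round half up), the clamping range 0 to 2, the use of `None` for a missing value and the function names are assumptions of this example; the source states only that the floating-point vector is converted to an integer vector and combined with the uncovered data.

```python
import math
from typing import List, Optional


def to_genotype(x: float, lo: int = 0, hi: int = 2) -> int:
    """Round a floating-point prediction to the nearest integer and clamp to [lo, hi]."""
    return min(hi, max(lo, math.floor(x + 0.5)))


def merge(observed: List[Optional[int]], predicted: List[float]) -> List[int]:
    """Keep observed values; use the rounded prediction only where the value is missing."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return [to_genotype(p) if o is None else o for o, p in zip(observed, predicted)]


observed = [0, None, 2, None]      # None marks a truly missing genotype (hypothetical)
predicted = [0.2, 1.7, 1.1, 0.4]   # floating-point model output for every position
final = merge(observed, predicted)  # → [0, 2, 2, 0]
```

Note that position 2 keeps its observed value 2 even though the model output there rounds to 1: predictions are used only at the missing positions.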
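Embodiment 3
Fig. 4 shows the device for filling missing genotype data provided by Embodiment 3 of the present invention. On the basis of Embodiment 1 or 2, an embodiment of the present invention further provides a detection device 4, which includes:
a gene sample generating module 401, configured to obtain gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
In one implementation example, for obtaining gene data of several different individuals from the gene bank to generate several gene samples, each including several randomly covered gene values, the gene sample generating module 401 includes:
a truncation unit, configured to obtain gene data of several different individuals from the gene bank and truncate the gene data into several gene samples of equal length through sliding-window processing.
a fill value generating module 402, configured to generate, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
In one implementation example, for generating the fill value for each gene missing position according to that dynamic linkage relationship, the fill value generating module 402 includes:
a value determining unit, configured to determine the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position;
In one implementation example, when determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position, the value determining unit applies the following: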
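if the gene missing position
Figure PCTCN2020128853-appb-000020
is dominant, the mask value
Figure PCTCN2020128853-appb-000021
is set to 1; if the gene missing position
Figure PCTCN2020128853-appb-000022
is not dominant, the mask value
Figure PCTCN2020128853-appb-000023
is set to 0; where t is the physical sequence dimension of the gene dimension and d is the vector dimension of the sample dimension;
the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d are determined according to the set mask value:
Figure PCTCN2020128853-appb-000024
Figure PCTCN2020128853-appb-000025
where the jump of any sample dimension within the physical sequence dimension t of the gene dimension, and the jump of any physical sequence dimension of the gene dimension within the sample dimension d, are both
Figure PCTCN2020128853-appb-000026
$s_t$ is a site corresponding to the physical sequence dimension t of the gene dimension in the preset scale matrix; $s_d$ is a site corresponding to the sample dimension d in the preset scale matrix.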
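a parameter generating unit for the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, configured to calculate the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter;
In one implementation example, the parameter generating unit computes these parameters as:
$\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
$\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
where $\gamma_t$ is the physical sequence dimension parameter of the gene dimension, $\gamma_d$ is the sample dimension parameter, $W_\gamma$ is the preset parameter, and $b_\gamma$ is the offset.
a fill value generating unit, configured to generate a parameter-carrying fill value according to the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, so as to pre-fill the gene missing position.
a gene prediction filling module 403, configured to input each pre-filled gene sample into the missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
In one implementation example, for this prediction and output, the gene prediction filling module 403 includes:
an encoding unit, configured to perform feature extraction through N encoders according to the parameter carried by each fill value in the pre-filled gene sample, and to output a context vector, where N ≥ 3;
a decoding unit, configured to perform, through N decoders and according to the context vector, gene value prediction and aligned output for the gene missing position where each fill value is located, to obtain the complete gene sample filled with the predicted gene values.
In one implementation example, the filling device further includes:
a model update module, configured to compute a gradient by back-propagation from the complete gene sample and the corresponding original values of the gene sample, and to update the parameters of the missing gene prediction model with the gradient;
a predicted gene extraction module, configured to pick out all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample;
a parameter update module, configured to compare all the predicted gene values with the corresponding original gene values and update the preset parameter.
In the device for filling missing genotype data provided by this embodiment of the present invention, gene data of several different individuals is obtained from a gene bank to generate several gene samples, each including several randomly covered gene values; for each gene missing position in the gene sample, a fill value for pre-filling that position is generated according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship; each pre-filled gene sample is input into the missing gene prediction model, the gene value at the gene missing position is predicted according to the pre-filled fill value, and a complete gene sample filled with the predicted gene values is output. Training data is generated by randomly covering gene values in each gene sample generated from valid gene data. Fill values carrying the parameters of the dynamic linkage relationship between the gene missing position and the gene sample are generated and used to fill the training data. Thus, after the training data including the fill values is input into the missing gene prediction model, the model can predict gene values according to the dynamic linkage relationship corresponding to the parameters carried by the fill values and output complete gene samples filled with the predicted gene values, completing the training of the model. Using the trained missing gene prediction model to predict and fill missing gene values improves gene filling efficiency. Moreover, the model predicts and fills missing gene values in combination with the dynamic linkage relationship carried by the fill value at each gene missing position, which improves the accuracy of the predicted gene values.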
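Embodiment 4
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention. The server includes: a processor 1, a memory 2, and a computer program 3 stored in the memory 2 and executable on the processor 1, such as a program implementing the method for filling missing genotype data. When the processor 1 executes the computer program 3, the steps in the above embodiments of the method for filling missing genotype data are implemented, such as steps S110 to S130 shown in Fig. 1.
For example, the computer program 3 may be divided into one or more modules, which are stored in the memory 2 and executed by the processor 1 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program 3 in the server. For example, the computer program 3 can be divided into a gene sample generating module, a fill value generating module and a gene prediction filling module, whose specific functions are as follows:
the gene sample generating module is configured to obtain gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
the fill value generating module is configured to generate, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
the gene prediction filling module is configured to input each pre-filled gene sample into the missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
The server may include, but is not limited to, the processor 1, the memory 2, and the computer program 3 stored in the memory 2. Those skilled in the art will understand that Fig. 5 is only an example of a server and does not constitute a limitation on the server, which may include more or fewer components than shown, combine certain components, or use different components; for example, the server may also include input and output devices, network access devices, buses, and so on.
The processor 1 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 2 may be an internal storage unit of the server, such as a hard disk or internal memory of the server. The memory 2 may also be an external storage device, such as a plug-in hard disk fitted to the server, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. Further, the memory 2 may include both an internal storage unit of the server and an external storage device. The memory 2 is used to store the computer program and the other programs and data required by the method for filling missing genotype data. The memory 2 can also be used to temporarily store data that has been output or is about to be output.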
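Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practical applications, the above functions can be assigned to different functional units and modules as needed; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of distinguishing them from one another and are not intended to limit the scope of protection of the present application. For the specific working process of the units and modules in the above system, refer to the corresponding process in the foregoing method embodiments; it is not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed device/terminal device and method may be implemented in other ways. For example, the device/terminal device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the scope of protection of the present invention.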

Claims (10)

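  1. A method for filling missing genotype data, characterized by comprising:
    obtaining gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
    for each gene missing position in the gene sample, generating a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
    inputting each pre-filled gene sample into a missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting a complete gene sample filled with the predicted gene values.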
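  2. The method for filling missing genotype data according to claim 1, characterized in that generating, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located comprises:
    determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position;
    calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter;
    generating a parameter-carrying fill value according to the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, so as to pre-fill the gene missing position.
  3. The method for filling missing genotype data according to claim 1 or 2, characterized in that inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting a complete gene sample filled with the predicted gene values comprises:
    performing feature extraction through N encoders according to the parameter carried by each fill value in the pre-filled gene sample, and outputting a context vector, where N ≥ 3;
    performing, through N decoders and according to the context vector, gene value prediction and aligned output for the gene missing position where each fill value is located, to obtain the complete gene sample filled with the predicted gene values.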
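  4. The method for filling missing genotype data according to claim 1, characterized in that obtaining gene data of several different individuals from a gene bank to generate several gene samples comprises:
    obtaining gene data of several different individuals from the gene bank, and truncating the gene data into several gene samples of equal length through sliding-window processing.
  5. The method for filling missing genotype data according to claim 1, characterized in that, after inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting the complete gene sample filled with the predicted gene values, the method further comprises:
    computing a gradient by back-propagation from the complete gene sample and the corresponding original values of the gene sample, and updating the parameters of the missing gene prediction model with the gradient.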
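  6. The method for filling missing genotype data according to claim 2, characterized in that determining the values of the physical sequence dimension of the gene dimension and of the sample dimension according to the gene missing position comprises:
    if the gene missing position
    Figure PCTCN2020128853-appb-100001
    is dominant, setting the mask value
    Figure PCTCN2020128853-appb-100002
    to 1; if the gene missing position
    Figure PCTCN2020128853-appb-100003
    is not dominant, setting the mask value
    Figure PCTCN2020128853-appb-100004
    to 0; where t is the physical sequence dimension of the gene dimension and d is the vector dimension of the sample dimension;
    determining, according to the set mask value, the jump $\delta_t$ in the physical sequence dimension t of the gene dimension and the jump $\delta_d$ in the sample dimension d:
    Figure PCTCN2020128853-appb-100005
    Figure PCTCN2020128853-appb-100006
    where the jump of any sample dimension within the physical sequence dimension t of the gene dimension, and the jump of any physical sequence dimension of the gene dimension within the sample dimension d, are both
    Figure PCTCN2020128853-appb-100007
    $s_t$ is a site corresponding to the physical sequence dimension t of the gene dimension in the preset scale matrix; $s_d$ is a site corresponding to the sample dimension d in the preset scale matrix.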
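  7. The method for filling missing genotype data according to claim 6, characterized in that calculating the data relationship, in the physical sequence dimension of the gene dimension and in the sample dimension, of the gene sample containing the gene missing position, to obtain the physical sequence dimension parameter of the gene dimension and the sample dimension parameter, comprises:
    $\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$;
    $\gamma_d = \exp\{-\max(0,\, W_\gamma \delta_d + b_\gamma)\}$;
    where $\gamma_t$ is the physical sequence dimension parameter of the gene dimension, $\gamma_d$ is the sample dimension parameter, $W_\gamma$ is the preset parameter, and $b_\gamma$ is the offset.
  8. The method for filling missing genotype data according to claim 7, characterized in that, after inputting each pre-filled gene sample into the missing gene prediction model, predicting the gene value at the gene missing position according to the pre-filled fill value, and outputting the complete gene sample filled with the predicted gene values, the method further comprises:
    picking out all the predicted gene values in the complete gene sample according to the originally covered gene sample corresponding to the complete gene sample;
    comparing all the predicted gene values with the corresponding original gene values, and updating the preset parameter.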
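  9. A device for filling missing genotype data, characterized by comprising:
    a gene sample generating module, configured to obtain gene data of several different individuals from a gene bank to generate several gene samples, each gene sample including several randomly covered gene values;
    a fill value generating module, configured to generate, for each gene missing position in the gene sample, a fill value for pre-filling the gene missing position according to the dynamic linkage relationship between the gene missing position and the gene sample in which it is located, the fill value carrying a parameter corresponding to the dynamic linkage relationship;
    a gene prediction filling module, configured to input each pre-filled gene sample into a missing gene prediction model, predict the gene value at the gene missing position according to the pre-filled fill value, and output a complete gene sample filled with the predicted gene values.
  10. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method for filling missing genotype data according to any one of claims 1 to 8.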
PCT/CN2020/128853 2019-11-22 2020-11-13 Method, device and server for filling missing genotype data WO2021098615A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911152927.6A CN111028884B (zh) 2019-11-22 2019-11-22 Method, device and server for filling missing genotype data
CN201911152927.6 2019-11-22

Publications (1)

Publication Number Publication Date
WO2021098615A1 true WO2021098615A1 (zh) 2021-05-27

Family

ID=70206347

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128853 WO2021098615A1 (zh) 2019-11-22 2020-11-13 Method, apparatus and server for filling missing genotype data

Country Status (2)

Country Link
CN (1) CN111028884B (zh)
WO (1) WO2021098615A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272704A (zh) * 2023-11-23 2023-12-22 湖南华自卓创智能技术有限责任公司 Digital-twin-driven data processing system for multi-source heterogeneous data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
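CN111028884B (zh) * 2019-11-22 2023-08-25 中国科学院深圳先进技术研究院 Method, apparatus and server for filling missing genotype data
CN112069809B (zh) * 2020-08-11 2022-05-24 桂林电子科技大学 Method and system for generating missing text
CN113851191A (zh) * 2021-09-06 2021-12-28 中科曙光国际信息产业有限公司 Gene filling method, apparatus, computer device and storage medium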

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050186609A1 (en) * 2004-02-21 2005-08-25 Oh Ji-Young Method and system of replacing missing genotyping data
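CN106202969A (zh) * 2016-08-01 2016-12-07 东北大学 Tumor molecular subtyping prediction system
CN106779076A (zh) * 2016-11-18 2017-05-31 栾图 Bioinformation-based system for selective breeding of improved varieties and algorithm thereof
CN107833636A (zh) * 2017-12-04 2018-03-23 浙江鸿赋堂健康管理有限公司 Tumor prediction method
CN109994151A (zh) * 2019-01-23 2019-07-09 杭州师范大学 Tumor driver gene prediction system based on complex networks and machine learning methods
CN111028884A (zh) * 2019-11-22 2020-04-17 中国科学院深圳先进技术研究院 Method, apparatus and server for filling missing genotype data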

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170228496A1 (en) * 2014-07-25 2017-08-10 Ontario Institute For Cancer Research System and method for process control of gene sequencing
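CN110211631B (zh) * 2018-02-07 2024-02-09 深圳先进技术研究院 Genome-wide association analysis method, system and electronic device
CN109754843B (zh) * 2018-12-04 2021-02-19 志诺维思(北京)基因科技有限公司 Method and apparatus for detecting small-fragment insertions and deletions in a genome
CN110468207B (zh) * 2019-09-02 2021-03-23 北京师范大学 Glioma EM/PM molecular subtyping method based on Taqman low-density chip and application thereof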

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
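CN117272704A (zh) * 2023-11-23 2023-12-22 湖南华自卓创智能技术有限责任公司 Digital-twin-driven data processing system for multi-source heterogeneous data
CN117272704B (zh) * 2023-11-23 2024-01-26 湖南华自卓创智能技术有限责任公司 Digital-twin-driven data processing system for multi-source heterogeneous data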

Also Published As

Publication number Publication date
CN111028884B (zh) 2023-08-25
CN111028884A (zh) 2020-04-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890358

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16/01/2023)
