CN114864004A - Missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder - Google Patents


Info

Publication number: CN114864004A
Application number: CN202210384234.5A
Authority: CN (China)
Legal status: Pending
Prior art keywords: data, model, encoder, layer, genotype
Other languages: Chinese (zh)
Inventors: 刘毅, 王欣
Current and original assignee: Yangzhou University
Application filed by Yangzhou University; priority to CN202210384234.5A

Classifications

    • G: PHYSICS
      • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
          • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
          • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/047: Probabilistic or stochastic networks
                • G06N3/048: Activation functions
              • G06N3/08: Learning methods

Abstract

The invention provides a missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder. Known genotype data are first numerically converted and one-hot encoded and divided into a training set and a validation set; a sparse convolutional neural network model is then built, and missing markers on the gene sequence are imputed in a segmented sliding-window manner, with overlapping windows yielding predictions for the central region, where data features are sufficient, which are spliced together while the edge-region predictions are discarded. The imputation accuracy of the missing markers is then computed, and the hyperparameters of the neural network are adjusted according to feedback from early training. In practical application, imputation results on multiple species such as maize and rice show that the accuracy of this method is significantly higher than that of traditional algorithms such as KNN and SVD. The method offers high imputation accuracy, a simple model structure, and high training efficiency, and has broad application prospects in the field of gene sequence analysis.

Description

Missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder
Technical Field
The invention belongs to the technical fields of computing and bioinformatics, and particularly relates to a missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder.
Background
Genotype imputation is a key step in genome sequence analysis for humans, animals, and plants, and is used in genome-wide association studies and genome-wide prediction. Gene sequencing data contain numerous missing values that arise for a variety of reasons, including low call rates, deviations from Hardy-Weinberg equilibrium, and rare or low-frequency variants in the sample. The working principle of genotype imputation is to infer the missing values in the genotype distribution computationally, usually using correlation or linkage information between the untyped variants and nearby markers. The genetic variants imputed are mainly single nucleotide polymorphisms (SNPs); other types of genetic variation can also be imputed as long as they are in linkage disequilibrium (LD) with SNP variants.
Existing imputation methods fall into two categories: reference-panel-based and reference-free. The first category requires haplotype information from samples with the same or a similar population background, a condition that is often difficult to satisfy. The second category comprises reference-free missing-marker imputation methods, including the row-average method, distance- or similarity-based methods such as KNN, singular value decomposition (SVD), and the sparse convolutional denoising autoencoder algorithm (SCDA). The row-average method is one of the simplest: it replaces a missing value with the average of all non-missing markers, or with the most frequent value at the same site. Distance- or similarity-based methods typically infer the missing value from data similar or close to the missing marker, but neither the KNN algorithm, SVD, nor SCDA can effectively learn local linkage information on genomic sequences.
Deep learning has demonstrated great efficacy in applications such as machine vision and bioinformatics, and in many fields autoencoders can effectively address missing-data problems. One example is a multilayer-perceptron-based denoising autoencoder method [Duan, Y.; Lv, Y.; Liu, Y.L.; Wang, F.Y. An efficient realization of deep learning for traffic data imputation. Transp. Res. Part C Emerg. Technol. 2016, 72, 168-181. DOI: 10.1016/j.trc.2016.09.015] used for traffic data, whose performance is comparable to the SVD method. Another work [Beaulieu-Jones, B.K.; Moore, J.H. Missing data imputation in the electronic health record using deeply learned autoencoders. Pac. Symp. Biocomput. 2017, 22, 207-218. DOI: 10.1142/9789813207813_0021] applied deep-learning autoencoders to impute electronic health records, comparing popular multiple-imputation strategies against deep autoencoders on the Pooled Resource Open-Access ALS Clinical Trials database (PRO-ACT). An effective technique for encoding data correlations is a convolutional network, which can learn the underlying structure and relationships in genotype data through convolution kernels that capture various local patterns within a filter window. Markers in different segments of an organism's genome sequence exhibit different linkage characteristics, and the closer two markers are, the stronger their linkage. If a large number of markers on a chromosome are encoded and modeled with only a generic denoising autoencoder, problems such as the curse of dimensionality and overfitting arise, so the tight linkage between nearby markers cannot be exploited effectively.
Disclosure of Invention
To solve these problems, the invention provides a new learning scheme for genome sequence analysis, called the missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder, for reference-free genotype imputation. The method first applies numerical conversion and one-hot encoding to known genotype data and divides it into a training set and a validation set; it then builds a sparse convolutional neural network model and imputes missing markers on the gene sequence in a segmented sliding-window manner, obtaining predictions for the central region of each window, where data features are sufficient, through window overlap; these central predictions are spliced together and the edge-region predictions are discarded. The imputation accuracy of the missing markers is then computed, and the neural network hyperparameters are adjusted according to feedback from early training. In practical application, imputation results on multiple species such as maize and rice show that the accuracy of this method is significantly higher than that of traditional algorithms such as KNN and SVD. The method offers high imputation accuracy, a simple model structure, and high training efficiency, and has broad application prospects in the field of gene sequence analysis.
To achieve this purpose, the technical scheme of the invention is as follows:
the method for filling the missing mark of the self-encoder based on the sliding window sparse convolution denoising comprises the following steps:
step 1, firstly, selecting genome sequence data of a biological sample, and preprocessing the selected genome data to obtain a required target sequence set;
step 2, processing original data, extracting data of the original vcf file by using a PyVcf toolkit, finding a sample class corresponding to each locus, and extracting GT data in the sample class as genotype sample data; carrying out numerical value coding on the gene data to obtain a numerical value data set;
step 3, carrying out data transformation on the numerical data set through modification sites and individuals, customizing codes again aiming at different gene data, and converting the codes into a matrix format to obtain a sorted sample set;
step 4, dividing the sample set obtained in the step 3 into a training set and a verification set to obtain a training set and a verification set sample set;
step 5, hiding the training set samples and the verification set samples through data with different proportions, and converting the training set samples and the verification set samples into a data set used by the final training model;
step 6, constructing a decoder and an encoder model of the sparse convolution self-encoder, and selecting an initial sliding window model interval, neuron size and interval parameters;
step 7, segmenting the gene data to be predicted according to the initial sliding window model interval in the step 6, respectively passing the data set of each segment through the input layer of the model, adopting a coder-decoder structure to perform feature extraction on the linkage relation of the genotype data, and finally respectively obtaining each new group of output data sets;
8, discarding edge parts of each group of output data sets obtained in the step 7, and splicing each group of data according to the predicted sequence to obtain a predicted genotype matrix; extracting a predicted value and an actual value of the genotype of the hidden locus for the genotype matrix, comparing the predicted value and the actual value, and calculating filling precision;
step 9, repeating the steps 7 and 8, optimizing the self-encoder parameters and the sliding window interval according to the filling precision obtained each time, and finding out an optimal parameter model;
and step 10, preprocessing a missing gene data set to be predicted, converting the missing gene data set into a data type matched with the model, putting the data type into the optimized model in the step 9 for predicting the genotype, and outputting a genotype data prediction value.
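As an illustration of steps 7 and 8, the following is a minimal sketch of the segmented sliding-window prediction with central-region stitching. It is a sketch only: the model_predict callable stands in for the trained autoencoder, and the window core and flank sizes (100 sites each, matching the embodiment below) are assumptions.

```python
import numpy as np

def sliding_window_impute(genotypes, model_predict, core=100, flank=100):
    """Impute a (samples, sites) matrix segment by segment.

    Each window carries up to `flank` extra context sites on each side;
    only the central `core` sites of each prediction are kept, and the
    kept pieces are spliced together in order (edge regions discarded).
    """
    n_sites = genotypes.shape[1]
    out = np.empty_like(genotypes)
    for start in range(0, n_sites, core):
        lo = max(0, start - flank)                 # overlapping window
        hi = min(n_sites, start + core + flank)
        pred = model_predict(genotypes[:, lo:hi])  # same shape as the slice
        out[:, start:start + core] = pred[:, start - lo:start - lo + core]
    return out

def imputation_accuracy(pred, truth, hidden_mask):
    """Proportion of correctly imputed markers among all hidden markers."""
    return float((pred[hidden_mask] == truth[hidden_mask]).mean())
```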
Further, step 2 specifically comprises the following sub-steps:
Step 2-1: determine the specific chromosome and site information of the selected plant;
Step 2-2: call the PyVCF toolkit and cyclically read the chromosome gene information of each site one by one;
Step 2-3: from the parsed record fields CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, and samples, extract the genotype (GT) data in samples as genotype sample data;
Step 2-4: convert the genotype sample data into the raw genotype matrix, and re-encode the different genotype values of this matrix to define encoded data convenient for training;
Step 2-5: encode the variables of the different genotypes as three numbers;
Step 2-6: apply one-hot encoding to sparsify the numerical samples processed in step 2-5, and organize all gene segments as the numerical data set.
Further, in step 2-5 the encoding is specifically: the genotype "0|0" is encoded as "1", the genotypes "0|1" and "1|0" are encoded as "2", and the genotype "1|1" is encoded as "3".
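A minimal sketch of steps 2-2 through 2-6 using the PyVCF toolkit (the VCF file name is a placeholder; the encoding map follows the scheme above):

```python
import numpy as np
import vcf  # the PyVcf toolkit

# Numeric codes for the diploid genotype strings, per step 2-5.
GT_CODE = {'0|0': 1, '0|1': 2, '1|0': 2, '1|1': 3}

reader = vcf.Reader(open('chromosome1.vcf', 'r'))  # placeholder file name
rows = []
for record in reader:             # one record per site (CHROM, POS, REF, ALT, ...)
    site = []
    for call in record.samples:   # one call per individual
        gt = call['GT']           # genotype string such as '0|1'
        site.append(GT_CODE.get(gt, 0))  # unrecognized/missing genotypes -> 0
    rows.append(site)

genotype_matrix = np.array(rows).T    # individuals x sites
one_hot = np.eye(4)[genotype_matrix]  # step 2-6: one-hot sparsification, codes 0..3
```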
Further, step 4 specifically comprises the following sub-steps:
Step 4-1: traverse the biological sample information to obtain the number of individuals, and save the individual names of the biological samples to a file;
Step 4-2: read the individual-name file from step 4-1, call the random() function, and split the individuals at a 9:1 ratio to establish a training-set individual-name index table and a validation-set individual index table; save the two index tables to files, and after classification check that the individual counts of the two files are in the right proportion;
Step 4-3: using the bcftools tool in a Linux environment, divide the biological data set according to the two individual files generated in step 4-2 by extracting the individual indices, yielding the training set and validation set;
Step 4-4: repeat the above steps five times to generate five groups of training- and validation-set data under random index tables;
Step 4-5: using R data-processing facilities, convert each group's training and validation sets from VCF format to CSV format, optimizing the reading and data handling to improve program speed.
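A minimal sketch of the 9:1 split of steps 4-1 through 4-3 (file names are placeholders; the bcftools command in the final comment extracts the listed individuals from the VCF):

```python
import random

# Steps 4-1/4-2: read the individual names and split them 9:1 at random.
with open('individual_names.txt') as f:              # placeholder file
    names = [line.strip() for line in f if line.strip()]

random.shuffle(names)
cut = int(len(names) * 0.9)
splits = {'train_index.txt': names[:cut], 'valid_index.txt': names[cut:]}

for path, subset in splits.items():
    with open(path, 'w') as f:
        f.write('\n'.join(subset) + '\n')

print(len(splits['train_index.txt']), len(splits['valid_index.txt']))  # check 9:1
# Step 4-3 (shell): bcftools view -S train_index.txt input.vcf -o train.vcf
```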
Further, step 5 specifically comprises the following sub-steps:
Step 5-1: define three hiding ratios, corresponding to genotype scenarios with low, medium, and high missing rates respectively;
Step 5-2: randomly generate hidden sites corresponding to each missing rate according to the three hiding ratios, and save the hiding order of the sites, so that the hidden positions imputed by the four prediction models are identical;
Step 5-3: hide the gene sites of the training-set and validation-set samples according to the hidden-site order, with the hidden-site values set to one of the codes per the encoding of step 2-5; carry out five groups of repeated experiments for each of the three hiding rates;
Step 5-4: organize the training-set and validation-set sample sets formed in step 5-3 into the final sample set.
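A minimal sketch of the random site hiding of steps 5-2 and 5-3, using the genotype_matrix from the earlier sketch (the hidden-site code 0 follows the embodiment below; a fixed seed is shared so that every compared model hides the same positions):

```python
import numpy as np

def hide_sites(matrix, hide_ratio, seed, hidden_code=0):
    """Hide a fraction of entries and return the masked copy plus the mask."""
    rng = np.random.default_rng(seed)  # fixed seed = reproducible hiding order
    mask = rng.random(matrix.shape) < hide_ratio
    hidden = matrix.copy()
    hidden[mask] = hidden_code
    return hidden, mask

# Three hiding ratios for low, medium, and high missing rates (10%, 50%, 90%).
masked_sets = {r: hide_sites(genotype_matrix, r, seed=42) for r in (0.10, 0.50, 0.90)}
```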
Further, step 6 specifically comprises the following sub-steps:
Step 6-1: select an initial seven-layer sparse convolutional framework model, with an initial sliding-window interval of 100 and a filter size of 20;
Step 6-2: determine the encoder mechanism: compress the reference samples and the samples to be imputed into multi-channel feature vectors, using an attention mechanism to extract temporal and spatial information;
Step 6-3: construct the decoder: upsample and reconstruct the multi-channel feature vectors, and perform context prediction using the temporal and spatial feature information to obtain the imputed genotypes;
Step 6-4: adopt a multi-class loss function that evaluates the reconstruction loss of the missing region and of the non-missing region simultaneously;
Step 6-5: use max pooling and upsampling in the model architecture; max pooling is a downsampling process that reduces dimensionality by applying a max filter to non-overlapping sub-regions of the previous layer; upsampling is the opposite of max pooling, increasing dimensionality by repeating data along an axis.
Further, the autoencoder formulas and parameters in step 6-2 specifically comprise:

h = f_θ(x) = Φ(Wx + b)

where θ = {W, b}, W is the m×n encoder weight matrix, b is the encoder bias vector, and Φ is the encoder activation function, such as linear or a rectified linear unit (ReLU). The hidden representation h is also referred to as the latent representation. The decoder maps the hidden representation h back into a reconstructed vector z ∈ R^n:

z = g_θ′(h) = Φ′(W′h + b′)

where θ′ = {W′, b′}; W′ is the n×m decoder weight matrix; b′ is the decoder bias vector; Φ′ is the decoder activation function, of the same form as Φ. The autoencoder parameters θ and θ′ are optimized to minimize the average reconstruction error, with the loss function:

L(x^{(i)}, z^{(i)}) = -(y·log(p) + (1-y)·log(1-p))

The purpose of the autoencoder is to reconstruct z such that z^{(i)} ≈ x^{(i)} by minimizing the loss function L(x^{(i)}, z^{(i)}), which may be defined as the mean squared error for continuous data or the cross-entropy for discrete data.
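A small numeric sketch of these formulas (toy dimensions; a ReLU encoder and a sigmoid decoder so that the cross-entropy loss is well defined):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                                    # input dimension n, hidden dimension m
x = rng.random(n)                              # toy input in (0, 1)

W, b = rng.normal(size=(m, n)), np.zeros(m)    # encoder parameters theta = {W, b}
W2, b2 = rng.normal(size=(n, m)), np.zeros(n)  # decoder parameters theta' = {W', b'}

h = np.maximum(0.0, W @ x + b)                 # h = f_theta(x) = Phi(Wx + b), ReLU
z = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))       # z = g_theta'(h), sigmoid output

# Cross-entropy reconstruction loss L(x, z) for targets and outputs in (0, 1)
loss = -np.mean(x * np.log(z) + (1 - x) * np.log(1 - z))
print(loss)
```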
Further, step 7 specifically comprises the following sub-steps:
Step 7-1: divide the final sample set formed in step 5 into five groups of training and test sets under the three hiding modes; divide each group's data set into segments according to the initial sliding-window interval of 100 preset in step 6, so that the 10000 sites form 100 segments; train on each segment's data set in turn through the model's input layer, which feeds the training set into the encoder for multi-channel vector compression;
Step 7-2: after the training samples enter the encoder, each sample enters the training model individually, and the adaptive parameters are adjusted through continuous training; the initial data dimension is (200, 1);
Step 7-3: after the feature values of the data are extracted, linear feature extraction is performed by a Conv1D layer with a Linear activation function, capturing the relations among features; the data dimension is (200, 128), with Param = 20608;
Step 7-4: after feature amplification by the MaxPooling1D layer, a Dropout layer further enhances model stability; the data dimension is (100, 128);
Step 7-5: after the Conv1D_1 layer, the nonlinear activation function ReLU increases model robustness; the data dimension is (100, 64), with Param = 327744;
Step 7-6: the data passes through the max_pooling1d_1 layer for another round of feature extraction, and a Dropout layer further enhances model stability; the data dimension is (50, 64);
Step 7-7: after the Conv1D_2 layer, a Linear activation function further extracts model feature parameters; the data dimension is (50, 32), with Param = 81952, and the encoder's feature extraction is complete;
Step 7-8: the data enters the decoding layers; after the Conv1D_3 layer, the ReLU function is used again to increase the model's nonlinear expressiveness; the data dimension is (50, 64), with Param = 81984;
Step 7-9: the data is dimension-amplified by the up_sampling1d layer and then passes through a Dropout layer; the dimension is (100, 64);
Step 7-10: the data passes through the conv1d_4 layer and the second upsampling layer; the dimension changes from the (100, 64) of step 7-9 to (100, 128) and then to (200, 128), with Param = 327808;
Step 7-11: after the final convolution layer, which accelerates model convergence, the model output is mapped to the interval (0, 1) by the SoftMax classification function.
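For reference, a Keras sketch of the layer stack in steps 7-2 through 7-11. This is a reconstruction, not the patented code: the kernel size of 40 is inferred from the stated parameter counts (for example 64 × (40·128 + 1) = 327744 for Conv1D_1), a 4-channel one-hot input is assumed for the same reason even though step 7-2 quotes the raw dimension as (200, 1), the dropout rates are placeholders, and the final 1-kernel softmax layer is an assumption.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(200, 4))                                     # one-hot window
x = layers.Conv1D(128, 40, padding='same', activation='linear')(inp)  # (200,128) 20608
x = layers.MaxPooling1D(2)(x)                                          # (100, 128)
x = layers.Dropout(0.25)(x)
x = layers.Conv1D(64, 40, padding='same', activation='relu')(x)       # (100,64) 327744
x = layers.MaxPooling1D(2)(x)                                          # (50, 64)
x = layers.Dropout(0.25)(x)
x = layers.Conv1D(32, 40, padding='same', activation='linear')(x)     # (50,32) 81952, encoder end
x = layers.Conv1D(64, 40, padding='same', activation='relu')(x)       # (50,64) 81984
x = layers.UpSampling1D(2)(x)                                          # (100, 64)
x = layers.Dropout(0.25)(x)
x = layers.Conv1D(128, 40, padding='same', activation='relu')(x)      # (100,128) 327808
x = layers.UpSampling1D(2)(x)                                          # (200, 128)
out = layers.Conv1D(4, 1, activation='softmax')(x)                     # per-site class probabilities
model = models.Model(inp, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()  # parameter counts can be checked against steps 7-3..7-10
```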
Further, the imputation accuracy in step 8 is the proportion of the number of correctly imputed markers to the total number of imputed markers.
Further, the process in step 9 of optimizing the autoencoder parameters and the sliding-window interval and finding the optimal parameter model specifically comprises the following sub-steps:
Step 9-1: compare the imputation accuracy obtained with the five-layer and seven-layer convolutional framework models, and select the encoding convolution-kernel model built on the seven-layer framework;
Step 9-2: select the filter size of the sparse encoder, where the formula and parameters of the sparse network model specifically comprise:

O(i) = Σ_{u=-⌊k/2⌋}^{⌊k/2⌋} F(u)·I(i+u)

where O(i) is the output at marker i of the input vector I, F is the convolution filter, and k is an odd number denoting the size of the convolution filter; the convolution operation in the equation is performed at every position of the input vector I, so that the convolution is applied to every genetic marker:

O(I, n) = Σ_{d=1}^{D} I_d * F_{n,d}

Each convolutional layer consists of n convolution filters, each of depth D, where D is the input depth; the convolution between the input I = {I_1, I_2, ..., I_D} and a set of n convolution filters {F_1, F_2, ..., F_n} produces a set of n activation maps, yielding an output of depth n;
Step 9-3: determine the optimal model sliding-window interval: fix the sliding-window size at its optimum according to the accuracy comparisons obtained from repeated experiments.
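A one-dimensional toy illustration of the per-marker convolution in step 9-2 ('same' padding yields one output per genetic marker; the symmetric filter makes convolution and cross-correlation coincide):

```python
import numpy as np

I = np.array([1., 2., 3., 2., 1., 0., 1.])  # encoded markers (input vector I)
F = np.array([0.25, 0.5, 0.25])             # convolution filter, k = 3 (odd)

O = np.convolve(I, F, mode='same')          # one output O(i) per marker position
print(O)                                    # [1.  2.  2.5 2.  1.  0.5 0.5]
```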
The beneficial effects of the invention are:
The invention develops a missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder, realizing effective learning of the linkage characteristics of each segment of biological gene markers. Compared with traditional methods, it significantly improves marker imputation accuracy, with especially pronounced advantages at higher marker missing rates. The method offers high imputation accuracy, a simple model structure, and high training efficiency; it has broad application prospects in the field of gene sequence analysis and can serve subsequent genome-wide association studies and genome-wide prediction.
Drawings
FIG. 1 is a flowchart of the missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder.
FIG. 2 is a schematic diagram of the autoencoder structure of the missing marker imputation method.
FIG. 3 is a schematic diagram of the sliding window of the missing marker imputation method.
FIG. 4 shows the comparison of imputation accuracy of the four methods at different marker missing rates for rice.
FIG. 5 shows the comparison of imputation accuracy of the four methods at different marker missing rates for maize.
Detailed Description
The technical solutions provided by the invention are described in detail below with reference to specific examples. It should be understood that the following embodiments merely illustrate the invention and do not limit its scope. Additionally, the steps illustrated in the flowchart of the figures may be performed in a computer system executing a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps may be performed in an order different from that shown here.
The invention provides a missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder, whose overall flow is shown in FIG. 1; it specifically comprises the following steps:
Step 1: first select the first 10000 sites of chromosome 1 of rice and of maize respectively, and preprocess the selected genome data to obtain the required target sequence set.
Step 2: process the raw data: extract data from the original VCF file using the PyVCF toolkit, locate the sample record corresponding to each site, and extract its GT data as genotype sample data; numerically encode the genotype data to obtain a numerical data set. This specifically comprises the following sub-steps:
Step 2-1: determine the specific chromosome and site information of the selected plant;
Step 2-2: call the PyVCF toolkit and cyclically read the chromosome gene information of each site one by one;
Step 2-3: from the parsed record fields CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, and samples, extract the genotype (GT) data in samples as genotype sample data;
Step 2-4: convert the genotype sample data into the raw genotype matrix, and re-encode the different genotype values of this matrix to define encoded data convenient for training;
Step 2-5: encode the variables of the different genotypes as three numbers; for example, the genotype "0|0" is encoded as "1", the genotypes "0|1" and "1|0" are encoded as "2", and the genotype "1|1" is encoded as "3";
Step 2-6: apply one-hot encoding to sparsify the numerical samples processed in step 2-5, and organize all gene segments as the numerical data set.
Step 3: transform the numerical data set by reorganizing sites and individuals, re-encode the different genotype data with custom codes, and convert to matrix format to obtain the organized sample set.
Step 4: divide the sample set obtained in step 3 into a training set and a validation set. This specifically comprises the following sub-steps:
Step 4-1: traverse the biological sample information to obtain the number of individuals, and save the individual names of the biological samples to a file;
Step 4-2: read the individual-name file from step 4-1, call the random() function, and split the individuals at a 9:1 ratio to establish a training-set individual-name index table and a validation-set individual index table; save the two index tables to files, and after classification check that the individual counts of the two files are in the right proportion;
Step 4-3: using the bcftools tool in a Linux environment, divide the biological data set according to the two individual files generated in step 4-2 by extracting the individual indices, yielding the training set and validation set;
Step 4-4: repeat the above steps five times to generate five groups of training- and validation-set data under random index tables;
Step 4-5: using R data-processing facilities, convert each group's training and validation sets from VCF format to CSV format, optimizing the reading and data handling to improve program speed.
Step 5: hide data in the training-set and validation-set samples at different proportions, converting them into the data sets used by the final training model. This specifically comprises the following sub-steps:
Step 5-1: define three hiding ratios of 10%, 50%, and 90%, corresponding to genotype scenarios with low, medium, and high missing rates respectively;
Step 5-2: randomly generate hidden sites corresponding to each missing rate according to the three hiding ratios, and save the hiding order of the sites, so that the hidden positions imputed by the four prediction models are identical;
Step 5-3: hide the gene sites of the training-set and validation-set samples according to the hidden-site order, with the hidden-site values set to "0" per the encoding of step 2-5; five groups of random site-hiding experiments are carried out under each of the three hiding regimes (the ratio of hidden sites to total sites: 10%, 50%, 90%);
Step 5-4: organize the training-set and validation-set sample sets formed in step 5-3 into the final sample set.
Step 6: construct the decoder and encoder models of the sparse convolutional autoencoder, and select the initial sliding-window interval, neuron sizes, and interval parameters; the overall structure of the autoencoder is shown in FIG. 2. This specifically comprises the following sub-steps:
Step 6-1: select an initial seven-layer sparse convolutional framework model, with an initial sliding-window interval of 100 and a filter size of 20.
Step 6-2: determine the encoder mechanism: compress the reference samples and the samples to be imputed into multi-channel feature vectors, using an attention mechanism to extract temporal and spatial information.
Preferably, the relevant autoencoder formulas and parameters specifically comprise:

h = f_θ(x) = Φ(Wx + b)

where θ = {W, b}, W is the m×n encoder weight matrix, b is the encoder bias vector, and Φ is the encoder activation function, such as linear or a rectified linear unit (ReLU). The hidden representation h is also referred to as the latent representation. The decoder maps the hidden representation h back into a reconstructed vector z ∈ R^n:

z = g_θ′(h) = Φ′(W′h + b′)

where θ′ = {W′, b′}; W′ is the n×m decoder weight matrix; b′ is the decoder bias vector; Φ′ is the decoder activation function, of the same form as Φ. The parameters θ and θ′ are optimized during training of the autoencoder to minimize the average reconstruction error, with the loss function:

L(x^{(i)}, z^{(i)}) = -(y·log(p) + (1-y)·log(1-p))

The purpose of the autoencoder is to reconstruct z such that z^{(i)} ≈ x^{(i)} by minimizing the loss function L(x^{(i)}, z^{(i)}), which can be defined as the mean squared error for continuous data or the cross-entropy for discrete data. In genotype imputation, we minimize the cross-entropy loss between the input x and the reconstruction z, as defined above, because genotype values are discrete.
Step 6-3: construct the decoder: upsample and reconstruct the multi-channel feature vectors, and perform context prediction using the temporal and spatial feature information to obtain the imputed genotypes.
Step 6-4: adopt a multi-class loss function that evaluates the reconstruction loss of the missing region and of the non-missing region simultaneously.
Step 6-5: use max pooling and upsampling in the model architecture. Max pooling is a downsampling process that reduces dimensionality by applying a max filter to non-overlapping sub-regions of the previous layer; upsampling is the opposite of max pooling, increasing dimensionality by repeating data along an axis.
Step 7: segment the genotype data to be predicted according to the initial sliding-window interval of step 6, pass each segment's data set through the model's input layer, extract features of the linkage relationships in the genotype data with the encoder-decoder structure, and obtain a new output data set for each segment. This specifically comprises the following sub-steps:
Step 7-1: divide the final sample set formed in step 5 into five groups of training and test sets under the three hiding modes; divide each group's data set into segments according to the initial sliding-window interval (100) preset in step 6, so that the 10000 sites form 100 segments; train on each segment's data set in turn through the model's input layer, which feeds the training set into the encoder for multi-channel vector compression;
Step 7-2: after the training samples enter the encoder, each sample enters the training model individually, and the adaptive parameters are adjusted through continuous training; the initial data dimension is (200, 1);
Step 7-3: after the feature values of the data are extracted, linear feature extraction is performed by a Conv1D layer with a Linear activation function, capturing the relations among features; the data dimension is (200, 128), with Param = 20608;
Step 7-4: after feature amplification by the MaxPooling1D layer, a Dropout layer further enhances model stability; the data dimension is (100, 128);
Step 7-5: after the Conv1D_1 layer, the nonlinear activation function ReLU increases model robustness; the data dimension is (100, 64), with Param = 327744;
Step 7-6: the data passes through the max_pooling1d_1 layer for another round of feature extraction, and a Dropout layer further enhances model stability; the data dimension is (50, 64);
Step 7-7: after the Conv1D_2 layer, a Linear activation function further extracts model feature parameters; the data dimension is (50, 32), with Param = 81952, and the encoder's feature extraction is complete;
Step 7-8: the data enters the decoding layers; after the Conv1D_3 layer, the ReLU function is used again to increase the model's nonlinear expressiveness; the data dimension is (50, 64), with Param = 81984;
Step 7-9: the data is dimension-amplified by the up_sampling1d layer and then passes through a Dropout layer; the dimension is (100, 64);
Step 7-10: the data passes through the conv1d_4 layer and the second upsampling layer; the dimension changes from the (100, 64) of step 7-9 to (100, 128) and then to (200, 128), with Param = 327808;
Step 7-11: after the final convolution layer, which accelerates model convergence, the model output is mapped to the interval (0, 1) by the SoftMax classification function.
Step 8: discard the edge portion of each output data set obtained in step 7, and splice the groups of data in predicted order to obtain the predicted genotype matrix (the complete predicted genotype data). Extract the predicted and actual genotype values at the hidden sites from the genotype matrix, compare them, and compute the imputation accuracy (the proportion of correctly imputed markers among all imputed markers).
Step 9: repeat steps 7 and 8, optimizing the autoencoder parameters (the number of convolutional layers, the filter size, the number of training cycles, and so on) and the sliding-window interval according to the accuracy obtained each time, to find the optimal parameter model. This optimization specifically comprises the following sub-steps:
Step 9-1: compare the imputation accuracy obtained with the five-layer and seven-layer convolutional framework models, and select the encoding convolution-kernel model built on the seven-layer framework.
Step 9-2: select the filter size of the sparse encoder, where the formula and parameters of the sparse network model specifically comprise:

O(i) = Σ_{u=-⌊k/2⌋}^{⌊k/2⌋} F(u)·I(i+u)

where O(i) is the output at marker i of the input vector I, F is the convolution filter, and k is an odd number denoting the size of the convolution filter. In this study, experiments were performed with odd filter sizes ranging from 5 to 50. The convolution operation in the equation is performed at every position of the input vector I, so that the convolution is applied to every genetic marker:

O(I, n) = Σ_{d=1}^{D} I_d * F_{n,d}

Each convolutional layer consists of n convolution filters, each of depth D, where D is the input depth. The convolution between the input I = {I_1, I_2, ..., I_D} and a set of n convolution filters {F_1, F_2, ..., F_n} produces a set of n activation maps, yielding an output of depth n. Preferably, an autoencoder architecture network with a filter size of 40 is finally adopted.
Step 9-3: determine the optimal model sliding-window interval: according to the accuracy comparisons obtained from repeated experiments, a fixed sliding window of size 200 is finally determined.
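The selection in steps 9-1 through 9-3 can be organized as a simple accuracy-driven search; a sketch under assumed helpers (build_autoencoder and evaluate_imputation are hypothetical stand-ins for the model construction and the repeated experiments above):

```python
# Grid search over odd filter sizes and candidate sliding-window intervals.
best = None
for window in (100, 200, 400):                   # candidate sliding-window sizes
    for k in range(5, 51, 2):                    # odd convolution filter sizes, 5..49
        model = build_autoencoder(kernel_size=k)             # hypothetical builder
        acc = evaluate_imputation(model, window=window)      # hypothetical evaluator
        if best is None or acc > best[0]:
            best = (acc, window, k)
print(best)  # in this embodiment the search settles on window 200, filter 40
```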
Step 10: preprocess the missing genotype data set to be predicted, convert it into the data type matching the model, feed it into the optimized model of step 9 for genotype prediction, and output the predicted genotype values.
In this example, the model uses an L1 regularization function during training, with the formula:

L_reg = L(x, z) + λ·Σ_j ||W_j||_1

where the L1 regularization defined in the equation is applied over all convolution filters W_j and λ is the regularization strength. L1-norm regularization works by imposing penalties on the layer weights during model optimization; these penalties are incorporated into the loss function against which the model is optimized. The L1 norm penalizes small weights, shrinking them toward zero, to improve the robustness of the model.
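In Keras-style code this penalty can be attached to the convolution layers; a sketch assuming the architecture sketched earlier and a placeholder strength λ = 1e-4:

```python
from tensorflow.keras import layers, regularizers

# L1 penalty on the filter weights; Keras adds it to the training loss,
# shrinking small weights toward zero as described above.
x = layers.Conv1D(128, 40, padding='same', activation='linear',
                  kernel_regularizer=regularizers.l1(1e-4))(inp)  # inp from the earlier sketch
```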
As a specific example, the invention is further described with one embodiment. The depth model based on sliding-window sparse convolutional denoising self-encoding comprises the following:
(1) First select chromosome 1 of rice and chromosome 1 of maize respectively, take the gene segments of the selected chromosomes, extract the first 10000 sites, and screen the site information to ensure that the first 10000 sites carry this information; the obtained samples form the initial sample set.
(2) Call the PyVCF toolkit and cyclically read the chromosome gene information of each site one by one; from the parsed record fields CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, and samples, extract the genotype (GT) data in samples as genotype sample data. Convert the genotype sample data into the raw genotype matrix and, for this matrix, re-encode the different genotype values into coded data convenient for training, encoding the variables of the different genotypes as three numbers: for example, the genotype "0|0" is encoded as "1", the genotypes "0|1" and "1|0" are encoded as "2", and the genotype "1|1" is encoded as "3".
(3) Apply one-hot encoding to sparsify the processed numerical samples and organize all gene segments as the numerical data set. Transform the numerical data set by reorganizing sites and individuals, re-encode the different genotype data with custom codes, and convert to matrix format to obtain the organized sample set.
(4) Traverse the maize and rice sample information to obtain the individual counts: 1210 maize individuals and 3240 rice individuals. Save the individual names of the two plant samples to files, read the two name files, call the random() function, and split the individuals at a 9:1 ratio to establish training-set and validation-set individual index tables for rice and maize; save the two index tables to files. After classification, check the individual counts of the training and validation files for the two plants: 2916 and 324 for rice, 1089 and 121 for maize. Using the bcftools tool in a Linux environment, divide the biological data sets according to the generated training-set and validation-set individual files by extracting the individual indices, yielding the training and validation sets. Repeat the above operations five times to obtain five groups of maize and rice training- and validation-set data. Using R data-processing facilities, convert each group's training and validation sets from VCF format to CSV format, optimizing the reading and data handling to improve program speed.
(5) Define three hiding ratios of 10%, 50%, and 90%, corresponding respectively to marker scenarios with low, medium, and high missing rates. Randomly generate hidden sites corresponding to each missing rate according to the three hiding ratios, and save the hiding order of the sites to ensure that the hidden positions processed by the four prediction models are consistent. Hide the gene sites of the training-set and validation-set samples according to the hidden-site order, with the hidden-site values set to "0" per the encoding of step 2-5; carry out five groups of repeated experiments, and organize the training-set and validation-set sample sets formed as in step 5-3 into the final sample set.
(6) Construct the decoder-encoder model of the sparse convolutional autoencoder: select the encoding convolution kernels to build a seven-layer framework, with an initial autoencoder network using a filter size of 20 and a sliding-window size of 100. Determine the encoder mechanism: compress the reference samples and the samples to be imputed into multi-channel feature vectors, using an attention mechanism to extract temporal and spatial information. Construct the decoder: upsample and reconstruct the multi-channel feature vectors and perform context prediction with the temporal and spatial feature information to obtain the imputed genotypes. Adopt a multi-class loss function that evaluates the reconstruction loss of the missing and non-missing regions simultaneously, and group the processed genotype data files by site into batches of 200 sites each, 50 batches in total.
The self-encoding function model is as follows:

h = f_θ(x) = Φ(Wx + b)

where θ = {W, b}, W is the m×n encoder weight matrix, b is the encoder bias vector, and Φ is the encoder activation function, such as Linear or a rectified linear unit (ReLU). The hidden representation h is also referred to as the latent representation. The decoder maps the hidden representation h back into a reconstructed vector z ∈ R^n:

z = g_θ′(h) = Φ′(W′h + b′)

where θ′ = {W′, b′}; W′ is the n×m decoder weight matrix; b′ is the decoder bias vector; Φ′ is the decoder activation function, identical in form to Φ. The autoencoder parameters θ and θ′ are optimized to minimize the average reconstruction error, with the loss function:

L(x^{(i)}, z^{(i)}) = -(y·log(p) + (1-y)·log(1-p))

The parameters θ* and θ′* learned from the data take the following values:

(θ*, θ′*) = argmin_{θ,θ′} (1/N) Σ_{i=1}^{N} L(x^{(i)}, z^{(i)})
(7) The final sample set formed in (5) comprises five groups of training and test sets under the three hiding modes. Each group's data set is divided into segments according to the initial sliding-window interval (100) preset in (6), so the 10000 sites form 100 segments; the first and last segments span 200 sites while the remaining segments span 300 sites, and each segment's data set is trained in turn through the model's input layer. After the training samples enter the encoder, each sample enters the training model individually with an initial data dimension of (200, 1). The data passes through a Conv1D layer with a Linear activation function for linear feature extraction, capturing the relations among features; feature amplification by a MaxPooling1D layer and a Dropout layer further enhance model stability; the Conv1D_1 layer with the ReLU activation function increases model robustness; the max_pooling1d_1 layer performs another round of feature extraction, followed by a Dropout layer; the Conv1D_2 layer with a Linear activation function further extracts model feature parameters, completing the encoder's feature extraction. The data then enters the decoding layers: after the Conv1D_3 layer, ReLU is used again to increase the model's nonlinear expressiveness; the data is dimension-amplified by the up_sampling1d layer; it then passes through the conv1d_4 layer and the second upsampling layer; and after the final convolution layer, which accelerates model convergence, the model output is mapped to the interval (0, 1) by the SoftMax classification function.
(8) Discard the edge portions of each output data set obtained in (7) (that is, of each 300-site segment only the middle 100 sites are kept, and the 100 sites on each side are discarded), and splice the groups of data in predicted order to obtain the predicted genotype matrix (the complete predicted genotype data). Extract the predicted and actual genotype values at the hidden sites from the genotype matrix, splice the data to re-form a new chromosome genotype matrix, of size (121, 10000) for maize and (324, 10000) for rice, compare the predicted and actual values, and compute the imputation accuracy (the proportion of correctly imputed markers among all imputed markers). By continually repeating the experiments, the optimal parameters of the sparse convolution model were finally determined as a seven-layer convolution-kernel autoencoder model with a filter size of 7 and a sliding-window interval of 200.
(9) Preprocess the missing genotype data set to be predicted, convert it into the data type matching the model, feed it into the optimized model of (8) for genotype prediction, and output the predicted genotype values. Finally, the prediction performance of the four methods is obtained by comparing, under identical conditions, the accuracy of the sliding-window sparse convolutional denoising self-encoding training model with that of the KNN, SVD, and SCDA algorithms, as shown in FIG. 4 and FIG. 5.
The above examples show that the invention features a simple model and high accuracy, can be trained under different site counts and hiding ratios, and improves marker imputation accuracy through sparsification and sliding-window processing of the sample data sets. Compared with the traditional genotype imputation methods KNN, SVD, and SCDA, the method shows a clear accuracy advantage at every missing rate. By adopting segmented sliding-window processing, it solves the local-linkage learning problem that other algorithms cannot handle and performs better feature extraction, thereby achieving accurate imputation of missing genotypes.
It should be noted that the above merely illustrates the technical idea of the invention and does not limit its protection scope; it is obvious to those skilled in the art that several improvements and embellishments can be made without departing from the principle of the invention, and these fall within the protection scope of the claims of the invention.

Claims (10)

1. A missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder, characterized by comprising the following steps:
Step 1: first select genome sequence data of biological samples, and preprocess the selected genome data to obtain the required target sequence set;
Step 2: process the raw data: extract data from the original VCF file using the PyVCF toolkit, locate the sample record corresponding to each site, and extract its GT data as genotype sample data; numerically encode the genotype data to obtain a numerical data set;
Step 3: transform the numerical data set by reorganizing sites and individuals, re-encode the different genotype data with custom codes, and convert to matrix format to obtain the organized sample set;
Step 4: divide the sample set obtained in step 3 into a training set and a validation set;
Step 5: hide data in the training-set and validation-set samples at different proportions, converting them into the data sets used by the final training model;
Step 6: construct the decoder and encoder models of the sparse convolutional autoencoder, and select the initial sliding-window interval, neuron sizes, and interval parameters;
Step 7: segment the genotype data to be predicted according to the initial sliding-window interval of step 6, pass each segment's data set through the model's input layer, extract features of the linkage relationships in the genotype data with an encoder-decoder structure, and obtain a new output data set for each segment;
Step 8: discard the edge portions of each output data set obtained in step 7, and splice the groups of data in predicted order to obtain the predicted genotype matrix; extract the predicted and actual genotype values at the hidden sites from the genotype matrix, compare them, and compute the imputation accuracy;
Step 9: repeat steps 7 and 8, optimizing the autoencoder parameters and the sliding-window interval according to the accuracy obtained each time, to find the optimal parameter model;
Step 10: preprocess the missing genotype data set to be predicted, convert it into the data type matching the model, feed it into the optimized model of step 9 for genotype prediction, and output the predicted genotype values.
2. The missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2-1: determine the specific chromosome and site information of the selected plant;
Step 2-2: call the PyVCF toolkit and cyclically read the chromosome gene information of each site one by one;
Step 2-3: from the parsed record fields CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample_indexes, and samples, extract the genotype (GT) data in samples as genotype sample data;
Step 2-4: convert the genotype sample data into the raw genotype matrix, and re-encode the different genotype values of this matrix to define encoded data convenient for training;
Step 2-5: encode the variables of the different genotypes as three numbers;
Step 2-6: apply one-hot encoding to sparsify the numerical samples processed in step 2-5, and organize all gene segments as the numerical data set.
3. The missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder according to claim 2, characterized in that in step 2-5 the encoding is specifically: the genotype "0|0" is encoded as "1", the genotypes "0|1" and "1|0" are encoded as "2", and the genotype "1|1" is encoded as "3".
4. The missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder according to claim 1, characterized in that step 4 comprises the following sub-steps:
Step 4-1: traverse the biological sample information to obtain the number of individuals, and save the individual names of the biological samples to a file;
Step 4-2: read the individual-name file from step 4-1, call the random() function, and split the individuals at a 9:1 ratio to establish a training-set individual-name index table and a validation-set individual index table; save the two index tables to files, and after classification check that the individual counts of the two files are in the right proportion;
Step 4-3: using the bcftools tool in a Linux environment, divide the biological data set according to the two individual files generated in step 4-2 by extracting the individual indices, yielding the training set and validation set;
Step 4-4: repeat the above steps five times to generate training- and validation-set data under five groups of random index tables;
Step 4-5: using R data-processing facilities, convert each group's training and validation sets from VCF format to CSV format, optimizing the reading and data handling to improve program speed.
5. The missing marker imputation method based on a sliding-window sparse convolutional denoising autoencoder according to claim 1, characterized in that step 5 comprises the following sub-steps:
Step 5-1: define three hiding ratios, corresponding to genotype scenarios with low, medium, and high missing rates respectively;
Step 5-2: randomly generate hidden sites corresponding to each missing rate according to the three hiding ratios, and save the hiding order of the sites, so that the hidden positions imputed by the four prediction models are identical;
Step 5-3: hide the gene sites of the training-set and validation-set samples according to the hidden-site order, with the hidden-site values set to one of the codes per the encoding of step 2-5; carry out five groups of repeated experiments for each of the three hiding rates;
Step 5-4: organize the training-set and validation-set sample sets formed in step 5-3 into the final sample set.
6. The method for filling missing marks in a self-encoder based on sliding-window sparse convolution denoising as claimed in claim 1, wherein the step 6 comprises the following sub-steps:
step 6-1, selecting and determining an initial seven-layer sparse convolution frame model and an initial model with a sliding window interval of 100 and a filter of 20;
step 6-2, establishing the encoder mechanism, compressing the reference sample and the sample to be filled into a multi-channel feature vector, and extracting temporal and spatial information with an attention mechanism;
step 6-3, constructing the decoder for decoding, performing upsampling reconstruction on the multi-channel feature vector, and using the temporal and spatial feature information for context prediction to finally obtain the filled genotype;
step 6-4, adopting a multi-class classification loss function that simultaneously evaluates the reconstruction loss of the missing region and the loss of the non-missing region;
step 6-5, using max pooling and upsampling in the model architecture; max pooling is a downsampling operation that reduces dimensionality by applying a max filter to non-overlapping sub-regions of the previous layer; upsampling, conversely, increases dimensionality by repeating data along an axis (both operations are sketched below).
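The two operations can be illustrated with a toy NumPy sketch (pool size 2 and repeat factor 2 are assumed values):

```python
import numpy as np

def max_pool1d(x, pool=2):
    """Downsample: take the max over non-overlapping windows
    (assumes len(x) is a multiple of `pool`)."""
    return x.reshape(-1, pool).max(axis=1)

def upsample1d(x, factor=2):
    """Upsample: increase dimensionality by repeating data along the axis."""
    return np.repeat(x, factor)

x = np.array([1, 3, 2, 5])
print(max_pool1d(x))   # [3 5]
print(upsample1d(x))   # [1 1 3 3 2 2 5 5]
```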
7. The deletion mark filling method based on a sliding-window sparse convolution denoising self-encoder as claimed in claim 6, wherein the formulas and parameters of the autoencoder in step 6-2 are specifically:
h = f_θ(x) = Φ(Wx + b)
where θ = {W, b}, W is the m×n weight matrix of the encoder, b is the bias vector of the encoder, and Φ is the encoder activation function, such as linear or rectified linear unit (ReLU); the hidden representation h is also called the latent representation; the decoder takes the hidden representation h and maps it back into a reconstructed vector in R^n:
z = g_θ′(h) = Φ′(W′h + b′)
where θ′ = {W′, b′}; W′ is the n×m weight matrix of the decoder; b′ is the bias vector of the decoder; Φ′ is the activation function of the decoder, the same as Φ; the autoencoder parameters θ and θ′ are optimized to minimize the average reconstruction error, with the loss function as follows:
L(x^(i), z^(i)) = −(y log(p) + (1 − y) log(1 − p))
The goal of the autoencoder is to reconstruct z by minimizing the loss function L so that z^(i) ≈ x^(i); L(x^(i), z^(i)) can be defined as the mean squared error for continuous data or the cross entropy for discrete data.
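A toy NumPy sketch of these formulas, under the added assumption Φ = Φ′ = sigmoid so the decoder output lies in (0, 1) and the cross-entropy loss applies directly (dimensions and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                                          # x in R^n, h in R^m
W, b = 0.1 * rng.normal(size=(m, n)), np.zeros(m)    # encoder theta = {W, b}
Wp, bp = 0.1 * rng.normal(size=(n, m)), np.zeros(n)  # decoder theta' = {W', b'}

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))         # Phi = Phi' (assumed)

def reconstruct(x):
    h = sigmoid(W @ x + b)      # h = f_theta(x) = Phi(Wx + b), latent representation
    z = sigmoid(Wp @ h + bp)    # z = g_theta'(h) = Phi'(W'h + b')
    return z

def cross_entropy(x, z, eps=1e-9):
    """Reconstruction loss L(x, z) for discrete data."""
    z = np.clip(z, eps, 1 - eps)
    return float(-(x * np.log(z) + (1 - x) * np.log(1 - z)).mean())

x = rng.integers(0, 2, size=n).astype(float)
print(cross_entropy(x, reconstruct(x)))
```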
8. The deletion mark filling method based on a sliding-window sparse convolution denoising self-encoder as claimed in claim 1, wherein step 7 comprises the following sub-steps:
step 7-1, dividing the final sample set formed in step 5 into five groups of training and test sets under the three hiding modes; dividing each group's data set into segments according to the initial sliding-window interval of 100 preset in step 6, so that 10000 sites yield 100 segments; and training on each segment's data in turn through the model's input layer, which feeds the training set into the encoder for multi-channel vector compression;
step 7-2, after the training samples enter the encoder, they pass through the training model one at a time, and the adaptive parameters are adjusted through continued training; the initial data dimension is (200, 1);
step 7-3, after feature values are extracted from the data, linear feature extraction is performed by a Conv1D layer with a linear activation function to capture the relations among features; the data dimension is (200, 128), with Param = 20608;
step 7-4, after the MaxPooling1D layer aggregates features, a Dropout layer further enhances the stability of the model; the data dimension is (100, 128);
step 7-5, after the data passes through the Conv1D_1 layer, the nonlinear activation function ReLU increases the robustness of the model; the data dimension is (100, 64), with Param = 327744;
step 7-6, the data undergoes feature extraction again through the max_pooling1d_1 layer, and a Dropout layer further enhances the stability of the model; the data dimension is (50, 64);
step 7-7, after the data passes through the Conv1D_2 layer, a linear activation function extracts further model feature parameters; the data dimension is (50, 32), with Param = 81952, completing the feature extraction of the encoding stage;
step 7-8, the data enters the decoding stage: after the Conv1D_3 layer, the ReLU function is used again to strengthen the nonlinear characteristics of the model; the data dimension is (50, 64), with Param = 81984;
step 7-9, the data is dimensionally expanded by the up_sampling1d layer and then passes through a Dropout layer; the dimension is (100, 64);
step 7-10, the data passes through the conv1d_4 layer and a second upsampling layer, and the dimension changes from the (100, 64) of step 7-9 to (100, 128) and then to (200, 128), with Param = 327808;
step 7-11, after the data passes through a final Conv1D layer, which accelerates the convergence of the model, the model output is mapped into the interval (0, 1) by the SoftMax classification function (the full layer stack is sketched below).
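A hedged Keras sketch of this layer walkthrough; the kernel sizes (160 for the first Conv1D, 40 elsewhere), the dropout rate, and the 4-class softmax output are our assumptions, chosen so that model.summary() reproduces the quoted dimensions and Param counts:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(200, 1)),
    layers.Conv1D(128, 160, padding="same", activation="linear"),  # (200, 128), Param 20608
    layers.MaxPooling1D(2),                                        # (100, 128)
    layers.Dropout(0.25),
    layers.Conv1D(64, 40, padding="same", activation="relu"),      # (100, 64), Param 327744
    layers.MaxPooling1D(2),                                        # (50, 64)
    layers.Dropout(0.25),
    layers.Conv1D(32, 40, padding="same", activation="linear"),    # (50, 32), Param 81952 - encoder ends
    layers.Conv1D(64, 40, padding="same", activation="relu"),      # (50, 64), Param 81984 - decoder begins
    layers.UpSampling1D(2),                                        # (100, 64)
    layers.Dropout(0.25),
    layers.Conv1D(128, 40, padding="same", activation="linear"),   # (100, 128), Param 327808
    layers.UpSampling1D(2),                                        # (200, 128)
    layers.Conv1D(4, 1, activation="softmax"),                     # per-site genotype probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```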
9. The deletion mark filling method based on a sliding-window sparse convolution denoising self-encoder as claimed in claim 1, wherein the filling precision in step 8 is the ratio of the number of correctly filled markers to the total number of filled markers.
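In code, the metric is simply (a hypothetical helper, evaluated only at the hidden positions):

```python
import numpy as np

def filling_precision(truth, imputed, hidden_positions):
    """Ratio of correctly filled markers to the total number of filled markers."""
    truth = np.asarray(truth).ravel()
    imputed = np.asarray(imputed).ravel()
    hits = (truth[hidden_positions] == imputed[hidden_positions]).sum()
    return hits / len(hidden_positions)
```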
10. The deletion mark filling method based on a sliding-window sparse convolution denoising self-encoder as claimed in claim 1, wherein step 9, optimizing the autoencoder parameters and the sliding-window interval and finding the optimal parameter model, specifically comprises the following sub-steps:
step 9-1, comparing the filling precision obtained by the five-layer and seven-layer convolution framework models, and preferentially selecting the encoding convolution kernel model built on the seven-layer framework;
step 9-2, selecting the filter size of the sparse encoder, wherein the formulas and parameters of the sparse network model are specifically:
O(i) = (I * F)(i) = Σ_{u = −(k−1)/2}^{(k−1)/2} I(i + u) F(u)
where O(i) is the output for the i-th marker of the input vector I, F is the convolution filter, and k is an odd number representing the size of the convolution filter; the convolution in the equation is evaluated at each position of the input vector I, so that the convolution operation is applied to each genetic marker:
O_j(i) = Σ_{d=1}^{D} (I_d * F_{j,d})(i),  j = 1, …, n
each convolutional layer consists of n convolution filters, each filter having depth D, where D is the input depth; convolution between the input I = {I1, I2, …, ID} and the set of n convolution filters {F1, F2, …, Fn} produces a set of n activation maps, i.e., an activation volume of depth n;
step 9-3, determining the optimal sliding-window interval of the model: fixing the sliding-window interval at the size that gives the best precision in the comparisons from the repeated experiments (the per-marker convolution of step 9-2 is sketched below).
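A sketch of the per-marker convolution of step 9-2, with zero padding at the borders assumed so that every genetic marker gets an output:

```python
import numpy as np

def marker_conv(I, F):
    """Apply filter F (odd length k) at every position of marker vector I:
    O(i) = sum_u I(i+u) * F(u), borders zero-padded (an assumption)."""
    k = len(F)
    assert k % 2 == 1, "filter size k must be odd"
    pad = k // 2
    Ip = np.pad(np.asarray(I, dtype=float), pad)
    return np.array([np.dot(Ip[i:i + k], F) for i in range(len(I))])

I = np.array([1, 2, 3, 2, 1])       # coded genotypes for one window
F = np.array([0.25, 0.5, 0.25])     # k = 3 filter (illustrative weights)
print(marker_conv(I, F))            # one output per genetic marker
```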
CN202210384234.5A 2022-04-13 2022-04-13 Deletion mark filling method based on sliding window sparse convolution denoising self-encoder Pending CN114864004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384234.5A CN114864004A (en) 2022-04-13 2022-04-13 Deletion mark filling method based on sliding window sparse convolution denoising self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384234.5A CN114864004A (en) 2022-04-13 2022-04-13 Deletion mark filling method based on sliding window sparse convolution denoising self-encoder

Publications (1)

Publication Number Publication Date
CN114864004A true CN114864004A (en) 2022-08-05

Family

ID=82631927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384234.5A Pending CN114864004A (en) 2022-04-13 2022-04-13 Deletion mark filling method based on sliding window sparse convolution denoising self-encoder

Country Status (1)

Country Link
CN (1) CN114864004A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116476A (en) * 2023-07-04 2023-11-24 中国医学科学院阜外医院 Downstream task prediction method and device and computer readable storage medium
CN117116476B (en) * 2023-07-04 2023-12-19 中国医学科学院阜外医院 Downstream task prediction method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination