CN117095747B

CN117095747B - Method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model

Info

Publication number: CN117095747B
Application number: CN202311095082.8A
Authority: CN
Inventors: 王健; 胡海飞; 赵均良; 聂帅; 马雅美; 董景芳; 杨梯丰; 杨武; 周炼; 陈建松
Original assignee: Rice Research Institute Guangdong Academy Of Agricultural Sciences
Current assignee: Rice Research Institute Guangdong Academy Of Agricultural Sciences
Priority date: 2023-08-29
Filing date: 2023-08-29
Publication date: 2024-04-30
Anticipated expiration: 2043-08-29
Also published as: CN117095747A

Abstract

The invention discloses a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model. The method comprises: aligning high-depth population sequencing data to the linear pan-genome; using the coverage of each sample's second-generation sequencing reads to determine, window by window, how completely each window is covered, and recording the window position and the number of reads that completely cover the window; constructing an artificial intelligence model and training it on simulated data to judge whether a region spanned by consecutive windows contains an inversion or transposon endpoint; scanning all regions of one chromosome with the model to obtain all inversion or transposon endpoints on that chromosome, scanning the remaining chromosomes in turn, and summarizing the endpoints on all chromosomes as the result for one sample; and collating and screening a population-level inversion or transposon endpoint genotype matrix from the results of the different samples. The invention thereby detects transposon and inversion endpoint genotypes using second-generation sequencing data.

Description

Method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model

Technical Field

The invention relates to the technical field of genotype detection in natural populations, and in particular to a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model.

Background

In population genetics, obtaining population genotypes is the cornerstone of population research. Traditional population genotyping consists of single nucleotide polymorphism (SNP) detection and insertion/deletion (InDel) detection. Studies have shown that most phenotypic variation is caused by structural variation, and structural variation (SV) detection has developed in recent years. Structural variations include inversions, translocations, duplications, and large-fragment insertions or deletions. Third-generation sequencing remains expensive, which greatly limits the mining of structural variation. The price of second-generation sequencing keeps falling, but its read length is short, which makes structural variation hard to analyze, in particular inversions and the translocations and duplications caused by transposon jumping.

When a sequence in the genome is inverted, the sequence is still present in the genome, so aligning short second-generation reads to a reference genome rarely reveals the inversion. At present, inversions are mainly located by comparing multiple assembled high-quality genomes, but obtaining such genomes is very costly, so the detected inversions are limited to a few well-studied species and a limited number of individuals. Transposons can be active in the genome and may be present in multiple copies, so short reads originating within a transposon may align to several sites at once, even though a given sample may contain only one copy.

A large amount of public sequencing data is now available for many species, but research on population-level transposons and inversions has progressed slowly because of the short read length of second-generation sequencing. How to provide a method that detects transposons and inversions in a population, and then derives genotypes for subsequent research, is therefore a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, so as to solve the problems described in the background and to detect transposon and inversion endpoint genotypes using second-generation sequencing data.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The invention provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, which comprises the following steps:

Step one: using alignment software to align high-depth population sequencing data to a linear pan-genome;

Step two: according to the coverage of the second-generation sequencing reads of each sample in the population, detecting by a sliding-window approach how completely each window is covered by sequencing reads, recording the window position information and the number of reads completely covering the window, and generating a data frame file;

Step three: constructing an artificial intelligence model (an artificial neural network model) and training it on simulated data so that, from a fixed number of consecutive window positions and the read counts obtained in step two, it judges whether the region (the area spanned by the consecutive windows) contains an inversion or transposon endpoint;

Step four: scanning all regions of one chromosome with the model of step three to obtain the positions of all inversion or transposon endpoints on that chromosome, scanning the remaining chromosomes in turn, and summarizing the inversion or transposon endpoint information of all chromosomes as the result for one sample;

Step five: collating a population-level inversion or transposon endpoint genotype matrix from the inversion or transposon endpoint information of the different samples.

In step one, the sequencing depth of the high-depth population sequencing data is greater than 20×.

In step two, a sliding-window approach is used to detect how completely each window is covered by sequencing reads. A window is a 39 bp region of the genome, and each window is named by combining its chromosome with the middle position of the window (chromosome positions run as on a number line, smaller to the left and larger to the right). The window step is 20 bp.
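The window layout just described can be sketched in a few lines of Python. This is a minimal illustration, not the patented program: the function name `iter_windows`, the 1-based inclusive coordinates, and the `chromosome_middle` naming format are assumptions made here; only the 39 bp window size and the 20 bp step come from the text.

```python
from typing import Iterator, Tuple

WINDOW_SIZE = 39  # bp, from the text
STEP = 20         # bp, from the text


def iter_windows(chrom: str, chrom_length: int,
                 size: int = WINDOW_SIZE, step: int = STEP
                 ) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, left, right) for every full window on one chromosome.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    The name joins the chromosome and the middle base of the window.
    """
    left = 1
    while left + size - 1 <= chrom_length:
        right = left + size - 1
        middle = left + size // 2  # the 20th base of a 39 bp window
        yield f"{chrom}_{middle}", left, right
        left += step


# A toy 100 bp chromosome gives four full windows.
windows = list(iter_windows("chr1", 100))
for name, left, right in windows:
    print(name, left, right)
```

With a 39 bp window and a 20 bp step, neighbouring windows overlap by 19 bp, so every base is seen by at least one window and most bases by two.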

In step two, a sequencing read completely covers a window when the leftmost position of the read lies to the left of the window (the left position of the read is smaller than the left position of the window), and the left position of the read plus the number of bases matched to the genome plus the number of detected deleted bases is greater than the right position of the window.
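The complete-coverage rule can be written as a small predicate over a read's left position and its CIGAR string. The sketch below is illustrative only: the helper names `reference_span` and `fully_covers` are hypothetical, and treating the CIGAR operations `M`, `=`, `X` (matched bases) and `D` (deleted bases) as the quantities named in the text is an assumption about how the original program reads the alignment.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar: str) -> int:
    """Bases matched to the genome (M, =, X) plus deleted bases (D)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=XD")


def fully_covers(read_left: int, cigar: str,
                 win_left: int, win_right: int) -> bool:
    """Apply the rule from the text with strict inequalities on both sides."""
    read_right_bound = read_left + reference_span(cigar)
    return read_left < win_left and read_right_bound > win_right


# A 150 bp read starting at 100 spans the window 120..158.
print(fully_covers(100, "150M", 120, 158))      # True
# A read soft-clipped after 30 bp stops short of the window's right edge.
print(fully_covers(100, "30M120S", 120, 158))   # False
```

Soft-clipped (`S`) and inserted (`I`) bases do not advance the position on the reference, which is why a read clipped at a breakpoint fails the test even though the read itself is 150 bp long.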

In step two, recording the window position and the number of sequencing reads that completely cover the window yields a data frame with two columns: the first column is the window position and the second is the number of reads completely covering the window. The data frame is stored in a csv file. Because all samples are aligned to the same linear genome, the windows are defined on that genome, and the window size and step are the same for every sample, the csv files of all samples have the same number of rows and identical first columns; only the read counts in the second column differ.

As a preferred form of the above technical solution, the artificial neural network model in step three is a network of 4 fully connected layers; the activation function of the last layer is a sigmoid function, and the other layers use ReLU. The input of the model is an array of length 25 derived from the data frame csv file of step two: a sliding window is applied to the second column of the csv file with a window size of 25 and a step of 22, so each window corresponds to a genome sequence of length 39 bp + (25 - 1) × 20 bp = 519 bp. The output of the model is a single value used to determine whether the 519 bp region contains an inversion or transposon endpoint.
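As a worked check of the region length, let w = 39 bp be the window size, s = 20 bp the window step, n = 25 the number of consecutive windows per model input, and m = 22 the step of the second sliding window. These symbols are introduced here for illustration only and do not appear in the original text.

$$
L = w + (n - 1)\,s = 39 + (25 - 1) \times 20 = 519\ \text{bp}
$$

$$
\Delta = m\,s = 22 \times 20 = 440\ \text{bp}, \qquad L - \Delta = 519 - 440 = 79\ \text{bp}
$$

Consecutive 519 bp regions therefore start 440 bp apart and overlap by 79 bp, which is the span of the 25 - 22 = 3 windows they share (39 + 2 × 20 = 79 bp).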

As a preferred form of the above technical solution, the data frame csv file of step two is preprocessed before being input into the model: a value less than or equal to 3 in the second column of the csv file is set to 0, indicating that no reads completely cover the window, and a value greater than 3 is set to 1, indicating that reads completely cover the window.
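The preprocessing rule amounts to a one-line threshold. A minimal sketch follows; the function name `binarize` is hypothetical, while the threshold of 3 is the value stated in the text.

```python
def binarize(counts, threshold=3):
    """Map read counts to 1 (count > threshold) or 0 (count <= threshold)."""
    return [1 if count > threshold else 0 for count in counts]


# Counts of 0 and 3 become 0; counts of 4 and 27 become 1.
print(binarize([0, 3, 4, 27]))
```

Exposing the threshold as a parameter matches the remark in Example 1 that it may be raised for deeper sequencing.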

As a preferred form of the above technical solution, the model in step three is trained on simulated data. The simulated data are generated artificially and include cases that clearly contain an inversion or transposon endpoint (for example, ten consecutive 1s, a 0 at the 11th position, and 1s at positions 12 to 25) and cases that clearly do not (for example, fifteen 0s followed by ten 1s). A data set pairing each feature value with its label is created for model training.

In step four, chromosomes and samples are processed separately; the computations are independent of one another and can run in parallel.

Compared with the prior art, the technical scheme has the following technical effects:

1) The invention introduces a new way of measuring coverage: the number of sequencing reads that completely cover a region. Previously, the coverage of a region was measured by counting the reads covering each single-base site and then averaging those per-base counts over the region. That existing method cannot detect an inversion position or a transposon endpoint, because such an endpoint is in practice a single site. After sequencing data are aligned to a reference genome, aligned reads extend 1-10 bp across the endpoint, so the endpoint and the neighboring bases are still covered by multiple reads. The mean coverage of a region containing the endpoint is therefore non-zero and similar to that of any continuously covered region without an endpoint, and the endpoint goes undetected. The invention instead counts the reads that completely cover a 39 bp window, sliding the window with a step of 20 bp. This removes the effect of mismatched read ends, and if an endpoint exists, the number of reads completely covering the 39 bp window that contains it is detected as 0.

2) The invention studies inversions and transposons at the population level. It searches for their endpoints (also called breakpoints in the invention) by combining second-generation sequencing data with a linear genome, which yields richer endpoint position information. Previously, genomes were assembled from third-generation sequencing and inversion or transposon positions were obtained by comparing high-quality genomes; this is very costly, and because the number of high-quality genomes is limited, so is the number of inversions or transposons found. The invention can make use of the large volume of published second-generation sequencing data, which is a clear advantage.

3) The invention applies an artificial intelligence model to identify inversion or transposon endpoints (breakpoints). Such a model can extract abstract features, handle complex cases, and identify endpoints accurately. When sequencing data from material containing an endpoint are aligned to the linear pan-genome, the alignment has a distinctive signature: the endpoint (the 39 bp interval containing it) is not completely covered by any read, while the regions to its left and right are each completely covered by multiple reads. Traditional methods are better suited to judging a single position or region (one value). Judging an endpoint requires combining several values whose order affects the result, which traditional methods handle poorly. Artificial intelligence models are well suited to this kind of nonlinear feature-extraction problem.

4) The computation for each window is independent, so multithreaded parallel execution can be used to shorten the program's running time.
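Technical effect 1) can be illustrated with a toy simulation. The sketch below places a breakpoint at position 500 and builds two groups of hypothetical 150 bp reads, one ending up to 5 bp past the breakpoint and one starting up to 5 bp before it. All coordinates, read positions, and function names are assumptions chosen for illustration; reads are (start, end) pairs with inclusive ends.

```python
def mean_depth(reads, left, right):
    """Average per-base depth over the window, the conventional measure."""
    total = 0
    for pos in range(left, right + 1):
        total += sum(1 for start, end in reads if start <= pos <= end)
    return total / (right - left + 1)


def full_cover_count(reads, left, right):
    """Number of reads that start left of the window and end right of it."""
    return sum(1 for start, end in reads if start < left and end > right)


# Reads from the sequence left of the breakpoint end at 490..505.
left_reads = [(end - 149, end) for end in range(490, 506, 3)]
# Reads from the sequence right of the breakpoint start at 495..510.
right_reads = [(start, start + 149) for start in range(495, 511, 3)]
reads = left_reads + right_reads

window = (481, 519)  # 39 bp window containing the breakpoint at 500
flank = (441, 479)   # 39 bp window entirely left of the breakpoint

print(mean_depth(reads, *window))        # non-zero: the endpoint looks covered
print(full_cover_count(reads, *window))  # no read spans the breakpoint window
print(full_cover_count(reads, *flank))   # the flank is spanned by every left read
```

The mean depth of the breakpoint window stays well above zero, while the count of reads spanning the whole window drops to zero, which is the signal the method relies on.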

Drawings

To illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of the method of the present invention for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model;

FIG. 2 is a schematic diagram of the reads of the present invention fully covering a certain area;

FIG. 3 is a schematic diagram of the windowing method of the present invention for detecting genome-level coverage (complete coverage);

FIG. 4 is a block diagram of a training data set generation and artificial intelligence model of the present invention;

FIG. 5 is a schematic diagram of a windowing method in accordance with the present invention transforming a data structure for an artificial intelligence model.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of protection of the invention.

Example 1

Referring to figs. 1-4, this embodiment provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, comprising:

A. Statistics of genome-level coverage of second-generation sequencing data

A1. Extract DNA from the population samples.

A2. The DNA is randomly fragmented by ultrasound, a library is constructed, and high-depth sequencing (20× or more) is performed using second-generation DNA sequencing technology. The current read length of second-generation sequencing data is 150 bp.

A3. The high-depth sequencing data of each sample are aligned separately to the linear pan-genome using alignment software. The alignment software can be BWA or SAMtools, which have no Chinese names and are referred to directly in English in the industry; they are used to extract the target sequences.

A4. Taking 39 bp as a window, count the number of reads that completely cover the window. Here, reads are the sequences generated during second-generation sequencing, each a 150 bp base sequence; the term is used directly in English in the industry.

As shown in fig. 2, if the leftmost end of a read is to the left of the window and the rightmost end of the read is to the right of the window, the read completely covers (spans) the window, and the count of reads completely covering that window is increased by 1. In practice, the rightmost end of the read is taken as its leftmost position plus the number of bases matched to the linear pan-genome plus the number of detected deleted bases.

A5. Slide a window along the linear genome with a window size of 39 bp and a step of 20 bp, and determine for each 39 bp window whether it is completely covered by reads. The window position and the number of completely covering reads are combined into a two-column data frame. To reduce the cost of subsequent computation, the number of completely covering reads is compared with a threshold (3 in this example; it can be raised appropriately for deeper sequencing): if the number is greater than 3, the value is 1, indicating that the 39 bp window is completely covered by reads; if it is 3 or less, the value is 0, indicating that the window is not completely covered. The window positions and the 0/1 coverage values together form a two-column data frame (fig. 3).

B. Training the artificial intelligence model to identify inversion or transposon endpoints (breakpoints)

B1. Create a training set. Read coverage at an inversion or transposon endpoint (breakpoint) has a distinctive signature: the region before the breakpoint is completely covered by many reads, and so is the region after it, but the 39 bp window containing the breakpoint cannot be completely covered by any read. Thus, in the data frame generated in A5, if a region (519 bp, containing 25 small windows) shows a run of consecutive 1s, then a single 0, then another run of 1s, the region contains a breakpoint. Visualization of third-generation sequencing data from real genomes shows that a breakpoint is usually a single base position, but it can also be several consecutive bases or a short sequence. Therefore, a run of 1s followed by 1 to 10 0s and then all 1s also indicates that the region contains a breakpoint. Because of the sliding-window approach, the 0(s) may appear at different positions in the array, so the situation is complex and hard to judge with conventional algorithms.

Create all data sets that contain a breakpoint (fig. 4): the number of windows is set to 25, and the feature value is an array of 25 numbers. Simulate the case of one 0, which may occur at any position in the array (25 numbers); simulate the case of two 0s, which may occur anywhere in the middle of the array; and so on. Create all data sets that do not contain a breakpoint: when one or more 0s appear at the leftmost or rightmost end of the array, the window is the boundary of a large-fragment insertion or deletion; if the array contains more than ten 0s, the window contains an insertion or deletion.

The 25 consecutive 0/1 values are the feature value, the presence or absence of a breakpoint (1 or 0) is the label value, and each feature value is paired with its label value to form the data set.
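The simulation rules of B1 can be sketched as an exhaustive enumeration. This is one possible reading, not the original generator: it assumes a single contiguous run of 0s per array, labels the array 1 only when the run is at most 10 long and touches neither end, and labels everything else 0. The function names `make_example` and `build_dataset` are hypothetical.

```python
def make_example(start, run_len, n=25):
    """Array of n ones with run_len zeros beginning at index start (0-based)."""
    return [0 if start <= i < start + run_len else 1 for i in range(n)]


def build_dataset(n=25, max_gap=10):
    """Enumerate every single-run pattern and label it by the rules of B1."""
    features, labels = [], []
    for run_len in range(1, n + 1):
        for start in range(0, n - run_len + 1):
            touches_edge = start == 0 or start + run_len == n
            is_breakpoint = run_len <= max_gap and not touches_edge
            features.append(make_example(start, run_len, n))
            labels.append(1 if is_breakpoint else 0)
    # A fully covered region has no breakpoint.
    features.append([1] * n)
    labels.append(0)
    return features, labels


features, labels = build_dataset()
print(len(features), sum(labels))
```

In this reading, ten 1s followed by a single 0 and then fourteen 1s is a positive example, while fifteen leading 0s followed by ten 1s is a negative one. Real data may contain more than one run of 0s per array, which this sketch does not cover.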

B2. Build the artificial intelligence model: an artificial neural network with 3 hidden layers and 1 output layer. The hidden layers have 64, 16, and 4 nodes respectively, with ReLU as the activation function; the output layer outputs a single value through a sigmoid function. The model input is an array of 25 feature values, matching the training set.

B3. Train the artificial intelligence model on the training set. The data set is fed into the model, the feature values are combined with the model parameters, and the result is output through the multi-layer network. The output is compared with the true label value, and the model parameters are updated by a gradient descent algorithm so that the difference between the output and the true label value keeps shrinking. Multiple groups of data are input and training is repeated 4000 times, after which the model identifies the presence of a breakpoint accurately (99.9%).
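The network shape described in B2 (25 inputs, hidden layers of 64, 16, and 4 nodes with ReLU, and one sigmoid output) can be written out as a forward pass in plain Python. This sketch only shows the architecture: the weights are random and untrained, the initialization range is an arbitrary choice, the gradient-descent training of B3 is omitted, and the helper names are hypothetical.

```python
import math
import random

LAYER_SIZES = [25, 64, 16, 4, 1]  # input, three hidden layers, output


def init_params(seed=0):
    """Random small weights and zero biases for each fully connected layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(x, params):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    activation = list(x)
    last = len(params) - 1
    for idx, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if idx == last:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activation = [max(0.0, v) for v in z]
    return activation[0]


params = init_params()
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
score = forward([1] * 10 + [0] + [1] * 14, params)
print(n_params, score)
```

The network has 2777 trainable parameters, so it is small enough to train quickly on an enumerated data set; after training, an output near 1 would be read as "breakpoint present" and an output near 0 as "absent".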

C. Identifying population inversion or transposon end points (breakpoints) using artificial intelligence models

C1. For each sample, step A5 produces a data matrix representing whole-genome coverage (the first column is the window position; the second column indicates whether the window is completely covered). This matrix is windowed again, taking 25 values as a window with a step of 22 (fig. 4). The data matrix is thereby converted into a matrix with 25 feature values per row (25 columns; each row is named after the interval spanned by its 25 values).

During windowing, every group of 25 small windows must lie on the same chromosome. If the last group on a chromosome cannot contain 25 small windows (for example, only 10 values remain at the end of chromosome 1), that group is discarded.

C2. Each row of 25 values in the data matrix is input into the artificial intelligence model, which outputs whether the region (the interval spanned by the 25 small windows) contains a breakpoint.

C3. Collate the model output so that position information and breakpoint information correspond, order the results by chromosome, and write them to a file; each sample produces one result file.
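The regrouping of step C1 can be sketched as a second sliding window that never crosses a chromosome boundary and drops an incomplete trailing group. The function name `regroup`, the (chromosome, window name, value) input layout, and the `first-last` row naming are assumptions made for this illustration.

```python
def regroup(rows, size=25, step=22):
    """Turn per-window values into rows of `size` values, chromosome by chromosome.

    rows: list of (chrom, window_name, value) tuples in genome order.
    Returns a list of (region_name, [values]) pairs.
    """
    by_chrom = {}
    for chrom, name, value in rows:
        by_chrom.setdefault(chrom, []).append((name, value))

    out = []
    for items in by_chrom.values():
        # The range stops early, so a trailing group shorter than `size` is dropped.
        for start in range(0, len(items) - size + 1, step):
            chunk = items[start:start + size]
            region = f"{chunk[0][0]}-{chunk[-1][0]}"
            out.append((region, [value for _, value in chunk]))
    return out


# Toy input: 60 windows on chr1 and only 10 on chr2.
rows = [("chr1", f"chr1_{20 + 20 * i}", 1) for i in range(60)]
rows += [("chr2", f"chr2_{20 + 20 * i}", 1) for i in range(10)]
regions = regroup(rows)
for name, values in regions:
    print(name, len(values))
```

With a step of 22 and a size of 25, neighbouring rows share three windows, so a breakpoint that falls at the edge of one row sits in the interior of the next.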

D. Merging and screening of breakpoint data

D1. Because all samples use the same linear pan-genome as the reference genome, the first column of the step A5 output file is identical across samples and contains the position of each window on the chromosome. After the windowing of C1, the output files of all samples still have the same number of rows and the same row names.

D2. The breakpoint files of all samples are merged side by side, one column per sample.

D3. The merged file is a matrix whose rows are breakpoint positions on the chromosomes (each a 519 bp region) and whose columns are sample names. This matrix is the preliminary breakpoint genotype. Each row is screened by minor allele frequency (MAF) across samples, and rows with a MAF below 0.05 are removed. The screened breakpoint genotypes are written to a file.
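Steps D2 and D3 can be sketched with plain dictionaries. The sketch assumes each sample contributes one 0/1 call per region, that all samples share the same regions in the same order, and that the allele frequency of a region is simply the fraction of samples carrying the breakpoint; the function names and the toy sample and region names are hypothetical.

```python
def merge_samples(per_sample):
    """per_sample: {sample_name: {region: 0 or 1}} -> (samples, {region: [calls]})."""
    samples = sorted(per_sample)
    regions = list(per_sample[samples[0]])
    matrix = {region: [per_sample[s][region] for s in samples]
              for region in regions}
    return samples, matrix


def maf_filter(matrix, min_maf=0.05):
    """Remove rows whose minor allele frequency is below min_maf."""
    kept = {}
    for region, calls in matrix.items():
        freq = sum(calls) / len(calls)
        maf = min(freq, 1 - freq)
        if maf >= min_maf:
            kept[region] = calls
    return kept


# Toy population of 40 samples and three regions.
per_sample = {
    f"S{i:02d}": {
        "chr1_20-chr1_500": int(i == 0),     # breakpoint in 1 of 40 samples
        "chr1_460-chr1_940": int(i < 10),    # breakpoint in 10 of 40 samples
        "chr1_900-chr1_1380": 1,             # breakpoint in every sample
    }
    for i in range(40)
}
samples, matrix = merge_samples(per_sample)
filtered = maf_filter(matrix)
print(list(filtered))
```

Only the region carried by 10 of 40 samples survives: the rare call (frequency 0.025) and the fixed call (minor frequency 0) both fall below the 0.05 threshold.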

The principle of application of the invention is described in detail below with reference to the attached drawings:

Example 2 Acquisition of genome-level read coverage for a single sample based on high-depth sequencing

DNA was extracted from leaves of rice population material using a chemical detergent method (the CTAB method, cetyltrimethylammonium bromide), and high-depth sequencing was performed on an Illumina NovaSeq 6000 platform, producing approximately 20 Gb of data per sample. With a rice genome size of 400 Mb, the sequencing depth per sample is 50×. This example uses sample IRRI2K_91, whose compressed paired-end sequencing data are 6.8 Gb (R1.fq.gz) and 7.0 Gb (R2.fq.gz).

The paired-end sequencing data of the sample were aligned to the rice linear reference genome using BWA to obtain an alignment file in BAM format. Picard was then used to sort the BAM file, mark duplicates, and build an index; the resulting BAM file is about 8.1 Gb.

The in-house program a0_read_one_bam_buffer_depth1.09.py is run with the complete-coverage window size parameter set to 39 bp in this example and the step parameter set to 20 bp. The program takes a BAM file as input and outputs a csv file containing two comma-separated columns: the first column records the position of each window on the chromosome, and the second records the number of reads completely covering that window. The output file IRRI2K_91absolute_depth1.09.csv is 347 MB and contains 21,771,459 rows in total.

Example 3 Training the artificial intelligence model to discriminate inversion or transposon endpoints (breakpoints)

The in-house program a1_create_tracking_data.py is run with the number of windows set to 25 and the maximum breakpoint length set to 10 consecutive 0s. The program generates a training set and writes it to a csv file containing multiple rows of feature values and their corresponding label values.

An in-house program uses Python to call the TensorFlow module and define the network structure of the artificial intelligence model: 3 hidden layers with 64, 16, and 4 nodes and ReLU activation, and an output layer that outputs a single value through a sigmoid function. The model input is an array of 25 feature values, matching the training set generated by the program above. The program feeds the training set into the model, with the number of training cycles set; after 4000 cycles the model predicts the breakpoints of the training set with an accuracy above 99.9%, and the trained model is saved.

Example 4 Detection of inversion or transposon endpoints (breakpoints) with the artificial intelligence model

The file IRRI2K_91absolute_depth1.09.csv is processed with the in-house program a2_depth_shape_cast.py, with the minimum coverage threshold set to 3: values in the second column that are less than or equal to 3 are converted to 0, and values greater than 3 are converted to 1. Using the windowing method, the program converts the two-column data into matrix data with 25 feature values per row (fig. 4) and writes it to the file IRRI2K_91absolute_depth_shape25.csv. The in-house program a3_detecting_break_point.py then loads the trained model, tests each row of IRRI2K_91absolute_depth_shape25.csv for a breakpoint, and writes the position information and prediction results to a file IRRI2K_91_break_point.csv.

The same procedure was applied in turn to the other samples in the population, giving one IRRI2K_*_break_point.csv file per sample. Because the processing of one sample does not depend on any other sample, this step can be run as parallel processes to speed up the analysis. Since all samples use the same linear pan-genome as the reference genome, the position information in the final IRRI2K_*_break_point.csv files is identical across samples. All IRRI2K_*_break_point.csv files are read with the pandas module in Python and combined into a genotype matrix whose columns are the sample names, which is written to a file. The genotype matrix is then screened by minor allele frequency (MAF) using the in-house program a4_filter_position_break_point.py to obtain the screened population inversion or transposon endpoint genotypes.

Thirty positions with detected breakpoints were randomly selected from the screened genotypes, and the 30 regions were inspected manually using the IGV (Integrative Genomics Viewer) visualization software. All 30 positions were fully consistent with the characteristics of inversion or transposon endpoints.

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, comprising the steps of:

Step two: according to the coverage of the second-generation sequencing reads of each sample in the population, detecting by a sliding-window approach how completely each window is covered by sequencing reads, recording the window position information and the number of sequencing reads completely covering the window, and generating a data frame file; the window is a 39 bp region on the genome, and the window position is named by combining the chromosome on which the window lies with the middle position of the window; the step of the sliding window is 20 bp; complete coverage means that the leftmost end of the sequencing read lies to the left of the window, and that the left position of the sequencing read plus the sum of the number of bases matched to the genome and the number of detected deleted bases is greater than the right position of the window;

Step three: constructing an artificial intelligence model and training it on simulated data so that, from a fixed number of consecutive window positions and the read counts obtained in step two, it judges whether the region spanned by the consecutive windows contains an inversion or transposon endpoint;

Step five: collating and screening a population-level inversion or transposon endpoint genotype matrix based on the inversion or transposon endpoint information of the different samples.

2. The method according to claim 1, characterized in that: in step one, the sequencing depth of the high-depth population sequencing data is greater than 20×.

3. The method according to claim 1, characterized in that: in step two, the window position information and the number of sequencing reads completely covering the window are recorded to obtain a data frame containing two columns of data, wherein the first column of the data frame is the window position, the second column is the number of reads completely covering the window, and the data frame is stored in a csv file.

4. The method according to claim 3, characterized in that: the artificial intelligence model in step three is an artificial neural network model comprising a network of 4 fully connected layers, wherein the activation function of the last layer is a sigmoid function and the other activation functions are ReLU functions; the input of the model is an array of length 25 derived from the data frame csv file of step two;

a sliding-window operation is applied to the second column of the csv file, with a window size of 25 and a step of 22, so that each window corresponds to a genome sequence of length 39 bp + (25 - 1) × 20 bp = 519 bp; the output of the model is a single value used to determine whether the 519 bp region contains an inversion or transposon endpoint.

5. The method according to claim 4, wherein: the data frame csv file of step two is preprocessed before being input into the model; a value less than or equal to 3 in the second column of the csv file is set to 0, indicating that no reads completely cover the window; a value greater than 3 is set to 1, indicating that reads completely cover the window.

6. The method according to claim 4, wherein: the model in step three is trained on simulated data as follows:

The simulated data are generated artificially and comprise cases that clearly contain an inversion or transposon endpoint and cases that clearly do not; a data set pairing each feature value with its label is created for model training.

7. The method according to claim 1, characterized in that: in step four, chromosomes and samples are processed separately, in mutually independent parallel operations.