CN114881131A - Biological sequence processing and model training method - Google Patents

Biological sequence processing and model training method Download PDF

Info

Publication number
CN114881131A
CN114881131A (application CN202210446243.2A)
Authority
CN
China
Prior art keywords
data
biological
gene
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210446243.2A
Other languages
Chinese (zh)
Inventor
明朝燕
陈湘竣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN202210446243.2A
Publication of CN114881131A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention provides a biological sequence processing and model training method comprising the following steps: S1, acquiring biological gene sequence data and integrating the data; S2, preprocessing the data: traversing the read biological gene sequences and retaining those that meet the requirements; S3, constructing the data set required for model training and fine-tuning it according to the per-class data counts so that every class in the data set has an approximately equal scale; S4, balancing the amount of data and the gene sequence lengths in the data set to obtain a training set; S5, training a model equipped with a reverse-complement network on the training set. On the basis of reaching an accuracy similar to that of traditional gene classification and identification methods, the method saves time, and it can correctly predict genes that some traditional biological methods fail to classify.

Description

Biological sequence processing and model training method
Technical Field
The invention relates to the technical field of computer processing of biological gene sequences, and in particular to a biological sequence processing and model training method.
Background
The pneumonia epidemic caused by the novel coronavirus has threatened human health and safety, yet the novel coronavirus is only one of the many viruses that have been common among humans throughout history; viruses currently rampant worldwide include influenza viruses, HIV, and hepatitis viruses. Viruses have always existed in the world and keep evolving alongside human development, yet these minute pathogens were not recognized until the late 19th century. Invisible to the naked eye, they continually affect human health.
For a newly discovered virus, traditional biological methods must be used to identify its source, which consumes substantial resources. Because the results of traditional methods rest mainly on comparison with data in existing databases, classification is inaccurate for special viruses that have no similar records in the database. Deep learning, by contrast, has a wide range of application scenarios, such as computer vision, natural language processing, and speech analysis.
Viral gene data resembles text data: it is a highly serialized string that contains latent feature information. Moreover, the DNA of an organism is a double-stranded helix, meaning there is a specific base-complementarity relationship between the two strands that lets the bases on the strands bind to each other. Processing viral gene sequences with natural language processing methods can mine the latent information in a sequence and exploit the complementary relationship between base pairs, so that a model can be trained to classify viral genes. This is a viable approach with unique advantages over traditional biological methods. Among traditional biological gene classification methods, BLAST is the most common and most effective; it operates on a large biological gene database, comparing the gene to be classified and identified against the data in the database and assigning it to the class of the most similar match. Because the query gene must be compared carefully against genes in the database, BLAST carries a large time overhead for classifying and identifying genes.
Disclosure of Invention
In view of the shortcomings of the prior art, a first object of the present invention is to provide a biological sequence processing and model training method. On the basis of reaching an accuracy similar to that of traditional gene classification and identification methods, the method saves time, and it can correctly predict genes that some traditional biological methods fail to classify.
In order to solve the above technical problems, the invention is realized by the following technical scheme:
A biological sequence processing and model training method, characterized by comprising the following steps:
S1, acquiring biological gene sequence data and integrating the data;
S2, preprocessing the data: traversing the read biological gene sequences and retaining those that meet the requirements;
S3, constructing the data set required for model training and fine-tuning it according to the per-class data counts so that every class in the data set has an approximately equal scale;
S4, balancing the amount of data and the gene sequence lengths in the data set to obtain a training set;
S5, training a model equipped with a reverse-complement network on the training set.
Further: in step S1, the GenBank ID of each biological gene sequence is acquired. The method for obtaining the GenBank IDs is as follows: prepare a file containing the GenBank IDs of a number of biological gene sequences; for each GenBank ID listed in the file, query a public biological gene database, download the biological information corresponding to that index, and store it in a FASTA file; alternatively, directly obtain a FASTA file that already contains the biological gene sequence information.
Further: the FASTA file is read, the information it contains is organized into a table in a specified format, and the biological information of the same gene is arranged in the same column of the table, yielding a local database containing all required data; the data in the local database are deduplicated and stored as a CSV file.
Further: in step S2, the data preprocessing method is as follows: load the CSV file, traverse each record, parse it, and replace any unconventional base it contains with the base N; when N appears more than 20 times consecutively in a single gene sequence, or, without appearing consecutively, makes up 5% of all bases of the whole sequence, remove that gene sequence.
Further: in step S3, the data set required for training the model is constructed according to the rule of class balance:
First, the classes that the training task must predict are determined, and the number of biological gene sequences contained in each class is counted. Every class contributing to the data set should have approximately the same number of sequences, so the same number of biological gene sequences is randomly extracted from each of the other classes, matching the count of the class with the fewest sequences. When the difference between the smallest class and the other classes exceeds the size of the smallest class itself, another class with a more suitable sequence count is chosen as the reference for extraction. When the data set is divided, if the data volume exceeds a certain threshold it is split by a fixed ratio; otherwise it is split according to the rule that the training set is larger than the test set, which in turn is larger than the validation set.
After the training, test, and validation sets are divided, the three sets are each shuffled randomly, the data and their corresponding classes are written into separate CSV files, and the CSV files are stored in the same folder to be loaded when model training is performed.
Further: in step S4,
(a) genes shorter than the length at the first 5% position of the length-sorted local database are copied and padded to the required length: a base on such a gene is randomly selected as the start of the self-replicating segment; the gene sequence from that start position to the last base of the sequence serves as the self-replicating padding segment and is appended to the tail of the original sequence; this operation is repeated until the gene reaches the required length;
(b) the classes with insufficient data in the training set are expanded: a part of an existing biological gene sequence is copied and treated as an independent biological gene sequence that can represent it, thereby balancing the data set;
(c) the full-length genes are segmented: with a sliding-window sampling method, gene fragments are sampled at fixed intervals and used as input data during model training (such a fragment is also called a subsequence of the gene sequence); when the interval is small enough, sufficiently many fragments can be sampled.
Further: in step S5,
S01: to convert a biological gene sequence into a numerical encoding, the biological genes are pre-trained with a one-hot code or a Skip-Gram, CBOW, or ELMo model, and the vector representation of each base is output and used as the input data of the method;
S02: sequence reverse-complement processing is adopted: while the model is trained, a DNA strand and its complementary strand are both fed into the model; two independent branch networks process the two inputs in parallel, each pair of identical layers across the two branches shares its weight parameters, and the data of the two strands are merged before the last layer outputs the final prediction result;
S03: during training, the parameters are adjusted flexibly and different models are trained according to the subsequence length, the sliding-window interval used to extract subsequences, and the type of deep learning network used by the training model; the best-performing model parameters are saved during training so that the best model can be loaded at test time.
A second object of the present invention is to provide an electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of the above.
A third object of the present invention is to provide a computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the method of any one of the above.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention combines biological sequence data processing, model training and biological sequence classification, wherein the problem that the data volume of a certain class in the data set is too small can be effectively relieved by expanding the data set, so that the model can learn the characteristics of each class in a fair environment. The invention uses the subsequence of the biological sequence to replace the complete sequence to train the model, can limit the sequence length of the input model and simultaneously increase the scale of the data set, and simultaneously uses the reverse complementary relation of the DNA sequence to improve the performance of the model, thereby leading the model to fully extract the content information of the gene sequence and better developing the potential of the deep learning model. The deep learning model is used for replacing the traditional biological gene sequence classification method, so that the time cost is reduced, and the classification accuracy is improved. The model of the invention can achieve good effect when being used for experiments on data sets formed by different data, so that the model can be easily expanded to other biological sequence classification tasks.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a schematic diagram of data preprocessing in the present invention;
FIG. 3 is a block diagram of a reverse complement network employed by the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
In order that those skilled in the art will better understand the technical solutions of the present invention, the following description of the preferred embodiments of the present invention is provided in conjunction with the specific examples, but it should be understood that the drawings are for illustrative purposes only and should not be construed as limiting the present invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
As shown in FIGS. 1 to 4, a biological sequence processing and model training method includes the following steps:
S1, acquiring biological gene sequence data and integrating the data;
S2, preprocessing the data: traversing the read biological gene sequences and retaining those that meet the requirements;
S3, constructing the data set required for model training and fine-tuning it according to the per-class data counts so that every class in the data set has an approximately equal scale;
S4, balancing the amount of each class of data in the data set and the lengths of the gene data to obtain a training set;
S5, training a model equipped with a reverse-complement network on the training set.
In step S1, the GenBank ID of each biological gene sequence is acquired. The method for obtaining the GenBank IDs is as follows: prepare a file containing the GenBank IDs of a number of biological gene sequences; for each GenBank ID listed in the file, query a public biological gene database, download the biological information corresponding to that index, and store it in a FASTA file; alternatively, directly obtain a FASTA file that already contains the biological gene sequence information.
In step S1, gene data are downloaded from a public database according to the GenBank IDs to be downloaded. (GenBank is an open-access sequence database that collects and annotates all publicly available nucleotide sequences and their translated proteins.)
The download channels include: logging in to the NCBI web page and following its download instructions, or downloading with the API built into the Biopython package. To download biological gene sequences, the GenBank IDs of the sequences are written in order into a txt file, one GenBank ID per line, and the downloaded gene sequence data are stored in a FASTA file. In this embodiment, the data set to be acquired includes alphavirus data, flavivirus data, and novel coronavirus (COVID-19) data. In the present invention, the virus data are classified according to the host they infect or their transmission medium, so the alphaviruses can be divided into the following categories: "Barmah Forest virus", "chikungunya virus", "eastern equine encephalitis virus", "Getah virus", "motoria virus", "equine pirovirus", "Sindbis virus", "Venezuelan equine encephalitis virus", and "western equine encephalitis virus". Flaviviruses and the novel coronaviruses can be classified in the same way.
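The download-by-API route described above can be sketched in Python. Biopython's `Entrez.efetch` is a real API, but the function names, file layout, and email placeholder below are illustrative assumptions, not the patent's actual implementation:

```python
from typing import List

def parse_genbank_ids(text: str) -> List[str]:
    """Parse the contents of the txt file: one GenBank ID per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def download_fasta(id_file: str, out_path: str, email: str) -> None:
    """Fetch the listed nucleotide records from NCBI and save them as FASTA."""
    from Bio import Entrez  # Biopython; install with `pip install biopython`
    with open(id_file) as f:
        ids = parse_genbank_ids(f.read())
    Entrez.email = email  # NCBI requires a contact address for E-utilities
    handle = Entrez.efetch(db="nucleotide", id=",".join(ids),
                           rettype="fasta", retmode="text")
    data = handle.read()
    handle.close()
    with open(out_path, "w") as out:
        out.write(data)
```

Usage would be, for example, `download_fasta("ids.txt", "sequences.fasta", "user@host")`, after which the FASTA file can be parsed into the local database.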
The FASTA file is read, the information it contains is organized into a table in a specified format, and the biological information of the same gene is arranged in the same column of the table, yielding a local database containing all required data for the subsequent model training and testing. The data in the local database are deduplicated and saved as a CSV file for convenient later loading and searching. The CSV file also serves as a backup of the training data, so the database can be modified easily when the model is revised in the future.
In step S2, the data preprocessing method is: load the CSV file, traverse each record, and parse it; if the record contains unconventional bases, i.e., bases other than A, T, C, G, U, and the unknown base N, replace them with the base N. When N appears more than 20 times consecutively in a single gene sequence, or, without appearing consecutively, accounts for 5% of all bases of the whole sequence, the record is unsuitable for training the model. If deleting such genes changes the data set scale by less than 1%, records in which N appears more than 20 times consecutively, or in which the number of Ns reaches 5% of all bases, can be deleted to optimize the training of the model.
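The base-replacement and filtering rules of step S2 can be sketched as follows; this is a minimal illustration under the stated thresholds (runs of N longer than 20, or N making up 5% of the sequence), with illustrative function names:

```python
import re

VALID = set("ATCGUN")  # conventional bases plus the unknown base N

def clean_sequence(seq: str) -> str:
    """Replace any base outside A/T/C/G/U/N with the unknown base N."""
    return "".join(b if b in VALID else "N" for b in seq.upper())

def passes_filter(seq: str, max_run: int = 20, max_frac: float = 0.05) -> bool:
    """Keep a sequence only if N never runs longer than max_run in a row
    and the Ns make up less than max_frac of all bases."""
    longest = max((len(m.group()) for m in re.finditer(r"N+", seq)), default=0)
    if longest > max_run:
        return False
    return seq.count("N") / max(len(seq), 1) < max_frac

def preprocess(sequences):
    """Clean every sequence, then drop those that fail the N criteria."""
    cleaned = [clean_sequence(s) for s in sequences]
    return [s for s in cleaned if passes_filter(s)]
```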
In step S3, the data set required for training the model is constructed according to the rule of class balance:
First, the classes that the training task must predict are determined, and the number of biological gene sequences contained in each class is counted. Every class contributing to the data set should have approximately the same number of sequences, so the same number of biological gene sequences is randomly extracted from each of the other classes, matching the count of the class with the fewest sequences. When the difference between the smallest class and the other classes exceeds the size of the smallest class itself, another class with a more suitable sequence count is chosen as the reference for extraction. When the data set is divided, if the data volume exceeds a certain threshold (typically around 10), it is split by a fixed ratio, commonly 6:2:2 or 7:2:1; otherwise it is split according to the rule that the training set is larger than the test set, which in turn is larger than the validation set.
After the training, test, and validation sets are divided, the three sets are each shuffled randomly, the data and their corresponding classes are written into separate CSV files, and the CSV files are stored in the same folder to be loaded when model training is performed.
In this embodiment, after the data are filtered, the data set required by the training model is constructed according to the rule of class balance; that is, if a large class contains many subclass branches, the subclasses must be selected evenly when the data set is built. As mentioned above, "eastern equine encephalitis virus" can be divided into 3 classes by pathogenicity; so that the model learns information as completely and fairly as possible, the same number of genes of each pathogenicity should be extracted when sampling the "eastern equine encephalitis virus" data.
According to the type of host infected, the alphaviruses can be divided into 9 classes, and the number of sequences extracted from each class for the data set should be similar. In the alphavirus data set, the largest class has 726 gene records while the smallest has only 18, so the same number of biological gene sequences — 18 per class — are randomly extracted from the other classes to match the smallest class. If the smallest class differs greatly from the other classes and holds fewer than 10 gene sequences, another class with a suitable sequence count is chosen as the reference: classes with more data than the reference are randomly downsampled, classes with less data are expanded, and the data set is then divided.
After the three data sets are divided, each is shuffled randomly; the data and their corresponding classes are then written into separate CSV files named x_train, y_train, x_validation, y_validation, x_test, and y_test, and the CSV files are stored in the same folder to be loaded when model training is performed.
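The class balancing and train/validation/test split described above can be sketched as follows (a 6:2:2 split is shown; function names and the fixed seed are illustrative):

```python
import random

def balance_classes(data_by_class: dict, seed: int = 0) -> dict:
    """Randomly downsample every class to the size of the smallest one."""
    rng = random.Random(seed)
    n = min(len(v) for v in data_by_class.values())
    return {c: rng.sample(v, n) for c, v in data_by_class.items()}

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed: int = 0):
    """Shuffle, then split into train/validation/test by the given ratios."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    a = int(n * ratios[0])
    b = a + int(n * ratios[1])
    return items[:a], items[a:b], items[b:]
```

The three returned lists would then be written out as the x_*/y_* CSV files named above.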
In step S4, (a) genes shorter than the length at the first 5% position of the length-sorted local database are copied and padded to the required length (because biological gene sequences differ in length, genes that are too short must be copied and padded to a certain length): a base on such a gene is randomly selected as the start of the self-replicating segment; the gene sequence from that start position to the last base is the segment used for self-replicating padding and is appended to the tail of the original sequence; this operation is repeated until the gene reaches the required length;
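The self-replication padding of step (a) can be sketched as follows; trimming any overshoot past the target length is an assumption, since the text only says the operation repeats until the required length is reached:

```python
import random

def pad_by_self_replication(seq: str, target_len: int, seed=None) -> str:
    """Extend a short sequence to target_len by repeatedly appending a copy
    of its own tail, starting from a randomly chosen base position."""
    rng = random.Random(seed)
    bases = list(seq)
    while len(bases) < target_len:
        start = rng.randrange(len(bases))   # random start of the segment
        bases.extend(bases[start:])         # append the tail segment
    return "".join(bases[:target_len])      # trim any overshoot (assumed)
```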
(b) The classes with insufficient data in the training set are expanded: a part of an existing biological gene sequence is copied and treated as an independent biological gene sequence that can represent it, thereby balancing the data set. For a training set, unequal numbers of sequences across classes can bias the trained model toward the classes with more data — data are then more easily assigned to those classes — producing unfairness, which is why the classes with insufficient data must be expanded.
(c) The full-length genes are segmented: with a sliding-window sampling method, gene fragments are sampled at fixed intervals and used as input data during model training (such a fragment is also called a subsequence of the gene sequence); when the interval is small enough, sufficiently many fragments can be sampled. As shown in FIG. 2, data are extracted by the sliding-window method: a fragment 5 bases long is extracted every 3 bases as one piece of independent gene sequence data, also called a subsequence. On the premise of having enough data to train the model without introducing data that do not come from the original gene sequences, this reduces the size of the model input, fully taps the potential of the deep learning model, and yields a better-fitting model.
The input of a deep learning model is limited in length, and feeding a whole gene into the model would produce an excessive number of parameters, so the whole gene must be cut into many small fragments. The fragment length also affects model training. To ensure that the fragments all come from the corresponding original genes in the data set and to obtain as many fragments as possible, the full length of the genes must be segmented.
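The sliding-window segmentation above can be sketched as follows, using the window of 5 bases and stride of 3 from FIG. 2 as defaults (real training would use subsequence lengths such as 120 or 200, per the embodiment):

```python
def sliding_subsequences(seq: str, window: int = 5, stride: int = 3):
    """Sample fixed-length gene fragments (subsequences) at a fixed interval
    along the full-length sequence, as in FIG. 2 (window 5, stride 3)."""
    return [seq[i:i + window]
            for i in range(0, len(seq) - window + 1, stride)]
```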
In step S5, S01: to convert a biological gene sequence into a numerical encoding, the biological genes are pre-trained with a one-hot code or a Skip-Gram, CBOW, or ELMo model, and the vector representation of each base is output and used as the input data of the method. Pre-training with such a model translates the original base sequence into a numerical sequence a computer can recognize; representing different bases with different high-dimensional vectors is superior to the traditional one-hot representation and improves the performance of the trained model.
As shown in FIG. 1, a pre-trained model is used to build a dictionary linking each base (pair) to a vector, and the dictionary vectors replace the one-hot code; this translates the original base sequence into a numerical sequence a computer can recognize, and representing different bases with different high-dimensional vectors outperforms the traditional one-hot representation, improving the performance of the trained model.
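The numerical encoding of step S01 can be illustrated with the one-hot baseline that the pre-trained dictionary replaces; a learned dictionary (from Skip-Gram, CBOW, or ELMo) would simply substitute other vectors for the one-hot rows. The alphabet and names here are illustrative:

```python
BASES = "ACGTUN"  # the conventional bases plus the unknown base N

def one_hot(base: str):
    """One-hot vector for a single base: the baseline representation."""
    vec = [0.0] * len(BASES)
    vec[BASES.index(base)] = 1.0
    return vec

def encode_sequence(seq: str, table=None):
    """Translate a base string into a list of vectors, using either the
    one-hot table or a learned base-to-vector dictionary."""
    if table is None:
        table = {b: one_hot(b) for b in BASES}
    return [table[b] for b in seq]
```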
S02: the reverse-complement network is used because the DNA of an organism forms a spatially regular double helix in which the two single strands are held together by complementary base pairing — A binds with T, and C binds with G — so that the two strands bind into the double helix. To exploit this, the invention adopts sequence reverse-complement processing: while the model is trained, a DNA strand and its complementary strand are both fed into the model; two independent branch networks process the two inputs in parallel, each pair of identical layers across the two branches shares its weight parameters, and the data of the two strands are merged before the last layer outputs the final prediction result;
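The reverse-complement processing of step S02 can be sketched as follows: the complementary strand is computed, both strands pass through the same (weight-shared) layer, and the two outputs are merged before the final prediction. The helper below is a structural illustration with hypothetical names, not the patent's actual network:

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")  # base-pairing table for DNA

def reverse_complement(seq: str) -> str:
    """Return the base-paired opposite strand, read in reverse."""
    return seq.translate(COMPLEMENT)[::-1]

def two_branch_predict(shared_layer, merge, seq_encoder, seq):
    """Run a sequence and its reverse complement through the SAME layer
    (weight sharing) and merge the two outputs for the final prediction."""
    fwd = shared_layer(seq_encoder(seq))
    rev = shared_layer(seq_encoder(reverse_complement(seq)))
    return merge(fwd, rev)
```

In a deep learning framework, weight sharing would amount to reusing one layer object for both branches, as `shared_layer` is reused here.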
S03: during training, the parameters are adjusted flexibly and different models are trained according to the subsequence length, the sliding-window interval used to extract subsequences, and the type of deep learning network used by the training model; the best-performing model parameters are saved during training so that the best model can be loaded at test time.
The method is used for training a model by inputting a data set of the alphavirus, the length of the subsequence is set to be 200, the interval of extracting the subsequence by a sliding window is set to be 1, a pre-training model is used for vectorizing base pairs, a reverse complementary model is used, 9 classification tasks are carried out on test data on a test set, the classification accuracy is up to 94.38%, and the result obtained by testing without the method is improved by 5.68%. When the length of the subsequence is reduced to 120, the accuracy of the 9-classification test is up to 97.32%, and the adjustment of the length of the subsequence has an influence on the prediction accuracy of the model. Therefore, parameters can be flexibly adjusted, and a parameter which is most suitable for a data set used by the training model is found.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Embodiments of the invention may be implemented using an existing processor, by a special-purpose processor for this or another purpose in a suitable system, or by a hardwired system. Embodiments of the present invention also include non-transitory computer-readable storage media comprising machine-readable media for carrying or storing machine-executable instructions or data structures; such machine-readable media can be any available media accessible by a general-purpose or special-purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of machine-executable instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or other machine with a processor. When information is transferred or provided over a network or another communications connection (hardwired, wireless, or a combination of the two) to a machine, the connection is likewise viewed as a machine-readable medium.
According to the description of the present invention and the accompanying drawings, those skilled in the art can easily make or use the biological sequence processing and model training method of the present invention, and can realize the positive effects described herein.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (9)

1. A biological sequence processing and model training method, characterized by comprising the following steps:
S1, acquiring biological gene sequence data and integrating the data;
S2, preprocessing the data: traversing the read biological gene sequences and retaining only those that meet the requirements;
S3, constructing the data set required by the training model, and fine-tuning it according to the number of items in each class so that all classes are of approximately equal size;
S4, performing class-count balancing and gene-length balancing on the data set to obtain a training set;
S5, training the model with the reverse-complement network by using the training set.
2. A biological sequence processing and model training method as claimed in claim 1, wherein: in step S1, the GenBank IDs of the biological gene sequences are acquired as follows: a file listing the GenBank IDs of a plurality of biological gene sequences is provided, and the biological information corresponding to each ID is queried and downloaded from a public biological gene database according to the GenBank IDs listed in the file and saved to a FASTA file; alternatively, a FASTA file already containing the biological gene sequence information is obtained directly.
3. A biological sequence processing and model training method as claimed in claim 2, wherein: the FASTA file is read and the information it contains is organized into a table in a specified format, with the biological information of the same gene arranged in the same column of the table, yielding a local database containing all required data; the data in the local database are then deduplicated and saved as a CSV file.
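For illustration only (code is not part of the claims), a minimal stdlib-Python sketch of this FASTA-to-CSV step; the three-column layout and the `parse_fasta`/`dedupe_records`/`write_local_db` names are assumptions, and deduplication here is keyed on the GenBank ID:

```python
import csv

def parse_fasta(fasta_text):
    """Parse FASTA text into (id, description, sequence) tuples."""
    records, header, seq = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header[0], header[1], "".join(seq)))
            parts = line[1:].split(None, 1)
            header = (parts[0], parts[1] if len(parts) > 1 else "")
            seq = []
        else:
            seq.append(line.upper())
    if header is not None:
        records.append((header[0], header[1], "".join(seq)))
    return records

def dedupe_records(records):
    """Keep only the first record seen for each GenBank ID."""
    seen, unique = set(), []
    for rec in records:
        if rec[0] not in seen:
            seen.add(rec[0])
            unique.append(rec)
    return unique

def write_local_db(records, path):
    """Save the deduplicated table as the CSV 'local database'."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["genbank_id", "description", "sequence"])
        writer.writerows(dedupe_records(records))
```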
4. A biological sequence processing and model training method as claimed in claim 3, wherein: in step S2, the data preprocessing method comprises: loading the CSV file, traversing and parsing each entry, and replacing every non-standard base contained in the data with the base N; a gene sequence is removed when N occurs more than 20 times consecutively, or when, even without such a run, Ns account for more than 5% of all bases in the sequence.
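A hedged sketch of this preprocessing rule, assuming "non-standard base" means anything outside A/C/G/T and that the 5% threshold is exclusive (the claim does not say whether a sequence with exactly 5% Ns is kept):

```python
import re

VALID_BASES = set("ACGT")

def clean_sequence(seq):
    """Replace any base outside A/C/G/T with 'N' (assumed reading of
    'unconventional base')."""
    return "".join(b if b in VALID_BASES else "N" for b in seq.upper())

def passes_filter(seq, max_run=20, max_frac=0.05):
    """Return False if N occurs in a run longer than max_run bases,
    or if Ns exceed max_frac of the whole sequence."""
    longest_run = max((len(m) for m in re.findall(r"N+", seq)), default=0)
    if longest_run > max_run:
        return False
    if seq and seq.count("N") / len(seq) > max_frac:
        return False
    return True
```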
5. A biological sequence processing and model training method as claimed in claim 1, wherein: in step S3, the data set required for training the model is constructed according to the rule of class balance:
firstly, the classes to be predicted by the model's prediction task are determined, and the number of biological gene sequences contained in each class is counted; every class making up the data set should contain approximately the same number of sequences, so the same number of sequences is randomly drawn from each of the other classes according to the count of the class containing the fewest sequences; when the gap between the smallest class and another class exceeds the size of the smallest class itself, a different class with a suitable sequence count is chosen as the reference and data are drawn from the other classes accordingly; when dividing the data set, if the data volume exceeds a certain value, the set is split according to a fixed proportion; otherwise it is split under the rule that the training set is larger than the test set, which in turn is larger than the validation set;
after the training, test and validation sets have been divided, the three sets are each randomly shuffled, the data and their corresponding classes are written to separate CSV files, and the CSV files are stored in the same folder to be loaded when model training is performed.
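The balancing and splitting rules above can be sketched as follows; the 80/10/10 ratio is only an assumed example of "a certain proportion", and the function names are hypothetical:

```python
import random

def balance_classes(records_by_class, seed=0):
    """Randomly subsample every class down to the size of the
    smallest class so the data set is class-balanced."""
    rng = random.Random(seed)
    n = min(len(v) for v in records_by_class.values())
    return {c: rng.sample(v, n) for c, v in records_by_class.items()}

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into train/test/validation sets; the claim
    only requires train > test > validation."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    a = int(n * ratios[0])
    b = a + int(n * ratios[1])
    return items[:a], items[a:b], items[b:]
```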
6. A biological sequence processing and model training method as claimed in claim 1, wherein in step S4:
(a) genes shorter than the length at the first 5% position of the local database are copied and padded to the required length: a base is randomly selected on such a gene as the starting position of a self-replicating segment; the gene sequence from that starting position to the last base serves as the self-replicating padding segment and is appended to the end of the original sequence; this operation is repeated until the gene reaches the required length;
(b) the data set is expanded for classes with insufficient data in the training set: a part of an existing biological gene sequence is copied and treated as an independent sequence capable of representing that gene, thereby balancing the data set;
(c) the full-length gene is segmented: using a sliding-window sampling method, gene segments are sampled at a fixed stride; each segment, also called a subsequence of the gene sequence, serves as input data during model training; when the stride is small enough, sufficiently many gene segments can be sampled.
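Steps (a) and (c) can be sketched as follows (illustrative only; trimming the padded gene back to exactly the target length is an assumption the claim leaves open):

```python
import random

def pad_by_self_replication(seq, target_len, seed=0):
    """Step (a): repeatedly append the suffix starting at a randomly
    chosen base until the gene reaches target_len, then trim
    (trimming is an assumption)."""
    rng = random.Random(seed)
    while len(seq) < target_len:
        start = rng.randrange(len(seq))
        seq = seq + seq[start:]
    return seq[:target_len]

def sliding_subsequences(seq, length, stride):
    """Step (c): sample fixed-length subsequences at a fixed stride."""
    return [seq[i:i + length]
            for i in range(0, len(seq) - length + 1, stride)]
```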
7. A biological sequence processing and model training method as claimed in claim 1, wherein in step S5:
S01: to convert the biological gene sequence into a numerical representation, one-hot encoding or a Skip-Gram, CBOW or ELMo model is used to pre-train on the biological genes, outputting a vector representation of each base that serves as the input data of the method;
S02: sequence reverse-complement processing is adopted: while training the model, a DNA strand and its complementary strand are both fed into the model; two independent branch network structures process the two inputs in parallel, with each pair of corresponding layers in the two branches sharing the same weight parameters; the data of the two strands are merged before the final layer to output the prediction result;
S03: when training the model, the parameters are flexibly adjusted and different models are trained according to the subsequence length, the stride used by the sliding window when extracting subsequences, and the type of deep learning network used by the training model; during training, the best-performing model parameters are saved so that the best model can be loaded when the model is tested.
8. An electronic device, characterized by comprising:
one or more processors; and
a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
9. A computer-readable medium having a computer program stored thereon, characterized in that: the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202210446243.2A 2022-04-26 2022-04-26 Biological sequence processing and model training method Pending CN114881131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210446243.2A CN114881131A (en) 2022-04-26 2022-04-26 Biological sequence processing and model training method

Publications (1)

Publication Number Publication Date
CN114881131A true CN114881131A (en) 2022-08-09

Family

ID=82670771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210446243.2A Pending CN114881131A (en) 2022-04-26 2022-04-26 Biological sequence processing and model training method

Country Status (1)

Country Link
CN (1) CN114881131A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116364195A (en) * 2023-05-10 2023-06-30 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN116364195B (en) * 2023-05-10 2023-10-13 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN117831630A (en) * 2024-03-05 2024-04-05 北京普译生物科技有限公司 Method and device for constructing training data set for base recognition model and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination