CN113658641A

CN113658641A - Phage classification method, device, equipment and storage medium

Info

Publication number: CN113658641A
Application number: CN202110819164.7A
Authority: CN
Inventors: 朱怀球; 吴姝芳; 方臻成; 谭洁; 李墨; 王春晖; 郭倩; 徐聪敏; 江小青
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2021-11-16

Abstract

The application provides a phage classification method, device, equipment and storage medium, and relates to the technical field of biological sequencing. In the method, short sequence samples extracted from phage genome data are one-hot encoded, and a pre-constructed deep neural network is trained directly on the base information, yielding a classifier that can identify from short sequence information whether a phage is a temperate phage or a virulent phage. Because phage sequences obtained by metagenomic sequencing span a wide range of lengths, deep neural networks are provided for four groups of length intervals; the network for each length interval learns sequence features of the corresponding length, which improves prediction performance. In addition, a weight-based prediction method is provided for sequences longer than the four length intervals, so that phage sequences of any length can be classified. Reverse complementation of the short sequence samples expands the training data and improves the classification performance of the deep neural network.

Description

Phage classification method, device, equipment and storage medium

Technical Field

The application relates to the technical field of biological sequencing, in particular to a phage classification method, a device, equipment and a storage medium.

Background

Phages, the most abundant and most diverse group of viruses, play an important role in microbial communities. Compared with traditional culture-based virus discovery, metagenomic technology can obtain the sequence information of all microorganisms in an environment, including phages, in a single sequencing run. It avoids culture-related bias, determines the sequence information of new phages accurately and quickly, and greatly accelerates their discovery.

Phages are classified into two types, temperate and virulent. After a temperate phage infects a bacterial host, its genome sequence is inserted into the host genome; when environmental conditions are suitable, the phage genome is excised from the host genome and the host is killed. A virulent phage kills the bacterial host directly after infection. Temperate and virulent phages are applied in different fields: for example, the study of horizontal gene transfer involves temperate phages, and phage therapy involves virulent phages. To facilitate further research, the type of a newly found phage therefore needs to be determined quickly.

However, the lengths of phage sequences found by metagenomic technology are irregular, and relatively complete genetic information often cannot be obtained, so the type of a newly found phage is difficult to determine.

Disclosure of Invention

The embodiments of the application provide a phage classification method, device, equipment and storage medium, which classify phages directly from base information so that the type of a phage can be identified even when only short sequence information is available.

A first aspect of embodiments of the present application provides a method of phage classification, the method comprising:

obtaining at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified;

carrying out one-hot encoding on all bases of the phage fragment to be classified to obtain base sequence features;

inputting the base sequence features into the at least one preset classification network;

and determining the type of the phage fragments to be classified according to the output scores of the at least one preset classification network.
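The encoding step above can be illustrated with a minimal sketch. The channel order A, C, G, T and the all-zero row for any other symbol (for example N) are assumptions of this sketch; the application does not specify them, and `one_hot_encode` is a hypothetical helper name.

```python
# Assumed channel order; the source text does not state one.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base of the input sequence.

    Symbols outside A/C/G/T map to an all-zero row, which is an
    assumption of this sketch rather than part of the application.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


features = one_hot_encode("ACGTN")
```

Each fragment of length n thus becomes an n x 4 matrix that can be passed to a convolutional network without any hand-crafted features.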

Optionally, obtaining at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified, including:

when the sequence length is within the range of 100bp-400bp, acquiring a first preset classification network;

when the sequence length is within the range of 400bp-800bp, acquiring a second preset classification network;

when the sequence length is within the range of 800bp-1200bp, acquiring a third preset classification network;

when the sequence length is within the range of 1200bp-1800bp, a fourth preset classification network is obtained;

and when the sequence length is more than 1800bp, acquiring the fourth preset classification network.
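The length-based selection above might be sketched as follows. Adjacent intervals share their end points in the text (for example 400bp), so this sketch assumes that each upper bound is inclusive; `select_network` is a hypothetical helper name, and fragments shorter than 100bp are assumed to be padded before being passed to the first network.

```python
def select_network(seq_len):
    """Return the name of the preset classification network for a length in bp.

    Assumption: upper bounds are inclusive (400 -> first, 401 -> second).
    Lengths below 100 map to the first network (after padding), and
    lengths above 1800 map to the fourth network.
    """
    if seq_len < 1:
        raise ValueError("sequence length must be positive")
    if seq_len <= 400:
        return "first"
    if seq_len <= 800:
        return "second"
    if seq_len <= 1200:
        return "third"
    return "fourth"


chosen = [select_network(n) for n in (250, 650, 1000, 1500, 5000)]
```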

Optionally, in a case that the sequence length is greater than 1800bp, obtaining at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified, including:

dividing the phage fragments to be classified according to a 1800bp scale;

according to the segmented shortest sequence segment, taking one of the first preset classification network, the second preset classification network, the third preset classification network and the fourth preset classification network as a supplementary classification network;

and taking the fourth preset classification network and the supplementary classification network as the preset classification networks of the phage fragments to be classified.

Optionally, obtaining at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified, including:

when the sequence length is less than 100bp, acquiring a padding length according to the input range of the first preset classification network;

and padding the phage fragment to be classified to the padding length with a preset code, and acquiring the first preset classification network.
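A minimal sketch of the handling of fragments shorter than 100bp, assuming that the target length is 100 rows, that the preset code is an all-zero row, and that the extra rows are appended at the end of the fragment. None of these three choices is fixed by the text, and `pad_features` is a hypothetical helper name.

```python
def pad_features(features, target_len=100, pad_row=(0, 0, 0, 0)):
    """Extend a list of one-hot rows up to target_len rows.

    target_len, pad_row and right-side padding are assumptions of this
    sketch. Inputs that are already long enough are returned unchanged
    (as a copy).
    """
    missing = max(0, target_len - len(features))
    padded = [list(row) for row in features]
    padded.extend(list(pad_row) for _ in range(missing))
    return padded


short_fragment = [[1, 0, 0, 0]] * 60   # 60 bases, all "A" under an A,C,G,T order
padded_fragment = pad_features(short_fragment)
```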

Optionally, determining the type of the phage fragment to be classified according to the output score of the at least one preset classification network includes:

when the sequence length is larger than 1800bp, determining the length of an input sequence of each preset classification network in the at least one preset classification network;

calculating a weighted average of the output scores of the at least one preset classification network according to the length of each input sequence;

and determining the type of the phage fragments to be classified according to the weighted average value.
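The handling of fragments longer than 1800bp can be sketched as follows: the fragment is cut into 1800bp segments plus one shorter remainder, each segment is scored by the network matching its length, and the scores are averaged with the segment lengths as weights. The helper names and the example scores are hypothetical; the scores stand in for network outputs and are not results from the application.

```python
SEGMENT_LEN = 1800  # segmentation scale stated in the text


def split_lengths(seq_len, segment_len=SEGMENT_LEN):
    """Return the segment lengths obtained by cutting every segment_len bases."""
    full, rest = divmod(seq_len, segment_len)
    lengths = [segment_len] * full
    if rest:
        lengths.append(rest)
    return lengths


def weighted_score(scores, lengths):
    """Length-weighted average of per-segment prediction scores."""
    if len(scores) != len(lengths) or not lengths:
        raise ValueError("need one score per segment")
    return sum(s * n for s, n in zip(scores, lengths)) / sum(lengths)


lengths = split_lengths(4000)        # two full segments plus a 400bp remainder
example_scores = [0.9, 0.8, 0.3]     # hypothetical network outputs
combined = weighted_score(example_scores, lengths)
```

Weighting by length means that the short remainder, which carries less sequence information, influences the final score less than the full 1800bp segments.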

Optionally, the method further comprises:

acquiring first full genome data of the temperate phage and second full genome data of the virulent phage;

simulating the characteristics of sequence information obtained by metagenomic sequencing, and setting a plurality of length intervals with different lengths;

extracting a plurality of first short sequence samples respectively corresponding to the length intervals from the first full genome data;

extracting a plurality of second short sequence samples respectively corresponding to the plurality of length intervals from the second full genome data;

and obtaining the deep neural networks respectively corresponding to the length intervals, and utilizing the first short sequence samples and the second short sequence samples to respectively train the deep neural networks respectively corresponding to the length intervals for multiple times to obtain the at least one preset classification network.

Optionally, after extracting a plurality of first short sequence samples from the first full genome data, the method further comprises:

performing reverse complementation on the plurality of first short sequence samples to obtain a plurality of third short sequence samples;

after extracting a plurality of second short sequence samples from the second full genome data, the method further comprises:

performing reverse complementation on the plurality of second short sequence samples to obtain a plurality of fourth short sequence samples;

utilizing the plurality of first short sequence samples and the plurality of second short sequence samples to respectively train the deep neural networks corresponding to the plurality of length intervals for a plurality of times to obtain the at least one preset classification network, including:

and training the deep neural networks respectively corresponding to the length intervals for multiple times by using the first short sequence samples, the second short sequence samples, the third short sequence samples and the fourth short sequence samples to obtain the at least one preset classification network.

A second aspect of embodiments of the present application provides a phage classification apparatus, the apparatus including:

the acquisition module is used for acquiring at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified;

the coding module is used for carrying out one-hot encoding on all bases of the phage fragment to be classified to obtain base sequence features;

an input module, configured to input the base sequence features into the at least one preset classification network;

and the output module is used for determining the type of the phage fragment to be classified according to the output score of the at least one preset classification network.

Optionally, the obtaining module includes:

the first obtaining submodule is used for obtaining a first preset classification network when the sequence length is within the range of 100bp-400 bp;

the second obtaining submodule is used for obtaining a second preset classification network when the sequence length is within the range of 400bp-800 bp;

the third obtaining submodule is used for obtaining a third preset classification network when the sequence length is in the range of 800bp-1200 bp;

the fourth obtaining submodule is used for obtaining a fourth preset classification network when the sequence length is within the range of 1200bp-1800 bp;

and the fifth obtaining submodule is used for obtaining the fourth preset classification network when the sequence length is greater than 1800 bp.

Optionally, the obtaining module further includes:

the segmentation submodule is used for segmenting the phage fragments to be classified according to the scale of 1800 bp;

a supplementary sub-module, configured to use one of the first preset classification network, the second preset classification network, the third preset classification network, and the fourth preset classification network as a supplementary classification network according to the segmented shortest sequence segment;

and the submodule is used for taking the fourth preset classification network and the supplementary classification network as the preset classification network of the phage fragments to be classified.

Optionally, the obtaining module includes:

a sixth obtaining submodule, configured to obtain a padding length according to an input range of the first preset classification network when the sequence length is smaller than 100bp;

and the padding submodule is used for padding the phage fragment to be classified to the padding length with a preset code and acquiring the first preset classification network.

Optionally, the output module includes:

the first determining submodule is used for determining the length of an input sequence of each preset classification network in the at least one preset classification network when the sequence length is greater than 1800 bp;

the calculation submodule is used for calculating the weighted average value of the output scores of the at least one preset classification network according to the length of each input sequence;

and the second determining submodule is used for determining the type of the phage fragment to be classified according to the weighted average value.

Optionally, the apparatus further comprises:

the seventh acquisition submodule is used for acquiring the first full genome data of the temperate phage and the second full genome data of the virulent phage;

the simulation submodule is used for simulating the characteristics of sequence information obtained by metagenome sequencing and setting a plurality of length intervals with different lengths;

a first extraction submodule, configured to extract, from the first full-genome data, a plurality of first short sequence samples respectively corresponding to the plurality of length intervals;

a second extraction submodule, configured to extract, from the second full-genome data, a plurality of second short sequence samples respectively corresponding to the plurality of length intervals;

and the training sub-module is used for acquiring the deep neural networks respectively corresponding to the length intervals, and training the deep neural networks respectively corresponding to the length intervals for multiple times by using the first short sequence samples and the second short sequence samples to obtain the at least one preset classification network.

Optionally, the apparatus further comprises:

the first reverse complement module is used for performing reverse complementation on the plurality of first short sequence samples to obtain a plurality of third short sequence samples;

the second reverse complement module is used for performing reverse complementation on the plurality of second short sequence samples to obtain a plurality of fourth short sequence samples;

the training submodule includes:

and the training subunit is configured to perform multiple training on the deep neural networks respectively corresponding to the multiple length intervals by using the multiple first short sequence samples, the multiple second short sequence samples, the multiple third short sequence samples, and the multiple fourth short sequence samples, so as to obtain the at least one preset classification network.

A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the method according to the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.

In the application, short sequence samples extracted from phage genome data are one-hot encoded, and a pre-constructed deep neural network is trained directly on the base information. This yields a classifier that can identify from short sequence information whether a phage is a temperate phage or a virulent phage, so the type of a phage is determined directly from the base information of a phage fragment. In addition, four groups of length intervals matching the characteristics of metagenomic sequencing results are set, training samples are collected for each of the four groups, and a deep neural network is trained for each length interval, giving preset classification networks for the four different length intervals. The method is therefore suitable for phage fragments of different lengths and addresses the large length span of phage fragments obtained by metagenomic sequencing: the deep neural network for a specific length interval learns the sequence features of that length in a targeted manner, so the type of a phage fragment of that length is predicted more accurately. When the lengths of phage fragments found by metagenomic sequencing are irregular, the application analyzes each phage fragment with the preset classification network or networks corresponding to its sequence length, and determines the type of the phage by combining the results of the at least one preset classification network, which improves classification accuracy.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic structural diagram of a deep neural network constructed according to an embodiment of the present application;

FIG. 2 is a flow chart of the steps of a phage classification method according to an embodiment of the present application;

FIG. 3 is a functional block diagram of a phage classification apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The traditional method for judging the type of the phage comprises the following steps:

1. Classifying the newly discovered phage according to its protein sequence information. Specifically, the proteome encoded by the sequence of the newly discovered phage is compared with a pre-constructed protein set to obtain a similarity vector, the similarity vector is input into a random forest classifier, and the classifier outputs whether the newly discovered phage is a temperate phage or a virulent phage according to the features carried by the similarity vector. In essence, this method detects whether the input proteome contains important protein genes that are useful for classification. It performs poorly here, because the sequence information of phages found by metagenomic sequencing is short and complex, and the protein genes that support type determination are difficult to extract.

2. Extracting the k-mer frequencies of the newly found phage sequence and inputting them into a Markov model of temperate phages and a Markov model of virulent phages. The temperate-phage Markov model outputs the degree of dissimilarity between the newly found phage and temperate phage sequences, and the virulent-phage Markov model outputs the degree of dissimilarity between the newly found phage and virulent phage sequences. The type of the newly found phage is determined from these degrees of dissimilarity. However, k-mer frequency features are global statistics, and the k-mer frequencies of short sequences can be very noisy. Because the sequence information of phages discovered by metagenomic sequencing is short, accurate k-mer frequency features cannot be extracted from the newly discovered phage, which reduces prediction accuracy.

In view of the above problems, the embodiments of the present application provide a phage classification method in which short sequence samples extracted from phage genome data are one-hot encoded and a pre-constructed deep neural network is trained directly on the base information. This yields a classifier that can identify from short sequence information whether a phage is a temperate phage or a virulent phage, so the type of a phage is determined directly from the base information of a phage fragment. In addition, four groups of length intervals matching the characteristics of metagenomic sequencing results are set, training samples are collected for each of the four groups, and a deep neural network is trained for each length interval, giving preset classification networks for the four different length intervals. The method is therefore suitable for phage fragments of different lengths and addresses the large length span of phage fragments obtained by metagenomic sequencing: the deep neural network for a specific length interval learns the sequence features of that length in a targeted manner, so the type of a phage fragment of that length is predicted more accurately.

In order to more intelligently implement the phage classification method provided by the application and enable the application range of the method to be wider, the applicant constructs a deep neural network and trains the constructed deep neural network by adopting a training sample collected in advance to obtain a preset classification network.

Fig. 1 is a schematic structural diagram of a deep neural network constructed according to an embodiment of the present application. As shown in Fig. 1, the deep neural network includes a convolutional layer, two pooling layers, two normalization layers, a dropout (random deactivation) layer, and two fully connected layers. The first layer is a convolutional layer that convolves the entire input sequence with 64 convolution kernels of length 6 and uses ReLU (Rectified Linear Unit) as the activation function. The output of the convolutional layer is passed to a pooling layer and a normalization layer, which serve to optimize the training process. The normalization layer is followed by the dropout layer, which helps prevent overfitting, and the dropout layer is followed in sequence by a global pooling layer, the first fully connected layer, and a batch normalization layer. The last layer is the second fully connected layer, whose activation function is sigmoid and which outputs a prediction score between 0 and 1.
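The layer order described above can be traced with a small shape calculation. This is a sketch under stated assumptions: only the 64 kernels of length 6 come from the text, while the pooling size, the width of the first fully connected layer and unpadded ("valid") convolution are hypothetical choices. The global layer that follows the dropout layer is read here as a global pooling step that collapses the position axis, consistent with the count of two pooling layers.

```python
def trace_shapes(seq_len, filters=64, kernel=6, pool=3, dense_units=64):
    """Return (layer name, output shape) pairs for one input of seq_len bases.

    pool and dense_units are hypothetical; the text only fixes the
    64 kernels of length 6. Convolution without padding is assumed.
    """
    conv_len = seq_len - kernel + 1
    if conv_len < 1:
        raise ValueError("sequence shorter than the convolution kernel")
    pooled_len = conv_len // pool
    return [
        ("one-hot input", (seq_len, 4)),
        ("convolution + ReLU", (conv_len, filters)),
        ("pooling", (pooled_len, filters)),
        ("normalization", (pooled_len, filters)),
        ("dropout", (pooled_len, filters)),
        ("global pooling", (filters,)),
        ("first fully connected", (dense_units,)),
        ("batch normalization", (dense_units,)),
        ("second fully connected + sigmoid", (1,)),
    ]


shapes = trace_shapes(400)
for name, shape in shapes:
    print(f"{name:34s} {shape}")
```

Because the global pooling step removes the position axis, the same stack accepts inputs of any length within a length interval, which is one plausible reason for using it ahead of the fully connected layers.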

After the deep neural network is constructed, training samples are collected, and the method for specifically collecting the training samples comprises the following steps:

step S11: first complete genome data of the temperate phage and second complete genome data of the virulent phage are obtained.

The first and second full genome data may be obtained from the PHACTS (Phage Classification Tool Set) tool. 227 phage whole genomes carrying accurate type information are extracted from the PHACTS tool, comprising 148 whole genomes of temperate phages and 79 whole genomes of virulent phages. The virulent phage Mycobacteriophage D29 (accession: NC_001900) is similar to the temperate phage L5, and the virulent phage Lactococcus bacteriophage ul36 (accession: NC_004066) shares 46.6% homology with the temperate phage Tuc2009. Moreover, because D29 and ul36 contain functional integrases, they can insert their own genomes into a host in the way temperate phages do. So as not to impair the training of the deep neural network, the application excludes D29 and ul36 and retains the whole genomes of the remaining virulent phages as the second full genome data.

Step S12: simulating the characteristics of sequence information obtained by metagenomic sequencing, and setting a plurality of length intervals with different lengths.

The characteristics of sequence information obtained by metagenomic sequencing are simulated as follows. Phage fragments obtained by metagenomic sequencing span a wide range of lengths, from hundreds of bp to thousands of bp. Following this length distribution, the embodiment of the application therefore sets four groups of length intervals: Group A corresponds to 100bp-400bp, Group B to 400bp-800bp, Group C to 800bp-1200bp, and Group D to 1200bp-1800bp.

Further, the embodiments of the present application divide the first full genome data and the second full genome data each into five parts to build a five-fold cross-validation data set. In a specific example, the 77 phage whole genomes belonging to virulent phages are randomly divided into 16, 16 and 13; when training the deep neural network, the {16, 16} whole genomes of virulent phages are used as the training set, and the 13 whole genomes of virulent phages are used as the test set. The 148 phage whole genomes belonging to temperate phages are randomly divided into 118, 118 and 120; when training the deep neural network, the {118, 118} whole genomes of temperate phages are used as the training set, and the 148 whole genomes of temperate phages are used as the test set.

Step S13: extracting a plurality of first short sequence samples respectively corresponding to the plurality of length intervals from the first full genome data.

Step S14: extracting a plurality of second short sequence samples respectively corresponding to the plurality of length intervals from the second full genome data.

The extraction of a plurality of first short sequence samples respectively corresponding to a plurality of length intervals from the first complete genome data and the extraction of a plurality of second short sequence samples respectively corresponding to a plurality of length intervals from the second complete genome data refer to the extraction of short sequence fragments with different lengths from the complete genome. Specifically, metagenome sequencing can be simulated by using MetaSim software, and a plurality of short sequence fragments can be generated by simulation by using a whole genome as a reference.

Extracting the plurality of first short sequence samples respectively corresponding to the plurality of length intervals from the first full genome data means extracting a plurality of first short sequence samples with a length within a range of 100bp to 400bp from the first full genome data, extracting a plurality of first short sequence samples with a length within a range of 400bp to 800bp from the first full genome data, extracting a plurality of first short sequence samples with a length within a range of 800bp to 1200bp from the first full genome data, and extracting a plurality of first short sequence samples with a length within a range of 1200bp to 1800bp from the first full genome data.

Extracting a plurality of second short sequence samples respectively corresponding to a plurality of length intervals from the second full genome data means: extracting a plurality of second short sequence samples with the length ranging from 100bp to 400bp from the second complete genome data, extracting a plurality of second short sequence samples with the length ranging from 400bp to 800bp from the second complete genome data, extracting a plurality of second short sequence samples with the length ranging from 800bp to 1200bp from the second complete genome data and extracting a plurality of second short sequence samples with the length ranging from 1200bp to 1800bp from the second complete genome data.
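The extraction of short sequence samples per length interval can be approximated by a very small sampler. This is not MetaSim and models no sequencing errors: it only draws substrings with uniformly random length and start position, which is a simplification chosen for this sketch. `sample_fragments` and the toy genome are hypothetical.

```python
import random

# Length intervals of Groups A-D as stated in the text (bp, inclusive here).
LENGTH_INTERVALS = {"A": (100, 400), "B": (400, 800), "C": (800, 1200), "D": (1200, 1800)}


def sample_fragments(genome, group, count, rng):
    """Draw `count` substrings of `genome` whose lengths fall in the group's interval."""
    low, high = LENGTH_INTERVALS[group]
    if len(genome) < high:
        raise ValueError("genome shorter than the longest fragment of this group")
    fragments = []
    for _ in range(count):
        length = rng.randint(low, high)               # both ends inclusive
        start = rng.randint(0, len(genome) - length)  # fragment stays inside the genome
        fragments.append(genome[start:start + length])
    return fragments


rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))  # stand-in for a real genome
group_b = sample_fragments(toy_genome, "B", 10, rng)
```

Running the sampler once per group on the temperate genomes and once per group on the virulent genomes would give the labeled first and second short sequence samples for the four networks.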

In the embodiment of the present application, after the training samples are acquired by the method in steps S11-S14, the deep neural networks corresponding to the plurality of length intervals are acquired, and the deep neural networks corresponding to the plurality of length intervals are trained for a plurality of times by using the plurality of first short sequence samples and the plurality of second short sequence samples, respectively, to obtain the at least one preset classification network.

Four deep neural networks constructed as shown in Fig. 1 are obtained: a first deep neural network, a second deep neural network, a third deep neural network and a fourth deep neural network. The first deep neural network is trained with a plurality of first short sequence samples and a plurality of second short sequence samples whose lengths are within the range of 100bp-400bp to obtain the first preset classification network. The second deep neural network is trained with a plurality of first short sequence samples and a plurality of second short sequence samples whose lengths are within the range of 400bp-800bp to obtain the second preset classification network. The third deep neural network is trained with a plurality of first short sequence samples and a plurality of second short sequence samples whose lengths are within the range of 800bp-1200bp to obtain the third preset classification network. The fourth deep neural network is trained with a plurality of first short sequence samples and a plurality of second short sequence samples whose lengths are within the range of 1200bp-1800bp to obtain the fourth preset classification network.

In an example of the application, a sequence for which the fully connected layer outputs a score higher than 0.5 is judged to be a positive sample (virulent phage), and a sequence with a score lower than 0.5 is judged to be a negative sample (temperate phage). The parameters of the deep neural network are adjusted according to the degree of difference between the judgment result and the true type of the input short sequence sample. Labeling the first short sequence samples and the second short sequence samples with their types makes it possible to calculate this degree of difference.
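The decision rule and the "degree of difference" can be sketched as follows. The text does not name a loss function; binary cross-entropy is used here as a common choice for a sigmoid output and is an assumption of this sketch, as is sending a score of exactly 0.5 to the temperate class.

```python
import math


def predict_type(score, threshold=0.5):
    """Scores above the threshold are virulent (positive), otherwise temperate."""
    return "virulent" if score > threshold else "temperate"


def binary_cross_entropy(score, label, eps=1e-7):
    """Assumed loss: label 1 = virulent phage, label 0 = temperate phage."""
    p = min(max(score, eps), 1.0 - eps)  # keep log() away from 0
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


confident_right = binary_cross_entropy(0.9, 1)  # small loss
confident_wrong = binary_cross_entropy(0.1, 1)  # large loss
```

A larger loss for a confident wrong prediction than for a confident right one is what drives the parameter adjustment described above.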

The first short sequence sample or the second short sequence sample is input into the deep neural network. The convolutional layer extracts and learns features, the pooling layer and the normalization layer optimize training, the dropout layer prevents overfitting, and the global pooling layer, the first fully connected layer and the batch normalization layer carry out further learning; the second fully connected layer, whose activation function is sigmoid, then outputs a prediction score between 0 and 1. During training, Adam is used as the optimizer, the learning rate is set to 0.0001, and the number of training rounds is set to 100.
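For readers unfamiliar with Adam, a single-parameter update might look like the sketch below. Only the learning rate of 0.0001 and the count of 100 come from the text; the values of beta1, beta2 and eps are the commonly used defaults and are assumptions here, and the quadratic toy objective merely stands in for the network loss.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = (w - 1)^2 with gradient 2 * (w - 1).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * (w - 1)
    w, m, v = adam_step(w, grad, m, v, t)
```

With this learning rate each update moves the parameter by roughly 0.0001, so the parameter travels only about 0.01 in 100 updates; in practice each training round consists of many such updates over batches of samples.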

In another embodiment of the present application, another method for training the deep neural network is provided, in which reverse complementary sequences of the first short sequence samples and the second short sequence samples are generated, so as to expand the data set used for training the deep neural network and improve the classification accuracy of the preset classification network.

After extracting a plurality of first short sequence samples from the first full genome data, the method further comprises: taking the reverse complement of the plurality of first short sequence samples to obtain a plurality of third short sequence samples; after extracting a plurality of second short sequence samples from the second full genome data, the method further comprises: taking the reverse complement of the plurality of second short sequence samples to obtain a plurality of fourth short sequence samples.

In one example of the present application, assuming that the base information of the first short sequence sample is ACGCTATTGCACCG, taking its reverse complement yields the third short sequence sample: the complementary strand pairs base by base with the original as TGCGATAACGTGGC (written in the 3'→5' direction), which reads CGGTGCAATAGCGT in the 5'→3' direction.
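A minimal sketch of this augmentation step follows. The function names complement, reverse_complement and augment are hypothetical, and only the four standard bases are handled. The base-by-base complement of the example sequence is TGCGATAACGTGGC; reversing it gives the strand in the conventional 5'→3' orientation.

```python
# Translation table pairing each base with its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, in the same order as the input."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Complementary strand read in the 5'->3' direction."""
    return complement(seq)[::-1]

def augment(samples):
    """Return the samples followed by their reverse complements."""
    return list(samples) + [reverse_complement(s) for s in samples]

sample = "ACGCTATTGCACCG"
print(complement(sample))          # TGCGATAACGTGGC
print(reverse_complement(sample))  # CGGTGCAATAGCGT
print(len(augment([sample])))      # 2
```

Applying augment to the first and second short sequence samples doubles the number of training sequences without requiring additional genome data.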

Training the deep neural networks respectively corresponding to the plurality of length intervals multiple times by using the plurality of first short sequence samples and the plurality of second short sequence samples to obtain the at least one preset classification network comprises: training the deep neural networks respectively corresponding to the length intervals multiple times by using the first short sequence samples, the second short sequence samples, the third short sequence samples and the fourth short sequence samples to obtain the at least one preset classification network.

After training to obtain at least one preset classification network corresponding to the four groups of different sequence lengths, the application describes steps for performing a phage classification method based on the at least one preset classification network.

FIG. 2 is a flowchart of the steps of the phage classification method proposed in the embodiments of the present application. As shown in FIG. 2, the steps are as follows:

Step S21: acquiring at least one preset classification network corresponding to the sequence length of the phage fragment to be classified according to the sequence length.

In the case that the sequence length is greater than 1800bp, the phage fragment to be classified is segmented at a scale of 1800bp, and one of the first preset classification network, the second preset classification network, the third preset classification network and the fourth preset classification network is selected as a supplementary classification network according to the length of the shortest sequence fragment obtained by segmentation; the fourth preset classification network and the supplementary classification network are taken as the preset classification networks of the phage fragment to be classified.

In an example of the present application, assuming that the sequence length of the phage fragment a to be classified is 2400bp, the phage fragment to be classified is divided into a sequence fragment 1 with a length of 1800bp and a sequence fragment 2 with a length of 600bp, and the second preset classification network and the fourth preset classification network are used as the preset classification networks of the phage fragment a to be classified.
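The segmentation in this example can be sketched as follows. The function name split_fragment is hypothetical, and cutting from the start of the sequence in consecutive 1800bp pieces is an assumption, since the text only states the 1800bp scale.

```python
def split_fragment(seq, scale=1800):
    """Cut seq into consecutive pieces of at most `scale` bases.

    Every piece except possibly the last has exactly `scale` bases.
    """
    return [seq[i:i + scale] for i in range(0, len(seq), scale)]

fragment_a = "ACGT" * 600  # placeholder 2400 bp sequence, not real phage data
pieces = split_fragment(fragment_a)
print([len(p) for p in pieces])  # [1800, 600]
```

The 1800bp piece corresponds to sequence fragment 1 and the 600bp remainder to sequence fragment 2.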

When the sequence length of the phage fragment to be classified is less than 100bp, a padding length is determined according to the input range of the first preset classification network; the phage fragment to be classified is padded by that length using a preset code, and the first preset classification network is acquired.

The preset code may be [0, 0, 0, 0]. In one example of the present application, assuming that the sequence length of the phage fragment B to be classified is 40bp, the padding length is determined to be 60bp according to the 100bp minimum of the input range of the first preset classification network. The phage fragment B to be classified is padded to 100bp with 60bp of the preset code [0, 0, 0, 0] to obtain a phage fragment B' to be classified; the phage fragment B' to be classified is one-hot encoded and input into the first preset classification network.
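A minimal sketch of the padding step, operating on rows that are already one-hot encoded. The name pad_encoded is hypothetical, and appending the preset code at the end of the fragment is an assumption, since the text does not state where the padding is placed.

```python
PAD_CODE = [0, 0, 0, 0]  # preset code used for padded positions

def pad_encoded(rows, min_length=100):
    """Append PAD_CODE rows until the fragment has min_length positions.

    Fragments that are already long enough are returned unchanged (copied).
    """
    missing = max(0, min_length - len(rows))
    return [list(r) for r in rows] + [list(PAD_CODE) for _ in range(missing)]

# Placeholder 40 bp fragment consisting only of base A ([0, 0, 0, 1]).
fragment_b = [[0, 0, 0, 1]] * 40
padded = pad_encoded(fragment_b)
print(len(padded), padded[-1])  # 100 [0, 0, 0, 0]
```

Because the preset code contains no 1, the padded positions do not correspond to any of the four bases.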

Step S22: performing one-hot encoding on all bases of the phage fragment to be classified to obtain base sequence features.

Encoding the base sequence as a one-hot matrix allows every base to be represented in order while the sequence is preserved as a whole. The one-hot code for base A is [0, 0, 0, 1], the one-hot code for base T is [0, 0, 1, 0], the one-hot code for base C is [0, 1, 0, 0], and the one-hot code for base G is [1, 0, 0, 0].

In one example of the present application, assuming that the phage fragment C to be classified is ACGCTATTGCACCG, the base sequence features obtained after one-hot encoding of the phage fragment C to be classified are shown in Table 1.

Base	A	C	G	C	T	A	T	T	G	C	A	C	C	G
A	1	0	0	0	0	1	0	0	0	0	1	0	0	0
C	0	1	0	1	0	0	0	0	0	1	0	1	1	0
G	0	0	1	0	0	0	0	0	1	0	0	0	0	1
T	0	0	0	0	1	0	1	1	0	0	0	0	0	0

TABLE 1
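The encoding of step S22 can be sketched as follows, using the four codes stated in the text. The name one_hot_encode is hypothetical; each sequence position becomes one row of four values, so the layout is the transpose of a table with one row per base type, and the order of the four values follows the stated codes rather than any particular table row order.

```python
# One-hot codes as given in the description.
ONE_HOT = {
    "A": [0, 0, 0, 1],
    "T": [0, 0, 1, 0],
    "C": [0, 1, 0, 0],
    "G": [1, 0, 0, 0],
}

def one_hot_encode(seq):
    """Encode a base sequence as a list of 4-element rows, one per base.

    Raises KeyError for characters other than A, C, G and T.
    """
    return [list(ONE_HOT[base]) for base in seq.upper()]

features = one_hot_encode("ACGCTATTGCACCG")
print(len(features), len(features[0]))  # 14 4
print(features[:3])  # [[0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0]]
```

Every row contains exactly one 1, which is the property the classification networks rely on to read base identity directly from the input.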

Step S23: inputting the base sequence features into the at least one preset classification network.

Inputting the base sequence features into the at least one preset classification network means the following. When the sequence length of the phage fragment to be classified is less than 100bp, the phage fragment to be classified is padded to a length of 100bp, the padded phage fragment is one-hot encoded, and the resulting base sequence features are input into the first preset classification network;

when the sequence length of the phage fragments to be classified is within the length interval of 100bp-400bp, inputting the base sequence characteristics into a first preset classification network;

when the sequence length of the phage fragments to be classified is within the length interval of 400bp-800bp, inputting the base sequence characteristics into a second preset classification network;

when the sequence length of the phage fragments to be classified is within the length interval of 800bp-1200bp, inputting the base sequence characteristics into a third preset classification network;

and when the sequence length of the phage fragments to be classified is within the length interval of 1200bp-1800bp, inputting the base sequence characteristics into a fourth preset classification network.

When the sequence length of the phage fragments to be classified is more than 1800bp, inputting base sequence characteristics obtained by coding the sequence fragments with the length of 1800bp after segmentation into a fourth preset classification network; inputting base sequence characteristics obtained by coding sequence fragments with the segmented length within the range of 800bp-1200bp into a third preset classification network; inputting base sequence characteristics obtained by coding sequence fragments with the segmented length within the range of 400bp-800bp into a second preset classification network; and inputting the base sequence characteristics obtained by coding the segmented sequence fragments with the lengths within the range of 100bp-400bp into a first preset classification network.
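The routing of a fragment (or of a segmented piece) to a preset classification network can be sketched as a lookup on its length. The name network_for_length is hypothetical. The text does not say which network receives a length that falls exactly on a shared boundary (400bp, 800bp or 1200bp); assigning it to the lower interval is an assumption made here.

```python
# (upper limit of the length interval in bp, number of the preset network)
INTERVALS = [(400, 1), (800, 2), (1200, 3), (1800, 4)]

def network_for_length(length):
    """Return the number (1-4) of the network that handles this length.

    Lengths below 100 bp map to network 1, since such fragments are
    padded to 100 bp first. Longer fragments must be segmented first.
    """
    if length > 1800:
        raise ValueError("segment the fragment at a scale of 1800 bp first")
    for upper, network in INTERVALS:
        if length <= upper:
            return network

print(network_for_length(600))   # 2
print(network_for_length(1800))  # 4
print(network_for_length(40))    # 1
```

For a 2400bp fragment split into 1800bp and 600bp pieces, this lookup selects the fourth and second preset classification networks.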

Step S24: determining the type of the phage fragment to be classified according to the output score of the at least one preset classification network.

The threshold score can be set according to the training data of the deep neural network. In the case of a single preset classification network, when the score output by the preset classification network is greater than the threshold score, the type of the phage fragment to be classified is determined to be a virulent phage; when the score output by the preset classification network is not greater than the threshold score, the type of the phage fragment to be classified is determined to be a temperate phage.

In the case of more than one preset classification network, that is, when the sequence length of the phage fragment to be classified is greater than 1800bp, the length of each sequence fragment obtained by segmenting the phage fragment to be classified is determined. The length of each sequence fragment may be taken as the length of the input sequence of the corresponding preset classification network.

Calculating a weighted average of the output scores of the at least one preset classification network according to the length of each input sequence; and determining the type of the phage fragments to be classified according to the weighted average value.

Calculating the weighted average of the output scores of the at least one preset classification network means that the output score of each preset classification network is multiplied by the upper limit value of the length interval corresponding to that preset classification network to obtain its weighted score, the weighted scores of all the preset classification networks in the at least one preset classification network are summed, and the sum is divided by the sum of the upper limit values of the length intervals corresponding to all of those preset classification networks to obtain the weighted average.

In an example of the present application, the preset classification networks determined to participate in determining the type of the phage fragment D to be classified include the first preset classification network and the fourth preset classification network, the length of the phage fragment D to be classified is 2000bp, and the phage fragment D to be classified is segmented to obtain a sequence fragment 3 with a length of 1800bp and a sequence fragment 4 with a length of 200bp. Assuming that the score output by the first preset classification network is 0.67 and the score output by the fourth preset classification network is 0.8, the weighted average is (400 × 0.67 + 1800 × 0.8)/(400 + 1800) ≈ 0.776.
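The weighted average and the final decision can be sketched as follows. The names weighted_score and classify are hypothetical, and the default threshold of 0.5 is an assumption taken from the training example; the text states that the threshold score may be set according to the training data.

```python
def weighted_score(scores_and_weights):
    """Weighted average of (score, weight) pairs.

    Each weight is the upper limit of the length interval of the
    network that produced the score, as in the example above.
    """
    total_weight = sum(weight for _, weight in scores_and_weights)
    return sum(score * weight for score, weight in scores_and_weights) / total_weight

def classify(score, threshold=0.5):
    """Virulent if the score exceeds the threshold, otherwise temperate."""
    return "virulent" if score > threshold else "temperate"

# Fragment D: first network (upper limit 400) and fourth network (upper limit 1800).
fragment_d = [(0.67, 400), (0.8, 1800)]
average = weighted_score(fragment_d)
print(round(average, 3), classify(average))  # 0.776 virulent
```

Because the weights are the interval upper limits, the score from the 1800bp piece dominates the result.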

Based on the same inventive concept, an embodiment of the application provides a phage classification device. FIG. 3 is a functional block diagram of the phage classification device according to an embodiment of the present application. As shown in FIG. 3, the device includes:

the acquisition module 31 is configured to acquire at least one preset classification network corresponding to a sequence length of a phage fragment to be classified according to the sequence length;

the coding module 32 is configured to perform one-hot encoding on all bases of the phage fragment to be classified to obtain base sequence features;

an input module 33, configured to input the base sequence features into the at least one preset classification network;

and the output module 34 is configured to determine the type of the phage fragment to be classified according to the output score of the at least one preset classification network.

Optionally, the obtaining module includes:

Optionally, the obtaining module further includes:

Optionally, the obtaining module includes:

Optionally, the output module includes:

Optionally, the apparatus further comprises:

the training submodule includes:

Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the phage classification method according to any of the above-mentioned embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the phage classification method according to any of the above embodiments of the present application.

Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive or descriptive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The above detailed description is provided for the method, apparatus, device and storage medium for classifying bacteriophage, and the above description of the embodiments is only used to help understanding the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method of phage classification, the method comprising:

2. The method according to claim 1, wherein obtaining at least one preset classification network corresponding to the sequence length of the phage fragments to be classified according to the sequence length comprises:

3. The method according to claim 2, wherein in the case that the sequence length is greater than 1800bp, obtaining at least one preset classification network corresponding to the sequence length according to the sequence length of the phage fragment to be classified comprises:

dividing the phage fragments to be classified according to a 1800bp scale;

4. The method according to claim 2, wherein obtaining at least one preset classification network corresponding to the sequence length of the phage fragments to be classified according to the sequence length comprises:

5. The method according to claim 1, wherein determining the type of the phage fragments to be classified according to the output score of the at least one preset classification network comprises:

6. The method of claim 1, further comprising:

7. The method of claim 6, wherein after extracting a plurality of first short sequence samples from the first full genome data, the method further comprises:

8. A phage sorting apparatus, comprising:

an input module for inputting the base sequence features into the at least one predetermined classification network;

9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any of claims 1-7.