CN112397138A - AI technology-based method for drawing strain protein two-dimensional spectrum

Info

Publication number: CN112397138A
Application number: CN202010995311.1A
Authority: CN
Inventors: 张辉; 王利
Original assignee: Inner Mongolia University for Nationalities
Current assignee: Minnan Normal University
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2021-02-23
Anticipated expiration: 2040-09-21
Also published as: CN112397138B

Abstract

The invention discloses a method for drawing a strain protein two-dimensional spectrum based on AI technology. Drawing on the characteristics of strain protein sequences and structures and on the forms of expression of music, the method uses AI technology to generate a two-dimensional music score from the strain protein structure, thereby establishing a one-to-one correspondence between strain protein sequences and music to assist the analysis and research of strain proteins. Once a strain protein is expressed as a two-dimensional spectrum, the differences between strain proteins can be seen directly in the spectrum, and the spectrum can also be played as music so that the differences can be perceived by ear, which provides a new method for strain protein research.

Description

AI technology-based method for drawing strain protein two-dimensional spectrum

Technical Field

The invention relates to the technical field of artificial intelligence application, in particular to a method for drawing a strain protein two-dimensional spectrum based on an AI (artificial intelligence) technology.

Background

In the life sciences, AI technology has gradually gained an irreplaceable position in data analysis. Proteins, as important components of living organisms, have diverse sequences and complex functional structures, so protein research remains an area of the life sciences that scientists have not yet fully mastered.

At present, proteins are characterized mainly by their amino acid sequence, spatial structure and the like. Whether proteins can be characterized in other forms, so as to improve their visualization and make them easier to analyze, has become a focus of research.

Disclosure of Invention

In view of the above, the invention provides a method for drawing a strain protein two-dimensional spectrum based on AI technology. The strain protein is characterized in the form of a two-dimensional spectrum by AI technology, so that, in addition to improving the visualization of the protein, different strain proteins correspond to different music, which assists the analysis and research of proteins both visually and aurally.

The technical scheme provided by the invention is specifically a method for drawing a strain protein two-dimensional spectrum based on AI technology, which is characterized by comprising the following steps:

S1: acquiring a primary structure and a secondary structure of a strain protein sample;

S2: treating the amino acid sequence in the primary structure of the strain protein sample as a linear arrangement to form one-dimensional single-channel data;

S3: projecting the four main-chain atoms in the secondary structure of the strain protein sample in each coordinate system of three-dimensional space to form three-channel data of the main-chain backbone atoms;

S4: constructing a protein generation two-dimensional spectrum model based on a generative adversarial network, and training the model on a plurality of strain protein samples, with a music style and a protein sequence as the respective constraint conditions, to obtain model parameters;

S5: drawing the two-dimensional spectrum of the strain protein using the protein generation two-dimensional spectrum model under the model parameters obtained in step S4.

Preferably, in step S2, regarding the amino acid sequences in the primary structure of the strain protein sample as linear arrangement, forming one-dimensional single-channel data, specifically:

setting the values of the 20 amino acids that form proteins to s₁ to s₂₀ within the image gray-value range of 0-255;

and forming one-dimensional single-channel data according to the order of the amino acids in the strain protein sample and the value assigned to each amino acid.
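
As an illustration of the encoding in step S2, the following minimal Python sketch assigns a gray value to each amino acid and encodes a sequence. The even spacing of s₁ to s₂₀ over 0-255, the alphabetical ordering of the one-letter codes, and the function names `gray_values` and `encode_primary_structure` are assumptions made for this sketch; the source states only that the values lie within 0-255.

```python
# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gray_values(alphabet=AMINO_ACIDS, low=0, high=255):
    """Assign each amino acid a gray value spread evenly over [low, high].

    The even spacing is an assumption; the source does not fix s1..s20.
    """
    step = (high - low) / (len(alphabet) - 1)
    return {aa: round(low + i * step) for i, aa in enumerate(alphabet)}


def encode_primary_structure(sequence, table=None):
    """Return the one-dimensional single-channel data for a sequence."""
    table = table or gray_values()
    return [table[aa] for aa in sequence.upper()]


if __name__ == "__main__":
    # Encode a short, arbitrary example sequence.
    print(encode_primary_structure("MFVFLV"))
```

Any other assignment of 20 distinct values within 0-255 would fit the description equally well.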

Further preferably, in step S3, the projection of four main chain atoms in the secondary structure of the strain protein sample in each coordinate system of the three-dimensional space forms three-channel data of main chain skeleton atoms, specifically:

setting the values of the main-chain backbone atoms Cα, C, N and O to k₁, k₂, k₃ and k₄, respectively, within the image gray-value range of 0-255;

and projecting the four main-chain atoms in each coordinate system of three-dimensional space to form a three-channel gray-scale distribution image of the main-chain backbone atoms, the data being three-channel data.
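
The projection in step S3 can be sketched as follows. This is only one possible reading: it assumes that the three channels are the projections onto the xy, xz and yz planes, that coordinates are binned onto a small square grid, and that the larger gray value wins when two atoms share a pixel. The concrete values in `K_VALUES`, the grid parameters `size` and `extent`, and all function names are hypothetical.

```python
# Hypothetical gray values k1..k4 for the backbone atoms C-alpha, C, N, O.
# The source gives no concrete numbers, only the range 0-255.
K_VALUES = {"CA": 64, "C": 128, "N": 192, "O": 255}

# Assumption: one channel per coordinate plane (xy, xz, yz).
PLANES = ((0, 1), (0, 2), (1, 2))


def _to_pixel(value, extent, size):
    """Map a coordinate in [-extent, extent] to a pixel index in [0, size-1]."""
    index = int((value + extent) / (2 * extent) * size)
    return min(size - 1, max(0, index))


def project_backbone(atoms, k_values=K_VALUES, size=16, extent=20.0):
    """Return three size x size gray images, one per coordinate plane.

    atoms is a list of (atom_name, x, y, z) tuples. If two atoms land on the
    same pixel, the larger gray value is kept (an arbitrary choice).
    """
    images = [[[0] * size for _ in range(size)] for _ in PLANES]
    for name, x, y, z in atoms:
        coords = (x, y, z)
        for image, (a, b) in zip(images, PLANES):
            col = _to_pixel(coords[a], extent, size)
            row = _to_pixel(coords[b], extent, size)
            image[row][col] = max(image[row][col], k_values[name])
    return images


if __name__ == "__main__":
    demo = [("CA", 0.0, 0.0, 0.0), ("N", 1.0, -1.0, 3.0)]
    for channel in project_backbone(demo, size=4, extent=2.0):
        print(channel)
```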

Further preferably, the plurality of strain protein samples comprise: natural strain protein samples and strain protein samples added by augmentation with a generative adversarial network.

Further preferably, in step S4, the protein generation two-dimensional spectrum model constructed based on the generative adversarial network includes:

a two-dimensional spectrum generator, a music generation discriminator, a music style discriminator, a protein inverse generator and a protein discriminator;

the training process of the protein generation two-dimensional spectrum model comprises the following steps:

S401: inputting the single-channel data of the primary structure of the strain protein sample, the three-channel data of the secondary structure of the strain protein sample and the single-channel data of the music style constraint into the two-dimensional spectrum generator to generate a two-dimensional spectrum and output a musical work;

S402: judging, by the music discriminator, the difference between the music generated by the two-dimensional spectrum generator and real music;

S403: judging, by the music style discriminator, whether the generated music complies with the specified style constraint;

S404: adjusting the model parameters of the two-dimensional spectrum generator and the music discriminator according to the discrimination results of steps S402 and S403 until the threshold requirement is met;

S405: generating, by the protein inverse generator, an artificial protein sequence, with the two-dimensional spectrum generated by the two-dimensional spectrum generator and a protein sequence constraint as its inputs;

S406: judging, by the protein discriminator, the difference between the artificial protein sequence and the real protein sequence; if the difference exceeds a threshold, adjusting the model parameters of the two-dimensional spectrum generator and the music discriminator and repeating steps S401 to S405 until the difference between the artificial protein sequence and the real protein sequence meets the threshold requirement.

Further preferably, the first layer of the two-dimensional spectrum generator is a mixed convolution layer composed of single-channel and three-channel branches, and different branches perform convolution on the different input data: for the input protein primary structure data, 20 kinds of 3 × 1 convolution kernels are used according to the amino acid distribution characteristics, with a stride of 3; for the input protein secondary structure data, 4 kinds of 3 × 3 × 3 convolution kernels, corresponding to the main-chain backbone atoms, are used according to the three-dimensional distribution characteristics of the amino acids, with a stride of 3; for the music style constraint data, m 3 × 1 convolution kernels corresponding to the music style constraint are used, with a stride of 1;

the intermediate layers follow the CycleGAN model but, to retain features to the greatest extent, use no pooling layers, and every layer uses the LReLU (leaky ReLU) activation function;

and the output layer uses Softmax for synthesis, completing the drawing of the two-dimensional spectrum.
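
To make the kernel sizes and strides above concrete, the following sketch applies a single length-3 kernel to a one-dimensional signal with a given stride and computes the resulting output length. It assumes no padding ("valid" convolution), which the source does not specify, and the kernel weights are arbitrary placeholders rather than trained values.

```python
def conv1d(signal, kernel, stride):
    """Strided one-dimensional convolution without padding.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is a cross-correlation).
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[start:start + k]))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def output_length(n, kernel_size, stride):
    """Number of outputs for an input of length n (n >= kernel_size), no padding."""
    return (n - kernel_size) // stride + 1


if __name__ == "__main__":
    encoded = [1, 2, 3, 4, 5, 6, 7]       # placeholder single-channel data
    placeholder_kernel = [1, 1, 1]        # arbitrary weights, kernel size 3
    print(conv1d(encoded, placeholder_kernel, stride=3))
    print(output_length(len(encoded), 3, 3), output_length(len(encoded), 3, 1))
```

With a stride equal to the kernel size, as in the primary and secondary structure branches, the windows do not overlap, so each output summarizes a separate group of three inputs; with a stride of 1, as in the style branch, neighboring windows overlap.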

Further preferably, the objective function of the protein generation two-dimensional spectrum model is:

T = min(G + L₁ + L₂ + F + L₃);

wherein G is the objective function of the two-dimensional spectrum generator (G1):

G(X1, C) = max(E_p[f_p(X1, C)]);

L₁ is the objective function of the music generation discriminator (D1):

L₁(D, G) = min_G max_D (E_x[log D(X2, C)] + E_y[log(1 - D(G(X1, C)))]);

L₂ is the objective function of the music style discriminator (D2):

L₂(Y, C) = max(E_p[f_p(Y, C)]);

F is the objective function of the protein inverse generator (F1):

F(Y, L) = max(E_p[f_p(Y, L, X3)]);

L₃ is the objective function of the protein discriminator (D3):

L₃(Z, X1) = max(E_p[f_p(Z, X1)]);

wherein X1 is the mixed-channel data of the protein primary and secondary structures, Z is the artificial protein sequence generated by F1, X2 is real music data, X3 is the discrimination result for Z, C is the music style constraint, L is the protein sequence feature constraint, and Y is the two-dimensional-spectrum music data generated by the two-dimensional spectrum generator G1.
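
The term L₁ has the form of the standard min-max objective of a generative adversarial network. As a hedged numerical illustration, the sketch below estimates the bracketed expression from sample averages. The inputs are hypothetical discriminator outputs, taken to be probabilities strictly between 0 and 1; the function name `adversarial_value` is not from the source.

```python
import math


def adversarial_value(d_real, d_fake):
    """Sample estimate of E_x[log D(real)] + E_y[log(1 - D(generated))].

    d_real: discriminator outputs on real music samples.
    d_fake: discriminator outputs on generated music samples.
    All values are assumed to lie strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # A discriminator that cannot tell real from generated outputs 0.5 everywhere.
    print(adversarial_value([0.5, 0.5], [0.5, 0.5]))
    # A discriminator that separates the two well attains a larger value.
    print(adversarial_value([0.9, 0.8], [0.1, 0.2]))
```

The discriminator is trained to increase this value and the generator to decrease it, which is what the min over G and max over D express.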

Further preferably, the strain protein is a novel coronavirus protein.

The method for drawing a strain protein two-dimensional spectrum based on AI technology provided by the invention draws on the characteristics of strain protein sequences and structures and on the forms of expression of music, and uses AI technology to generate a two-dimensional music score from the strain protein structure, thereby establishing a one-to-one correspondence between strain protein sequences and music to assist the analysis and research of strain proteins. Once a strain protein is expressed as a two-dimensional spectrum, the differences between strain proteins can be seen directly in the spectrum, and the spectrum can also be played as music so that the differences can be perceived by ear, which provides a new method for strain protein research.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as disclosed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to illustrate more clearly the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flow block diagram of the method for drawing a strain protein two-dimensional spectrum based on AI technology according to an embodiment of the disclosure;

FIG. 2 is a detailed flow diagram of the method for drawing a strain protein two-dimensional spectrum based on AI technology according to an embodiment of the disclosure;

FIG. 3 is a diagram of the protein generation two-dimensional spectrum model constructed based on a generative adversarial network in the method for drawing a strain protein two-dimensional spectrum based on AI technology according to an embodiment of the disclosure;

FIG. 4 is a training flowchart of the protein generation two-dimensional spectrum model in the method for drawing a strain protein two-dimensional spectrum based on AI technology according to an embodiment of the disclosure;

FIG. 5 is a schematic structural diagram of the two-dimensional spectrum generator G1 in the method for drawing a strain protein two-dimensional spectrum based on AI technology according to an embodiment of the disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of methods consistent with certain aspects of the invention, as detailed in the appended claims.

In order to characterize strain proteins from another perspective and thereby assist protein analysis and research, this embodiment provides a method for drawing a strain protein two-dimensional spectrum based on AI technology. The basic building blocks of strain proteins are generally the 20 amino acids, and the basic building blocks of music are the seven notes of the scale; although the numbers of basic elements differ, the two sets of elements can be matched by designing a mapping method.
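
As a simple illustration of how 20 elements can be matched to seven, the sketch below spreads the amino acids over the seven note names in successive octaves. This fixed table is purely hypothetical and is not the mapping of the embodiment, which obtains the correspondence through the trained model; the ordering of the amino acids, the starting octave and the function names are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, alphabetical order
SCALE_NOTES = "CDEFGAB"               # the seven note names of the scale


def pitch_table(amino_acids=AMINO_ACIDS, notes=SCALE_NOTES, base_octave=3):
    """Give each amino acid a distinct pitch by cycling through octaves."""
    table = {}
    for index, aa in enumerate(amino_acids):
        octave, degree = divmod(index, len(notes))
        table[aa] = f"{notes[degree]}{base_octave + octave}"
    return table


def sequence_to_pitches(sequence, table=None):
    """Translate an amino acid sequence into a list of pitch names."""
    table = table or pitch_table()
    return [table[aa] for aa in sequence.upper()]


if __name__ == "__main__":
    print(sequence_to_pitches("MFVFLV"))
```

Twenty amino acids need three octaves under this scheme (7 + 7 + 6), which shows that the difference in the number of basic elements is not an obstacle to a one-to-one mapping.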

Beyond the primary structure formed by different arrangements of the 20 amino acids, a protein can also adopt various spatial conformations through covalent and non-covalent bonds, forming biological macromolecules with diverse shapes and functions. Music likewise forms basic tunes from different permutations and combinations of scale notes and then, through the combination and adjustment of rhythm, harmony, dynamics, tonality, musical form, texture and timbre, forms styles and melodies with different characteristics that give listeners different sensory experiences.

In view of these characteristics of strain protein sequences and structures and of the forms of expression of music, AI technology can be used to generate a two-dimensional score from the strain protein structure, thereby establishing a relationship between the protein sequence of the novel coronavirus and music.

The method for drawing a strain protein two-dimensional spectrum based on AI technology mainly comprises the following. A training and test data set for generating two-dimensional music scores from viral proteins is established and, because strain protein samples are scarce, the viral samples are augmented using a generative adversarial network. A mapping between amino acid structure and musical structure is designed to relate the two forms of expression, and a two-dimensional-spectrum music generation method is established based on generative adversarial network technology. The primary and secondary structures of the protein and the music style are input to the generator as constraints, and a two-dimensional spectrum conforming to the specified music style is generated; from this two-dimensional spectrum, a protein generator can in turn generate a protein conforming to the structure of the novel coronavirus. The generated music is fed as a new input to both a music discriminator and the protein generator: the music discriminator judges whether the generated music obeys the rules of composition, while the protein generator generates a protein-like secondary and tertiary structure that is compared with the original protein to ensure that the music and the protein remain related.

Referring to FIG. 1, which shows the overall framework flow of the method for drawing a strain protein two-dimensional spectrum based on AI technology, and to FIG. 2, the method provided in this embodiment, under the guidance of that framework flow, specifically includes the following steps:

In the above method, the primary structure of the strain protein sample, the secondary structure of the strain protein sample, the musical style constraint and the protein sequence constraint are used as input data.

Wherein:

Primary structure data of the strain protein sample: the primary amino acid sequence of the protein is used as input. Because the primary structure of a protein is a linear chain formed by dehydration condensation of amino acids joined one after another, the amino acid sequence in the primary structure is treated as a linear arrangement, and the data form one-dimensional single-channel data, that is, one-dimensional data. The values of the 20 amino acids that form proteins are set to s₁ to s₂₀ within the image gray-value range of 0-255.

Secondary structure data of the strain protein sample: because of α-helices and β-sheets, and because multiple secondary-chain amino acids are attached to the main chain, the secondary structure of the protein sequence is distributed in three-dimensional space. Secondary-chain amino acids that have little influence on the spatial characteristics are neglected. Within the image gray-value range of 0-255, the values of the main-chain backbone atoms Cα, C, N and O are set to k₁, k₂, k₃ and k₄, representing k₁%, k₂%, k₃% and k₄% black, respectively. The projections of the four main-chain atoms in each coordinate system of three-dimensional space form a three-channel gray-scale distribution image of the main-chain backbone atoms; the data are three-channel data, that is, three-dimensional data.

Music style constraint: the tone, chords and rhythm of each specific music style are combined into a music style constraint according to the category of the musical work, the rules of music composition and the like.

Protein sequence constraint: the combined features abstracted from the one-dimensional and three-dimensional properties of the amino acid sequences in the primary and secondary structures of the novel coronavirus are used as the protein sequence constraint of the model.

Training the protein generation two-dimensional spectrum model requires a large amount of strain protein sample data, and for some viruses, such as the novel coronavirus, the sample size is too small to meet the training requirement. The strain protein samples can therefore be augmented artificially to meet the training requirement of the model. Several augmentation methods are available; this scheme augments the strain protein samples by deep learning with a generative adversarial network.

Referring to FIG. 3, the protein generation two-dimensional spectrum model provided in this embodiment is built on a generative adversarial network. Taking the generation of a two-dimensional score from a strain protein and music of a specific style as the research objects, a protein-to-music generation model is designed. The model includes a two-dimensional spectrum generator G1, a music generation discriminator D1, a music style discriminator D2, a protein inverse generator F1 and a protein discriminator D3. Given the primary and secondary structure data X1 of the input protein, the music style constraint C and the protein sequence constraint L, the model generates a two-dimensional spectrum and outputs a musical work.

The training process for generating the two-dimensional spectral model for the above-mentioned proteins, see fig. 4, includes:

S401: inputting the mixed-channel data X1, formed by the single-channel data of the primary structure of the strain protein sample and the three-channel data of the secondary structure of the strain protein sample, together with the single-channel data C of the music style constraint, into the two-dimensional spectrum generator G1 to generate a two-dimensional spectrum and output a musical work;

S402: judging, by the music discriminator D1, the difference between the music generated by the two-dimensional spectrum generator G1 and real music;

S403: judging, by the music style discriminator D2, whether the generated music complies with the specified style constraint, so that the music style generated by G1 stays consistent with the specific viral sequence;

S404: adjusting the model parameters of the two-dimensional spectrum generator G1 and the music discriminator D1 according to the discrimination results of steps S402 and S403 until the threshold requirement is met;

S405: generating, by the protein inverse generator F1, an artificial protein sequence X3, with the two-dimensional spectrum generated by the two-dimensional spectrum generator G1 and the protein sequence constraint L as its inputs;

S406: judging, by the protein discriminator D3, the difference between the artificial protein sequence X3 and the real protein sequence X1; if the difference exceeds a threshold, adjusting the model parameters of the two-dimensional spectrum generator G1 and the music discriminator D1 and repeating steps S401 to S405 until the difference between the artificial protein sequence X3 and the real protein sequence X1 meets the threshold requirement.
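
The control flow of steps S401 to S406 can be sketched as two nested loops: an inner loop that adjusts parameters until the music-side checks pass, and an outer loop that repeats until the protein-side check passes. The sketch below shows only this control flow. The `model` interface, its method names, the iteration limits and the `ToyModel` stand-in (whose "differences" simply halve at each adjustment) are all hypothetical; no real networks are trained.

```python
def train(model, music_threshold, protein_threshold, max_rounds=100, max_inner=100):
    """Run the S401-S406 loop and return the number of outer rounds used."""
    for round_index in range(1, max_rounds + 1):
        for _ in range(max_inner):
            score = model.generate()                    # S401: draw a spectrum
            music_gap = model.music_difference(score)   # S402: compare with real music
            style_ok = model.style_matches(score)       # S403: check style constraint
            if music_gap <= music_threshold and style_ok:
                break
            model.adjust_parameters()                   # S404: adjust G1 and D1
        else:
            raise RuntimeError("music-side threshold not met")
        sequence = model.inverse_generate(score)        # S405: artificial sequence
        if model.protein_difference(sequence) <= protein_threshold:  # S406
            return round_index
        model.adjust_after_protein_check()              # S406: adjust, then repeat
    raise RuntimeError("protein-side threshold not met")


class ToyModel:
    """Stand-in whose differences halve on every adjustment (illustration only)."""

    def __init__(self):
        self.music_gap = 1.0
        self.protein_gap = 1.0

    def generate(self):
        return "score"

    def music_difference(self, score):
        return self.music_gap

    def style_matches(self, score):
        return self.music_gap < 0.5

    def adjust_parameters(self):
        self.music_gap *= 0.5

    def inverse_generate(self, score):
        return "sequence"

    def protein_difference(self, sequence):
        return self.protein_gap

    def adjust_after_protein_check(self):
        self.protein_gap *= 0.5


if __name__ == "__main__":
    print(train(ToyModel(), music_threshold=0.1, protein_threshold=0.1))
```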

The input data of the two-dimensional spectrum generator G1 comprises the mixed-channel data X1, composed of the single-channel data of the protein primary structure and the three-channel data generated from the secondary structure, together with the single-channel data of the music style constraint C; that is, X1 + C constitutes mixed multi-channel data.

Referring to FIG. 5, which shows the model structure of the two-dimensional spectrum generator G1, the first layer of G1 is a mixed convolution layer composed of single-channel and three-channel branches, corresponding respectively to the protein primary structure, the music style constraint and the protein secondary spatial structure, and is used to collect the primary features of the viral protein. Different branches perform convolution on the different input data. To extract amino acid features, for the input protein primary structure data, 20 kinds of 3 × 1 convolution kernels are used according to the amino acid distribution characteristics, with a stride of 3; for the input protein secondary structure data, 4 kinds of 3 × 3 × 3 convolution kernels, corresponding to the main-chain backbone atoms, are used according to the three-dimensional distribution characteristics of the amino acids, with a stride of 3; for the music style constraint data, m 3 × 1 convolution kernels corresponding to the music style constraint are used, with a stride of 1.

The objective function of the protein generation two-dimensional spectrum model is as follows:

T = min(G + L₁ + L₂ + F + L₃);

wherein G is the objective function of the two-dimensional spectrum generator (G1):

G(X1, C) = max(E_p[f_p(X1, C)]);

L₁ is the objective function of the music generation discriminator (D1):

L₁(D, G) = min_G max_D (E_x[log D(X2, C)] + E_y[log(1 - D(G(X1, C)))]);

L₂ is the objective function of the music style discriminator (D2):

L₂(Y, C) = max(E_p[f_p(Y, C)]);

F is the objective function of the protein inverse generator (F1):

F(Y, L) = max(E_p[f_p(Y, L, X3)]);

L₃ is the objective function of the protein discriminator (D3):

L₃(Z, X1) = max(E_p[f_p(Z, X1)]).

The method for drawing a strain protein two-dimensional spectrum based on AI technology provided by this embodiment is particularly suitable for research on novel coronavirus proteins.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It is to be understood that the present invention is not limited to what has been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method for drawing a strain protein two-dimensional spectrum based on AI technology is characterized by comprising the following steps:

2. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 1, wherein in step S2, the amino acid sequences in the primary structure of the strain protein sample are regarded as linear arrangement to form one-dimensional single-channel data, specifically:

3. The AI-technology-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 1, wherein in step S3, the projections of four main chain atoms in the secondary structure of the strain protein sample in each coordinate system of a three-dimensional space form three-channel data of main chain skeleton atoms, specifically:

4. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 1, wherein the plurality of strain protein samples comprise: natural strain protein samples and strain protein samples added by augmentation with a generative adversarial network.

5. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 1, wherein in step S4, the protein generation two-dimensional spectrum model constructed based on the generative adversarial network comprises:

a two-dimensional spectrum generator (G1), a music generation discriminator (D1), a music style discriminator (D2), a protein inverse generator (F1), and a protein discriminator (D3);

S401: inputting the single-channel data of the primary structure of the strain protein sample, the three-channel data of the secondary structure of the strain protein sample and the single-channel data of the music style constraint into the two-dimensional spectrum generator (G1) to generate a two-dimensional spectrum and output a musical work;

S402: judging, by the music discriminator (D1), the difference between the music generated by the two-dimensional spectrum generator (G1) and real music;

S403: judging, by the music style discriminator (D2), whether the generated music complies with the specified style constraint;

S404: adjusting the model parameters of the two-dimensional spectrum generator (G1) and the music discriminator (D1) according to the discrimination results of steps S402 and S403 until the threshold requirement is met;

S405: generating, by the protein inverse generator (F1), an artificial protein sequence (X3), with the two-dimensional spectrum generated by the two-dimensional spectrum generator (G1) and a protein sequence constraint (L) as its inputs;

S406: judging, by the protein discriminator (D3), the difference between the artificial protein sequence (X3) and the real protein sequence (X1); if the difference exceeds a threshold, adjusting the model parameters of the two-dimensional spectrum generator (G1) and the music discriminator (D1) and repeating steps S401 to S405 until the difference between the artificial protein sequence (X3) and the real protein sequence (X1) meets the threshold requirement.

6. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 5, wherein the first layer of the two-dimensional spectrum generator (G1) is a mixed convolution layer composed of single-channel and three-channel branches, and different branches perform convolution on the different input data: for the input protein primary structure data, 20 kinds of 3 × 1 convolution kernels are used according to the amino acid distribution characteristics, with a stride of 3; for the input protein secondary structure data, 4 kinds of 3 × 3 × 3 convolution kernels, corresponding to the main-chain backbone atoms, are used according to the three-dimensional distribution characteristics of the amino acids, with a stride of 3; for the music style constraint data, m 3 × 1 convolution kernels corresponding to the music style constraint are used, with a stride of 1.

7. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 5, wherein the objective function of the protein generation two-dimensional spectrum model is as follows:

T = min(G + L₁ + L₂ + F + L₃);

wherein G is the objective function of the two-dimensional spectrum generator (G1):

G(X1, C) = max(E_p[f_p(X1, C)]);

L₁ is the objective function of the music generation discriminator (D1):

L₁(D, G) = min_G max_D (E_x[log D(X2, C)] + E_y[log(1 - D(G(X1, C)))]);

L₂ is the objective function of the music style discriminator (D2):

L₂(Y, C) = max(E_p[f_p(Y, C)]);

F is the objective function of the protein inverse generator (F1):

F(Y, L) = max(E_p[f_p(Y, L, X3)]);

L₃ is the objective function of the protein discriminator (D3):

L₃(Z, X1) = max(E_p[f_p(Z, X1)]).

8. The AI-based method for drawing a two-dimensional spectrum of strain proteins as claimed in claim 1, wherein the strain proteins are novel coronavirus proteins.