CN114496103A - Method and device for analyzing movement process of transcription factor protein on DNA - Google Patents

Publication number
CN114496103A
Authority
CN
China
Prior art keywords
conformation
simulation
dna
protein
round
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210128910.2A
Other languages
Chinese (zh)
Inventor
鄂超
伍庭晨
谢潇
贾慧彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Digsur Science And Technology Co ltd
Original Assignee
Beijing Digsur Science And Technology Co ltd
Application filed by Beijing Digsur Science And Technology Co ltd
Priority to CN202210128910.2A
Publication of CN114496103A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for analyzing the movement process of a transcription factor protein on DNA. The method comprises: obtaining an all-atom MD simulation trajectory of a WRKY domain protein moving on a DNA double strand as an initial path; uniformly extracting multiple frames of WRKY domain protein and DNA conformations from the initial path as first conformations; calculating a root-mean-square distance and a rotation angle, clustering the first conformations with a K-means algorithm, and performing a first round of MD simulation to obtain second conformations; clustering the second conformations with a K-centers algorithm and performing a second round of MD simulation over a wide conformational space to obtain third conformations; and constructing a Markov state model and analyzing the second-round MD simulation trajectories. In this way, protein movement on DNA at the base-pair or nanometer scale, which occurs on microsecond-to-millisecond timescales and is difficult to observe directly at the precision of current experimental measurements, can be analyzed, providing a physical mechanism that cannot be obtained from experiments or direct simulation.

Description

Method and device for analyzing movement process of transcription factor protein on DNA
Technical Field
The present invention relates generally to the fields of molecular dynamics and biological computing, and more particularly, to a method and apparatus for analyzing the course of movement of transcription factor proteins on DNA.
Background
One-dimensional sliding of transcription factor proteins on DNA is critical to the facilitated diffusion by which transcription factors locate target DNA sites in gene regulation. To search efficiently for specific DNA recognition sites, transcription factors often need to intermittently undergo one-dimensional random-walk diffusion (1D diffusion) along nonspecific DNA sequences; the speed and accuracy of the target-site search also directly affect the efficiency and accuracy of gene expression.
Although the above diffusion model of transcription factor proteins is clear, the fine molecular dynamics during diffusion are not. In particular, protein movement on DNA at the base-pair or nanometer scale, which occurs on microsecond-to-millisecond timescales, is difficult to observe directly at the precision of current experimental measurements.
Disclosure of Invention
According to an embodiment of the present invention, a scheme for analyzing the movement process of a transcription factor protein on DNA is provided. By establishing a Markov state model, the scheme determines a reduced kinetic network model for the most essential conformational changes of the biomolecule, and can provide a physical mechanism that cannot be obtained from experiments or direct simulation.
In a first aspect of the invention, a method for analyzing the movement process of a transcription factor protein on DNA is provided. The method comprises the following steps:
acquiring an all-atom MD simulation trajectory of a WRKY domain protein moving on a DNA double strand as an initial path;
uniformly extracting multiple frames of WRKY domain protein and DNA conformations from the initial path as first conformations;
calculating the root-mean-square distance of the alpha carbon atoms of the protein backbone and the rotation angle of the protein around the DNA, clustering the first conformations using a K-means algorithm to obtain a primary clustering result, and performing a first round of MD simulation starting from the central conformation of each cluster in the primary clustering result to obtain second conformations;
clustering the second conformations using a K-centers algorithm to obtain a secondary clustering result, and performing a second round of MD simulation over a wide conformational space starting from the central conformation of each cluster in the secondary clustering result to obtain third conformations;
and constructing a Markov state model using the third conformations, and analyzing the second-round MD simulation trajectories.
Further, the calculating the root mean square distance of alpha carbon atoms of the protein skeleton and the rotation angle of the protein around the DNA comprises the following steps:
setting the long axis of the DNA double chain in the initial crystal structure as the X axis and the mass center of the whole DNA molecule as the origin of the space coordinate;
aligning the first conformation with the initial crystal structure based on a number of base pairs of the DNA duplex intermediate adjacent to the protein;
calculating the root mean square distance of the aligned alpha carbon atoms of the protein skeleton;
projecting a vector from a protein centroid to an origin connecting line to a YZ plane; the YZ plane is a plane which is vertical to the X axis and passes through the origin;
setting the initial angle of the initial crystal structure to 0, and calculating the angle between the vector projection of the initial crystal structure and the first conformation after the alignment operation as the rotation angle of the protein around the DNA.
Further, the first round of MD simulation results in a second conformation comprising:
under the force field of parmbsc1, establishing a corresponding all-atom simulation system for the central structure corresponding to each cluster in the primary clustering result;
setting a time step and a random initial speed in an NPT ensemble, and performing MD simulation on the all-atom simulation system to obtain a first round of MD simulation track;
and deleting the conformation in the initial period of time of each simulation track in the first round of MD simulation tracks, and taking all the conformations in the rest simulation tracks as a second conformation.
Further, said clustering said second conformations using a K-centers algorithm comprises:
selecting a plurality of distance pairs with the possibility of forming hydrogen bonds and salt bridges between the protein and the DNA as input parameters of clustering, and clustering the second conformation by using a K-centers algorithm.
Further, the performing a second round of MD simulation in the extensive space to obtain a third conformation, comprises:
selecting a full-atom simulation system of a central conformation corresponding to each cluster in the secondary clustering result as an initial system of a second round of MD simulation;
setting a time step and a random initial speed in the NPT ensemble, and performing MD simulation on the initial system of the second round of MD simulation to obtain a second round of MD simulation track;
and deleting the conformation in the initial period of time of each simulated track in the second round of MD simulated tracks, and taking all the conformations in the rest simulated tracks as a third conformation.
Further, constructing a Markov state model using the third conformation includes:
calculating a plurality of distance pairs with the possibility of forming hydrogen bonds and salt bridges between the protein and the DNA in the third conformation as input parameters of clustering;
reducing the dimensions of the data of the distance pairs, taking the data of the distance pairs after dimension reduction as input, and clustering the third conformation into a plurality of micro-states by utilizing a K-centers algorithm;
calculating the implied timescales for different lag times and numbers of micro-states, generating implied-timescale curves, and selecting the lag time and number of micro-states corresponding to the flat part of the curves as parameters of the Markov state model;
classifying the plurality of micro-state conformations into macro-state conformations, and calculating the number of the conformations in each macro-state and the proportion of the number of the conformations in the macro-state to the total conformations to make the proportion of the number of the conformations in the macro-state to the total conformations larger than a preset proportion threshold value.
Further, the analyzing the second round MD simulated trajectory includes:
constructing a plurality of Monte Carlo tracks according to the transition probability matrix of the micro state of the Markov state model, and setting the time step length of the Monte Carlo tracks;
calculating the first transition time between the macroscopic states in each Monte Carlo track, and then calculating the average first transition time of the macroscopic states in each Monte Carlo track;
according to the proportion of the total conformation number in each macro-state and the average first transition time of each macro-state, taking the path with the shortest total transition time as the shortest path of WRKY protein movement on DNA, and taking the step with the longest average first transition time in the shortest path as the key rate-limiting step.
In a second aspect of the present invention, there is provided an apparatus for analyzing a process of movement of a transcription factor protein on DNA. The device includes:
the acquisition module is used for acquiring a full-atom MD simulation track of WRKY structural domain protein moving on a DNA double chain as an initial path;
the extraction module is used for uniformly extracting the conformations of the WRKY domain protein and the DNA of a plurality of frames from the initial path to serve as a first conformation;
the first clustering simulation module is used for calculating the root-mean-square distance of alpha carbon atoms of a protein framework and the rotation angle of the protein around DNA, clustering the first conformation by using a K-means algorithm to obtain a primary clustering result, and performing a first round of MD simulation starting from a central conformation corresponding to each cluster in the primary clustering result to obtain a second conformation;
the second clustering simulation module is used for clustering the second conformations by utilizing a K-centers algorithm to obtain secondary clustering results, and performing a second round of MD simulation in a wide space from a central conformation corresponding to each cluster in the secondary clustering results to obtain a third conformation;
and the modeling analysis module is used for constructing a Markov state model by utilizing the second-round MD simulation track and analyzing the second-round MD simulation track.
In a third aspect of the invention, an electronic device is provided. The electronic device comprises at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the invention.
In a fourth aspect of the invention, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the invention.
It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of any embodiment of the invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present invention will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a flowchart of a method for analyzing a process of movement of a transcription factor protein on DNA according to an embodiment of the present invention;
FIG. 2 shows a flow chart of a method for calculating the root mean square distance of alpha carbon atoms of a protein backbone and the rotation angle of the protein around DNA according to an embodiment of the invention;
FIG. 3 shows a flow diagram for constructing a Markov state model using the third conformation in accordance with embodiments of the invention;
FIG. 4 shows a flow diagram for analyzing a second round of MD simulated trajectories according to an embodiment of the present invention;
FIG. 5 is a block diagram showing a device for analyzing a movement process of a transcription factor protein on DNA according to an embodiment of the present invention;
FIG. 6 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present invention;
where 600 denotes an electronic device, 601 denotes a CPU, 602 denotes a ROM, 603 denotes a RAM, 604 denotes a bus, 605 denotes an I/O interface, 606 denotes an input unit, 607 denotes an output unit, 608 denotes a storage unit, and 609 denotes a communication unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the term "and/or" herein merely describes an association between associated objects and covers three cases; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
FIG. 1 is a flowchart showing a method for analyzing a process of movement of a transcription factor protein on DNA according to an embodiment of the present invention.
The method comprises the following steps:
S101, acquiring an all-atom MD simulation trajectory of a WRKY domain protein moving on a DNA double strand as an initial path.
Transcription Factors (TF), also known as trans-acting factors, refer to DNA binding proteins that specifically interact with cis-acting elements of eukaryotic genes and activate or inhibit transcription of the genes.
The WRKY structural domain protein is a specific zinc finger protein transcription factor of the plant, and plays an important role in regulation and control on aspects of plant defense, growth, development, metabolism and the like. The WRKY protein structure used here has about 60 amino acids and binds to the W-box DNA recognition sequence.
As an example of the present invention, a 10-microsecond all-atom MD simulation trajectory of a WRKY domain protein stepping a distance of 1 bp on a 34 bp homopolymeric poly-A DNA duplex is used as the initial path from which further conformational sampling is launched.
S102, uniformly extracting conformations of the WRKY protein and the DNA of a plurality of frames from the initial path to be used as a first conformation.
In this example, 10000 frames (i.e., one frame per nanosecond) of WRKY protein and DNA conformations are uniformly extracted from the 10 μs MD simulation trajectory. A total of 10000 frames is large enough to include all representative conformations.
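The uniform extraction in S102 can be sketched as follows. This is a minimal illustration only: the function name `uniform_frame_indices` and the assumed saving interval of 100 ps (which would give 100000 stored frames for a 10 μs trajectory) are hypothetical and are not taken from the embodiment.

```python
def uniform_frame_indices(n_total_frames, n_keep):
    """Return n_keep evenly spaced frame indices from a trajectory of n_total_frames."""
    if not 0 < n_keep <= n_total_frames:
        raise ValueError("n_keep must be between 1 and n_total_frames")
    stride = n_total_frames / n_keep
    return [int(i * stride) for i in range(n_keep)]


# Hypothetical example: a 10 us trajectory saved every 100 ps holds 100000 frames;
# keeping 10000 of them corresponds to one frame per nanosecond.
indices = uniform_frame_indices(100_000, 10_000)
```

The returned indices would then be used to pull the corresponding protein and DNA coordinates out of the trajectory file with whatever trajectory reader is in use.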
S103, calculating the root mean square distance of alpha carbon atoms of the protein skeleton and the rotation angle of the protein around the DNA, clustering the first conformation by using a K-means algorithm to obtain a primary clustering result, and performing a first round of MD simulation from a central conformation corresponding to each cluster in the primary clustering result to obtain a second conformation.
Further, as shown in fig. 2, the calculating the root mean square distance of the α carbon atoms of the protein skeleton and the rotation angle of the protein around the DNA includes:
S201, setting the long axis of the DNA double strand in the initial crystal structure as the X axis, and setting the center of mass of the whole DNA molecule as the origin of the spatial coordinates.
S202, taking a plurality of base pairs of the middle of the DNA double chain close to the protein as a reference, and aligning the first conformation with the initial crystal structure.
In this example, all conformations selected from the MD simulation trajectory are aligned with the initial crystal structure; the portion used for alignment is the middle 10 base pairs of the DNA duplex, e.g., A14 to A23 and T14' to T23'.
And S203, calculating the root mean square distance of the aligned alpha carbon atoms of the protein skeleton.
After the alignment, the root mean square of the displacement of each protein-backbone alpha carbon atom from its position in the initial crystal structure is calculated.
S204, projecting a vector from the center of mass of the protein to an origin connecting line to a YZ plane; the YZ plane is a plane which is vertical to the X axis and passes through the origin;
S205, setting the initial angle of the initial crystal structure to 0, that is, $\Theta(0) = 0$; the angle $\Theta(t)$ between the vector projections of the aligned first conformation and of the initial crystal structure is calculated as the rotation angle of the protein around the DNA.
The angle $\Theta(t)$ is calculated as follows:
The position of the protein centroid is calculated and the position of the origin is subtracted from it to obtain the vector $A$ of the conformation; the projection $A'$ of $A$ onto the YZ plane is then computed. The projected vector $A_0'$ of the initial crystal structure is calculated in the same way. From the dot-product formula $A' \cdot A_0' = |A'|\,|A_0'|\cos\Theta(t)$, $\cos\Theta(t)$ is obtained, and the angle $\Theta(t)$ is then calculated with the inverse cosine function.
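Steps S203 to S205 can be sketched in a few lines. This is an illustrative sketch, not the embodiment's implementation: the function names are hypothetical, coordinates are assumed to be already aligned and expressed in the frame of S201 (X along the DNA long axis, origin at the DNA center of mass), and the inverse cosine returns only the magnitude of the angle (0 to 180 degrees), so any sign convention would have to be added separately.

```python
import math


def rmsd(coords, ref_coords):
    """Root-mean-square distance between two aligned lists of (x, y, z) positions."""
    if len(coords) != len(ref_coords) or not coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords, ref_coords)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords))


def rotation_angle_deg(protein_com, ref_protein_com):
    """Angle between the YZ-plane projections of two origin-to-centroid vectors.

    Because the origin is the DNA center of mass and X is the DNA axis,
    projecting onto the YZ plane simply drops the x component.
    """
    ay, az = protein_com[1], protein_com[2]
    by, bz = ref_protein_com[1], ref_protein_com[2]
    norm = math.hypot(ay, az) * math.hypot(by, bz)
    if norm == 0.0:
        raise ValueError("projected vector has zero length")
    cos_theta = (ay * by + az * bz) / norm
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))
```

Each extracted frame then contributes one (root-mean-square distance, rotation angle) pair, which is the two-dimensional input used for the K-means clustering.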
As an embodiment of the present invention, clustering the first conformations using a K-means algorithm to obtain a primary clustering result includes:
In Matlab, the command "[idx, C] = kmeans(X, 25)" is entered to divide the 10000 conformations into 25 clusters by the K-means method, where X is a two-column matrix (one row per conformation) of root-mean-square distance and WRKY rotation angle. The 25 conformation clusters are collected for the subsequent first round of MD simulation.
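For readers without Matlab, the clustering step can be approximated with a plain Lloyd-style K-means. This is a sketch under assumptions: it uses random initial centers with a fixed seed rather than Matlab's default initialization, the helper `central_frames` (which picks the frame nearest to each centroid as the "central conformation") is a hypothetical name, and whether the two features were rescaled before clustering is not stated in the embodiment.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers (assumption)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(pt, centers[c]))
                      for pt in points]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def central_frames(points, labels, centers):
    """Index of the frame closest to each cluster center (None for an empty cluster)."""
    result = []
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        result.append(min(members, key=lambda i: squared_distance(points[i], center))
                      if members else None)
    return result
```

In the embodiment's setting, `points` would hold the 10000 (root-mean-square distance, rotation angle) pairs and `k` would be 25.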
As an embodiment of the present invention, starting from a central conformation corresponding to each cluster in the primary clustering result, performing a first round of MD simulation to obtain a second conformation, including:
under the force field of parmbsc1, using GROMACS software to establish a corresponding all-atom simulation system for the central structure corresponding to each cluster in the primary clustering result;
setting a time step size of 2fs and a random initial speed under an NPT ensemble, and performing 60ns MD simulation on the full-atom simulation system to obtain a first round of MD simulation track;
and deleting the conformation in each initial period of time of the first round MD simulation track, and using all the conformations in the rest simulation tracks as second conformations.
The initial period may be set to the first 10 ns. That is, the conformations of the first 10 ns are deleted from each simulation trajectory, and the conformations of the remaining 25 × 50 ns of trajectories are collected for clustering to prepare the input structures for the second round of MD simulation. The conformations within the initial period are deleted in order to reduce the influence of the initial path and to allow local equilibration.
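The bookkeeping for the first round can be checked with a short sketch. The function names and the frame-saving interval used in the example (1 ns) are hypothetical; only the 2 fs time step, the 60 ns run length, the 10 ns discard window and the 25 runs come from the text.

```python
def n_md_steps(sim_length_ns, timestep_fs):
    """Number of integration steps for a run (1 ns = 1,000,000 fs)."""
    return round(sim_length_ns * 1_000_000 / timestep_fs)


def drop_initial_period(frames, frame_interval_ns, discard_ns):
    """Remove the frames that fall inside the initial discard window."""
    n_skip = round(discard_ns / frame_interval_ns)
    return frames[n_skip:]


steps_per_run = n_md_steps(60, 2)            # 60 ns at 2 fs per step
kept = drop_initial_period(list(range(60)),  # hypothetical: one frame per ns
                           frame_interval_ns=1.0, discard_ns=10.0)
total_sampling_ns = 25 * (60 - 10)           # 25 runs x 50 ns retained each
```

With these inputs each run needs 30,000,000 steps, and the first round retains 1250 ns of sampling in total.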
And S104, clustering the second conformations by using a K-centers algorithm to obtain secondary clustering results, and performing a second round of MD simulation in a wide space from a central conformation corresponding to each cluster in the secondary clustering results to obtain a third conformation.
Further, clustering the second conformations using a K-centers algorithm includes:
selecting a plurality of distance pairs with the possibility of forming hydrogen bonds and salt bridges between the protein and the DNA as input parameters of clustering, and clustering the second conformation by using a K-centers algorithm.
There are several distance pairs in which the possibility of hydrogen bond and salt bridge formation exists, for example the heavy atom of amino acids Y119, K122, K125, R131, Y133, Q146, K144, R135, W116, R117, Y134, K118, Q121 in the protein (NH1, NH2, OH, NZ, NE2, ND2) and the backbone atoms of nucleotides a14 to 20 in DNA. The selected amino acids may form stable hydrogen bonds or salt bridges with the DNA.
In the present embodiment, the projected data are grouped into 100 clusters by the K-centers method using the MSMBuilder command "msmb KCenters -i ./tica_resources.h5 -o kcenter_output -t kcenter_output --n_clusters 100". The central structure of each cluster is then selected as the initial structure for the second round of MD simulation. Except for the velocities, the simulation information of the 100 selected structures, including positions, temperature, pressure, etc., is retained. After the first round of 25 simulations, the memory of the initial path has been weakened, so more clusters (e.g., 100) are generated for the second round in order to greatly broaden the conformational sampling.
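The idea behind K-centers clustering can be illustrated with the usual greedy farthest-point procedure. This is a sketch of the general technique, not MSMBuilder's code: the function name, the choice of the first center (index 0) and the Euclidean metric are assumptions made for illustration.

```python
import math


def k_centers(points, k, first=0):
    """Greedy farthest-point K-centers.

    Returns (center_indices, labels). Every center is an actual data point,
    so it can be used directly as the central conformation of its cluster.
    """
    centers = [first]
    # Distance from every point to its nearest center chosen so far.
    dist = [math.dist(p, points[first]) for p in points]
    labels = [0] * len(points)
    while len(centers) < k:
        # The next center is the point farthest from all existing centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
                labels[i] = len(centers) - 1
    return centers, labels
```

Because each new center is placed as far as possible from the existing ones, the clusters cover the sampled space roughly evenly, which suits the goal of seeding the second round of simulations from diverse starting structures.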
In this embodiment, starting from the central conformation corresponding to each cluster in the secondary clustering result, performing a second round of MD simulation in a wide space to obtain a third conformation, including:
selecting a full-atom simulation system of a central conformation corresponding to each cluster in the secondary clustering result as an initial system of a second round of MD simulation;
setting a time step of 2fs and a random initial speed in the NPT ensemble, and performing MD simulation for 60ns on the initial system of the second round of MD simulation to obtain a second round of MD simulation track; for example 100 simulated tracks.
And deleting the conformation in the initial period of time of each simulated track in the second round of MD simulated tracks, and taking all the conformations in the rest simulated tracks as a third conformation.
The initial period is set to the first 10 ns, i.e., the conformations of the first 10 ns of each simulation trajectory are deleted. From the remaining 100 × 50 ns of trajectories, 2500000 conformation frames are collected for clustering, providing the data basis for the subsequent construction of the Markov state model. The conformations within the initial period are deleted in order to reduce the influence of the initial path and to allow local equilibration.
And S105, constructing a Markov state model by using the third conformation, and analyzing the second round of MD simulation tracks.
Further, as shown in fig. 3, constructing a Markov state model using the third conformation includes:
S301, calculating a plurality of distance pairs between the protein and the DNA that may form hydrogen bonds or salt bridges in the third conformation, to be used as input parameters of clustering.
There are several distance pairs in which the possibility of hydrogen bond and salt bridge formation exists, for example the heavy atom of amino acids Y119, K122, K125, R131, Y133, Q146, K144, R135, W116, R117, Y134, K118, Q121 in the protein (NH1, NH2, OH, NZ, NE2, ND2) and the backbone atoms of nucleotides a14 to 20 in DNA. The selected amino acids may form stable hydrogen bonds or salt bridges with the DNA.
S302, reducing the dimensions of the data of the distance pairs, taking the data of the distance pairs after the dimensions are reduced as input, and clustering the third conformation into 500 micro-states by utilizing a K-centers algorithm.
In this embodiment, the 415 distance pairs of all trajectory conformations are calculated as input, and time-lagged independent component analysis (tICA) is used to reduce the data to two dimensions.
tICA is a dimension-reduction method that computes the time-lag correlation matrix $C(\Delta t)$ and uses it to determine the slowest-relaxing degrees of freedom of the simulated system. The elements of the matrix are given by
$$C_{ij}(\Delta t) = \langle X_i(t)\, X_j(t+\Delta t) \rangle$$
where $X_i(t)$ is the value of the i-th reaction coordinate at time $t$, $X_j(t+\Delta t)$ is the value of the j-th reaction coordinate at time $t+\Delta t$, and $\langle X_i(t)\, X_j(t+\Delta t) \rangle$ is the expected value of their product over all simulation trajectories. The direction of the slowest-relaxing degree of freedom corresponds to the largest eigenvalue of the time-lag correlation matrix $C(\Delta t)$. The two tICs are the lowest-dimensional data able to distinguish the states of this Markov state model.
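As a hedged note on how the tICs are usually obtained in practice: in the standard formulation of tICA (not spelled out in the text above), the time-lag correlation matrix described above, written here as $C(\Delta t)$, is diagonalized against the instantaneous correlation matrix $C(0)$. The symbols $C(0)$, $v_k$, $\lambda_k$ and $y_k$ are introduced here for illustration only, and the reaction coordinates are assumed to be mean-free.

$$C(\Delta t)\, v_k = \lambda_k\, C(0)\, v_k, \qquad y_k(t) = v_k^{\top} X(t), \qquad \lambda_1 \ge \lambda_2 \ge \dots$$

Under these assumptions, the two-dimensional input to the micro-state clustering would be $(y_1(t), y_2(t))$, i.e., the projections of the 415 distance pairs onto the eigenvectors with the two largest eigenvalues.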
S303, calculating the implied timescales for different lag times and numbers of micro-states, generating implied-timescale curves, and selecting the lag time and number of micro-states corresponding to the flat part of the curves as parameters of the Markov state model.
When the implied-timescale curves level off with increasing lag time and the timescales are well separated, the system is considered Markovian. The lag time τ at which the implied timescales begin to level off is then selected as the lag time for building the Markov state model.
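The implied timescale itself follows from the eigenvalues of the transition probability matrix estimated at lag time τ, via the standard relation t_i = -τ / ln λ_i(τ). The sketch below shows only this relation; the function name is hypothetical and the eigenvalues would in practice come from the MSM software.

```python
import math


def implied_timescale(lag_time_ns, eigenvalue):
    """Implied timescale t = -tau / ln(lambda) for a non-stationary eigenvalue.

    The eigenvalue must lie strictly between 0 and 1; the stationary
    eigenvalue (exactly 1) has no finite timescale.
    """
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must be in the open interval (0, 1)")
    return -lag_time_ns / math.log(eigenvalue)
```

Repeating this calculation for a series of lag times and plotting the result against τ gives the implied-timescale curves; a lag time in the region where the curves stop rising is the one chosen for the model.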
S304, classifying the plurality of micro-state conformations into macro-state conformations, and calculating the number of the macro-state conformations and the proportion of the macro-state conformations in the total conformations to make the proportion of the macro-state conformations in the total conformations larger than a preset proportion threshold.
In this embodiment, the 500 micro-states are lumped into 3 to 6 macro-states by the PCCA+ algorithm in the MSMBuilder software, and for each macro-state the number of conformations and its proportion of the total conformations are calculated. For example, with a preset proportion threshold of 3%, if the 500 micro-states are lumped into 5 macro-states and one of the 5 macro-states contains only 0.02% of the total conformations, which is below the 3% threshold, the lumping into 5 macro-states is unreasonable and must be redone. After re-lumping the 500 micro-states into 3 macro-states, each of the 3 macro-states accounts for more than 3% of the total conformations, so the 3 macro-states meet the requirement. In this way, a small number of macro-states are constructed by kinetic lumping of hundreds of micro-states, and a reduced kinetic network model is determined for the most essential conformational changes of the biomolecule.
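The acceptance test on a lumping result can be sketched as a simple population check. The function names and the example label lists are hypothetical; the 3% threshold is the one given above, and the PCCA+ lumping itself is assumed to have been done elsewhere.

```python
from collections import Counter


def macrostate_populations(macro_labels):
    """Fraction of conformations assigned to each macro-state."""
    counts = Counter(macro_labels)
    total = len(macro_labels)
    return {state: n / total for state, n in counts.items()}


def lumping_is_acceptable(macro_labels, min_fraction=0.03):
    """True if every macro-state holds more than min_fraction of all conformations."""
    populations = macrostate_populations(macro_labels)
    return all(frac > min_fraction for frac in populations.values())


# Illustrative label lists (one macro-state label per conformation).
three_states = ["S3"] * 63 + ["S2"] * 30 + ["S1"] * 7
with_tiny_state = ["A"] * 9998 + ["B"] * 2  # one state holds only 0.02%
```

A lumping that fails the check would be repeated with a different number of macro-states, as described above.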
Further, as shown in fig. 4, the analyzing the second round MD simulated trajectory includes:
S401, constructing a plurality of Monte Carlo trajectories according to the transition probability matrix of the micro-states of the Markov state model, and setting the time step of the Monte Carlo trajectories.
In this embodiment, 5 monte carlo trajectories of 10ms are constructed according to the transition probability matrix of 500 micro-state MSMs, and the lag time of 10ns is set as the time step of the monte carlo trajectories.
S402, calculating the first transition time between the macroscopic states in each Monte Carlo trajectory, and then calculating the average first transition time of each macroscopic state over the Monte Carlo trajectories.
In this embodiment, calculating the first transition time between the macroscopic states comprises:
in each Monte Carlo trajectory, counting the number of time steps required for each transition between macroscopic states, the first transition time being equal to the number of steps multiplied by the time step of 10 ns.
Calculating the average first transition time of each macroscopic state comprises: averaging the first transition times obtained from the individual Monte Carlo trajectories to obtain the average first transition time.
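The counting in step S402 can be sketched as follows. The counting convention is an assumption, since the text does not fix one: the clock starts when the trajectory is first found in the source macroscopic state and stops when it first reaches the target macroscopic state. The function names are hypothetical, and the trajectories are assumed to have already been mapped from microscopic to macroscopic state indices.

```python
def first_passage_steps(macro_traj, source, target):
    """Step counts of successive first passages from `source` to `target`."""
    durations = []
    start = None  # index at which the current passage started, if any
    for i, state in enumerate(macro_traj):
        if state == source and start is None:
            start = i                      # clock starts on arrival in source
        elif state == target and start is not None:
            durations.append(i - start)    # clock stops on first arrival in target
            start = None
    return durations

def mean_first_transition_time(macro_trajs, source, target, lag_ns=10.0):
    """Average over trajectories of the per-trajectory mean first transition time (ns)."""
    per_traj = []
    for traj in macro_trajs:
        steps = first_passage_steps(traj, source, target)
        if steps:  # skip trajectories in which the transition never occurs
            per_traj.append(lag_ns * sum(steps) / len(steps))
    if not per_traj:
        return float("nan")
    return sum(per_traj) / len(per_traj)
```

For example, the toy sequence `[0, 0, 1, 0, 2, 2, 0, 1, 2]` contains two passages from state 0 to state 2, of 4 and 2 steps, which at a 10 ns step gives a per-trajectory mean of 30 ns.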
S403, according to the proportion of each macroscopic state in the total number of conformations and the average first transition time of each macroscopic state, summing the average first transition times of all the macroscopic-state transitions along each path of the movement process to obtain the average total transition time of each path. For example, for the two possible paths S1-S3 and S1-S2-S3, the average first transition time of S1-S3 is found to be greater than the sum of the average first transition times of S1-S2 and S2-S3. The path with the shortest total transition time is taken as the shortest path for the WRKY protein to move on the DNA, and the step with the longest average first transition time within the shortest path is taken as the key rate-determining step. For example, in S1-S2-S3, the average first transition time of S2-S3 is greater than that of S1-S2, so S2-S3 is the key rate-determining step. The shortest path is the path that occurs fastest and with the highest probability during the movement process. The key rate-determining step is the step that determines the rate of the whole movement along the shortest path.
According to the above embodiment, a movement process of S1 -> S2 -> S3 is obtained, where S1 is the initial state and S3 is the state after the protein has stepped by 1 bp; S3 has the highest conformational proportion (63%) and is quite stable. The intermediate state S2 connects S1 and S3 and has an intermediate proportion (30%). Overall, the center of mass (COM) of the WRKY protein shifts by about 2.9 angstroms and rotates by about 55 degrees, completing a 1-bp step. The rate-limiting step of the shortest path S1 -> S2 -> S3 in the WRKY stepping process is S2 -> S3, which takes 7 μs. In contrast, S1 -> S2 is completed rapidly, in about 0.06 μs.
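The path comparison of step S403 can be sketched as below. The S1 -> S2 and S2 -> S3 times are the values reported in this embodiment; the time for the direct S1 -> S3 transition is not given in the text (it is only stated to exceed the sum of the two steps), so the value used here is a hypothetical placeholder, as is the function name `rank_paths`.

```python
def rank_paths(paths, mfpt_us):
    """Return (total time, path, slowest step) for each path, fastest path first."""
    results = []
    for path in paths:
        steps = list(zip(path[:-1], path[1:]))         # consecutive state pairs
        total = sum(mfpt_us[step] for step in steps)   # average total transition time
        slowest = max(steps, key=lambda step: mfpt_us[step])
        results.append((total, path, slowest))
    return sorted(results, key=lambda r: r[0])

mfpt_us = {
    ("S1", "S2"): 0.06,   # from the embodiment
    ("S2", "S3"): 7.0,    # from the embodiment
    ("S1", "S3"): 10.0,   # hypothetical placeholder, not reported in the text
}
ranked = rank_paths([["S1", "S3"], ["S1", "S2", "S3"]], mfpt_us)
shortest_total, shortest_path, rate_determining_step = ranked[0]
```

With these inputs the shortest path is S1 -> S2 -> S3 and its slowest step is S2 -> S3, in line with the result described above.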
According to the embodiment of the invention, by establishing a Markov state model, a reduced kinetic network model is determined for the most basic conformational changes of a biomolecule, and physical mechanisms that cannot be obtained from experiments or direct simulation can be provided, such as the intermediate macroscopic states, the main movement path, and the rate-determining step of the movement process.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
As shown in FIG. 5, the apparatus 500 includes:
an obtaining module 510, configured to obtain a full-atom MD simulation trajectory of a WRKY domain protein moving on a DNA double strand as an initial path;
an extraction module 520, configured to uniformly extract, from the initial path, conformations of the WRKY protein and the DNA of several frames as a first conformation;
a first clustering simulation module 530, configured to calculate the root-mean-square distance of the alpha carbon atoms of the protein backbone and the rotation angle of the protein around the DNA, cluster the first conformations by using a K-means algorithm to obtain a primary clustering result, and perform a first round of MD simulation starting from the central conformation corresponding to each cluster in the primary clustering result to obtain a second conformation;
a second clustering simulation module 540, configured to cluster the second conformations by using a K-centers algorithm to obtain a secondary clustering result, and perform a second round of MD simulation in a wide space starting from the central conformation corresponding to each cluster in the secondary clustering result to obtain a third conformation;
and a modeling analysis module 550, configured to construct a Markov state model by using the second-round MD simulation trajectory, and analyze the second-round MD simulation trajectory.
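As an illustration of the rotation angle computed by the first clustering simulation module, the following sketch assumes that coordinates are already expressed in a frame whose X axis is the long axis of the DNA and whose origin is the DNA center of mass, and that the conformation has been aligned to the initial crystal structure. The function name and the sign convention (counter-clockwise positive when viewed down the X axis) are assumptions made for the example.

```python
import numpy as np

def rotation_angle_deg(ref_centroid, centroid):
    """Signed angle (degrees) between the YZ-plane projections of two centroid vectors.

    ref_centroid : protein centroid in the initial crystal structure (angle defined as 0)
    centroid     : protein centroid in the aligned conformation
    """
    ref_yz = np.asarray(ref_centroid, dtype=float)[1:]  # drop X: projection onto YZ plane
    cur_yz = np.asarray(centroid, dtype=float)[1:]
    cross = ref_yz[0] * cur_yz[1] - ref_yz[1] * cur_yz[0]  # 2D cross product
    dot = float(ref_yz @ cur_yz)
    return float(np.degrees(np.arctan2(cross, dot)))
```

Using `arctan2` of the cross and dot products gives a signed angle in the range (-180, 180] degrees without needing to normalize the projected vectors.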
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In the technical solution of the invention, the acquisition, storage, and use of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order or good customs.
The invention also provides an electronic device and a readable storage medium according to the embodiment of the invention.
FIG. 6 illustrates a schematic block diagram of an electronic device 600 that may be used to implement embodiments of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
The device 600 includes a computing unit 601, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 executes the respective methods and processes described above, such as the methods S101 to S105. For example, in some embodiments, the methods S101-S105 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the methods S101-S105 described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the methods S101-S105 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for analyzing a movement process of a transcription factor protein on DNA, comprising:
acquiring a full-atom MD simulation track of WRKY structural domain protein moving on a DNA double strand as an initial path;
uniformly extracting conformations of WRKY structural domain protein and DNA of a plurality of frames from the initial path as a first conformation;
calculating the root-mean-square distance of alpha carbon atoms of a protein framework and the rotation angle of the protein around DNA, clustering the first conformation by using a K-means algorithm to obtain a primary clustering result, and performing a first round of MD simulation from a central conformation corresponding to each cluster in the primary clustering result to obtain a second conformation;
clustering the second conformations by using a K-centers algorithm to obtain secondary clustering results, and performing a second round of MD simulation in a wide space from a central conformation corresponding to each cluster in the secondary clustering results to obtain a third conformation;
and constructing a Markov state model by using the third conformation, and analyzing the second round of MD simulation tracks.
2. The method of claim 1, wherein calculating the root mean square distance of alpha carbon atoms of the protein backbone and the rotation angle of the protein around the DNA comprises:
setting the long axis of the DNA double strand in the initial crystal structure as the X axis and the center of mass of the whole DNA molecule as the origin of the spatial coordinates;
aligning the first conformation with the initial crystal structure based on several base pairs in the middle of the DNA double strand near the protein;
calculating the root-mean-square distance of the alpha carbon atoms of the protein backbone after alignment;
projecting the vector along the line connecting the protein centroid and the origin onto the YZ plane, the YZ plane being the plane that is perpendicular to the X axis and passes through the origin;
setting the initial angle of the initial crystal structure to 0, and calculating the angle between the vector projection of the initial crystal structure and that of the aligned first conformation as the rotation angle of the protein around the DNA.
3. The method of claim 1, wherein the first round of MD simulation to obtain a second conformation comprises:
under the force field of parmbsc1, establishing a corresponding all-atom simulation system for the central structure corresponding to each cluster in the primary clustering result;
setting a time step and a random initial speed in an NPT ensemble, and performing MD simulation on the all-atom simulation system to obtain a first round of MD simulation track;
and deleting the conformation in the initial period of time of each simulation track in the first round of MD simulation tracks, and taking all the conformations in the rest simulation tracks as a second conformation.
4. The method of claim 1, wherein said clustering the second conformations using a K-centers algorithm comprises:
selecting a plurality of distance pairs with the possibility of forming hydrogen bonds and salt bridges between the protein and the DNA as input parameters of clustering, and clustering the second conformation by using a K-centers algorithm.
5. The method of claim 1, wherein performing a second round of MD simulation in the extensive space to obtain a third conformation comprises:
selecting a full-atom simulation system of a central conformation corresponding to each cluster in the secondary clustering result as an initial system of a second round of MD simulation;
setting a time step and a random initial speed in the NPT ensemble, and performing MD simulation on the initial system of the second round of MD simulation to obtain a second round of MD simulation track;
and deleting the conformation in the initial period of time of each simulated track in the second round of MD simulated tracks, and taking all the conformations in the rest simulated tracks as a third conformation.
6. The method of claim 1, wherein constructing a Markov state model using the third conformation comprises:
calculating a plurality of distance pairs with the possibility of forming hydrogen bonds and salt bridges between the protein and the DNA in the third conformation as input parameters of clustering;
performing dimensionality reduction on the data of the distance pairs, taking the data of the distance pairs subjected to dimensionality reduction as input, and clustering the third conformation into a plurality of micro-states by utilizing a K-centers algorithm;
calculating the implied timescales corresponding to different lag times and numbers of microscopic states, generating implied timescale curves, and selecting the lag time and the number of microscopic states corresponding to the flat part of the implied timescale curves as parameters of the Markov state model;
classifying the conformations of the plurality of microscopic states into macroscopic states, and calculating the number of conformations in each macroscopic state and the proportion of that number to the total number of conformations, such that the proportion for each macroscopic state is larger than a preset proportion threshold.
7. The method of claim 6, wherein the analyzing the second round of MD simulated trajectories comprises:
constructing a plurality of Monte Carlo tracks according to the transition probability matrix of the microscopic state of the Markov state model, and setting the time step length of the Monte Carlo tracks;
calculating the first transition time between the macroscopic states in each Monte Carlo track, and then calculating the average first transition time of the macroscopic states in each Monte Carlo track;
according to the proportion of each macroscopic state in the total number of conformations and the average first transition time of each macroscopic state, taking the path with the shortest total transition time as the shortest path of WRKY protein movement on the DNA, and taking the step with the longest average first transition time in the shortest path as the key rate-determining step.
8. An apparatus for analyzing a movement process of a transcription factor protein on a DNA, comprising:
the acquisition module is used for acquiring a full-atom MD simulation track of WRKY structural domain protein moving on a DNA double chain as an initial path;
the extraction module is used for uniformly extracting the conformations of the WRKY structural domain protein and the DNA of a plurality of frames from the initial path as a first conformation;
the first clustering simulation module is used for calculating the root-mean-square distance of alpha carbon atoms of a protein framework and the rotation angle of the protein around the DNA, clustering the first conformation by using a K-means algorithm to obtain a primary clustering result, and performing a first round of MD simulation from a central conformation corresponding to each cluster in the primary clustering result to obtain a second conformation;
the second clustering simulation module is used for clustering the second conformations by utilizing a K-centers algorithm to obtain secondary clustering results, and performing a second round of MD simulation in a wide space from a central conformation corresponding to each cluster in the secondary clustering results to obtain a third conformation;
and the modeling analysis module is used for constructing a Markov state model by utilizing the second-round MD simulation track and analyzing the second-round MD simulation track.
9. An electronic device comprising at least one processor; and
a memory communicatively coupled to the at least one processor; characterized in that
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202210128910.2A 2022-02-11 2022-02-11 Method and device for analyzing movement process of transcription factor protein on DNA Pending CN114496103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210128910.2A CN114496103A (en) 2022-02-11 2022-02-11 Method and device for analyzing movement process of transcription factor protein on DNA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210128910.2A CN114496103A (en) 2022-02-11 2022-02-11 Method and device for analyzing movement process of transcription factor protein on DNA

Publications (1)

Publication Number Publication Date
CN114496103A true CN114496103A (en) 2022-05-13

Family

ID=81480978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210128910.2A Pending CN114496103A (en) 2022-02-11 2022-02-11 Method and device for analyzing movement process of transcription factor protein on DNA

Country Status (1)

Country Link
CN (1) CN114496103A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116537A (en) * 2022-08-29 2022-09-27 香港中文大学(深圳) Method and system for calculating multiple transformation paths of biomolecule functional dynamics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007025205A2 (en) * 2005-08-24 2007-03-01 Oregon Health And Science University Systems and methods for identifying functional dna codes of condition-specific common cis-regulatory motifs and modules and their corresponding transcription factors
US20090037118A1 (en) * 2007-06-21 2009-02-05 Ravinder Abrol Methods for predicting three-dimensional structures for alpha helical membrane proteins and their use in design of selective ligands
CN112420122A (en) * 2020-11-04 2021-02-26 南京大学 Method for identifying allosteric site of action of endocrine disruptor and nuclear receptor
CN113990401A (en) * 2021-11-18 2022-01-28 北京深势科技有限公司 Method and apparatus for designing drug molecules of intrinsically disordered proteins

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄文亢: ""蛋白质别构位点识别方法发展及机制研究"", 《中国优秀博士学位论文全文数据库 基础科学辑(月刊)》, no. 3, 15 March 2020 (2020-03-15), pages 1 - 117 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116537A (en) * 2022-08-29 2022-09-27 香港中文大学(深圳) Method and system for calculating multiple transformation paths of biomolecule functional dynamics
CN115116537B (en) * 2022-08-29 2022-12-06 香港中文大学(深圳) Method and system for calculating multiple transformation paths of biomolecule functional dynamics
WO2024045933A1 (en) * 2022-08-29 2024-03-07 香港中文大学(深圳) Method and system for calculating multiple transition paths of biomolecular functional dynamics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination