Biomedical Informatics 214 (also listed as Computer Science 274)
Representations and Algorithms for Computational Molecular Biology

Spring 2007

Project 4
Molecular Dynamics Simulations of Biological Structures
Due electronically by 5:00pm, Wednesday, 5/30/07

Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone.
WARNING: You should get your code working early for this project, as the actual running of your program can take quite a while for all required runs!

Contents:
  1. Introduction
  2. Overview of a Basic MD Simulation
  3. Project description
      a. The simplified energy function
      b. Simulating the motion of atoms
  4. Data input and output
  5. Visualizing your simulations
  6. Deliverables
  7. Submission instructions
  8. Code Optimization Tips
Introduction

Molecular Dynamics (MD) is using computer to simulate the interaction of atoms and molecules for a period of time under known laws of physics. It was first developed in the area of theoretical physics and traditionally has been applied in material science, biochemistry, and biophysics. Over the last decade or so, molecular dynamics has seen great success in the field of biology, shedding light on the mechanisms by which proteins fold, stabilize, interact, and function. It is now used routinely during structure determination and refinement, and as a tool in drug design. MD is especially powerful in that it allows us to make predictions and engineer drugs that we would not be able to make using sequence or static methods alone. For example, proteins often change conformation upon ligand (the small molecule interacting with it) binding. Given the unbounded protein structure, we may not be able to predict the binding because the functional site is not in the correct shape. Using MD, during the course of simulation, the bounded conformation may appear, and we will recognize the correct binding form.

As in animation, MD consists of many snapshots of atom positions at a series time spots. The principles governing the behavior of molecules are essentially those of chemistry and physics, such as the force fields and Newtonian or quantum mechanics, from which we can derive the interacting forces and compute the positions of atoms at each time spot.

In Project 4, you will get the chance to implement a molecular dynamics simulation of a small protein. This is going to be fun, because you are making animations of a "dancing" protein!

Start the project early because the simulations take time.

<< back to top




Overview of a Basic MD Simulation

At its core, an MD simulation can be broken down into simple components. These parts and how they should be implemented within the scope of this project are expanded upon in the following sections.
  1. A biological molecule is represented within a simulation as a set of atoms.
  2. These atoms each have mass (though for simplification purposes different atoms may all be given the same constant mass).
  3. Some of these atoms are chemically bonded to each other (such as adjacent carbon atoms in a single amino acid), and the others are not (the alpha carbon atoms of different amino acids). Only bonded atoms, and nonbonded atoms within a certain cutoff distance from each other will be considered to interact with each other.
  4. Atoms exert forces on one another, depending on whether they are bonded or not, the distances between each pair of atoms, their electostatic charges, and other factors contained within the force/energy model (also called the Force Field) you are using for your simulation.
  5. The atoms also start with initial velocities in 3 dimensions, based on the temperature at which the simulation is run.
  6. In reality these atoms move smoothly in continuous time; in a model we divide time into little slices (dt), determine the sum of the forces acting on each atom for a given time slice, and use classical mechanics to determine the change in velocity and position of the atom over that time slice.
  7. The atoms are then moved, the forces acting on each atom are re-calculated at this new time t+dt, and this is repeated for as many time slices as desired.
  8. Smaller time slices are more accurate as we are modeling forces that in reality act continuously and dynamically as atoms move in relation to each other. However, small slices are more costly: if your time slices are 1 picosecond, you need 1000 slices to model motion over 1 nanosecond. Using a 0.1 picosecond slice would more accurately capture the motion of the atoms, but you would need 10 times the number of iterations. Computation time requirements can build up quickly!
  9. The atoms' positions at each time slice can be output to files in specific formats that allow them to be read by various visualization software packages. Displaying atom positions at each time in succession makes a movie showing us how the molecule as a whole behaves as its atoms move during the simulation.
<< back to top




Project description

THE SIMPLIFIED FORCE FIELD

A given protein structure will have a certain total energy depending on the potential and kinetic energies taken over all of the atoms. As discussed in lecture, the potential energy is calculated using an empirically derived function or "force field" containing several terms corresponding to the different contributions to potential energy: bond, angle, torsion, and nonbonded interactions. Deviations in these terms away from their reference values will result in higher energy (and thus greater instability). For this project, we will be using a simplified force field function containing only two terms, one for bond interactions and one for nonbonded interactions (see the lecture notes for a description of the standard 4 term force field). Bond interactions are due to direct chemical connections between atoms, and nonbonded interactions occur between atoms with no direct chemical bond.

Our force field has the following functional form:


The first term captures the interaction between chemically connected or bonded atoms, which is modeled as a physical spring (if you remember from physics, the potential energy in a spring is generally E = (1/2)kx^2 where x is the distance from the center of the spring and k is the spring constant). The term thus sums the potential energy between all pairs of bonded atoms using the deviation in bond lengths away from their reference lengths. Note that the potential energy for this term is zero when all bonds are at their reference bond distance. Low or zero energy is favorable for the stability of a structure.

The second term approximates all of the nonbonded interactions in the system and we will also use a spring-type equation to model this (this is a big simplification - see the lecture notes for a more canonical expression modeling nonbonded interactions). This term will sum over all pairs of atoms that are NOT bonded to each other, but are within a certain distance from each other. We will use an input parameter to specify the distance cutoff for nonbonded interactions. In reality, we would need to account for charge, electrostatic and van der Waals interactions, but for the purposes of this project we will be modeling all nonbonded interactions with one simple term.

If two atoms are too far away, the nonbonded interaction between them is negligible. We will have a distance cutoff--if two atoms are not bonded and are closer than the cutoff, then they are considered as nonbonded pair. Note that the pairs of nonbonded atoms technically may change between time steps as a result of atoms moving within or beyond the distance cutoff. For our purposes, you should only find the nonbonded pairs once - at the beginning of the simulation using the input files - and use these as a constant set throughout the simulation. The cutoff value is specified in the input file.

SIMULATING THE MOTIONS OF ATOMS

In order to make the molecule "move", we must know how the positions of the atoms change over time. We get this by calculating the forces on each atom, using these forces to calculate the velocities of each atom, and then applying these velocities to obtain the new position of each atom for each time step.

Deriving the forces from the force field

In order to compute the acceleration of an atom, we need to compute the total force acting on the atom. To do so, firstly, we need to know how to compute the force acting on an atom from the interaction (bond or nonbond) with one single other atom.
Force is the derivative of the potential energy, and so in order to compute the force, you must derive an expression for force from the terms in the force field. Using the bonded term of our force field as an example:
PEbonds = (1/2)kb(b-b0)2
F = dPEbonds/db
  = d((1/2)kb(b-b0)2)/db
  = 2(1/2)kb(b-b0)d(b-b0)/db
  = kb(b-b0)

This expression represents the force between two atoms paired in a bond, and is proportional in the length of the bond (b) from its reference length (b0). Similarly, you can derive the formula for the force between nonbond atoms. Bond and nonbond lengths are calculated as Euclidean distance between the two atoms in the pair.

 

The expression for force derived above is actually the magnitude of the force. Force has direction in 3D space. For the convenience in computing the 3D coordinates of an atom in the next time step, we would like to know the force component in each XYZ direction. To do so, we multiply it separately by the proportion of the displacement in each of those directions. For example, to compute the force acting on atom A from the interaction with atom B: 
Fx(A,B) = F*(xB-xA)/b
Fy(A,B) = F*(yB-yA)/b
Fz(A,B) = F*(zB-zA)/b
 
From the formula, you would notice that, if the distance between two atoms is larger than the reference length, the atom A will be attracted to atom B; otherwise, A will receive a repulsion force. When computing the force acting on atom B from the interaction with atom A, you should switch the A's and B's in the formula. Beware of the signs--you can get very frustrated if you get them wrong!
 
The total force acting on one atom in each of the XYZ directions is the sum of forces in that direction with all atoms in interaction:

Fx(A) = sum of Fx(A,B) for all other atoms B  
Fy(A) = sum of Fy(A,B) for all other atoms B
Fz(A) = sum of Fz(A,B) for all other atoms B


Updating the positions and velocities

The motion of atoms can be described using mathematical equations incoporating position, velocity, and accelaration, which you may remember from classical physics. However, due to the complexity of the force field, we can only approximate the solutions to these equations numerically using an intergration algorithm. Several integration algorithms have been developed. Some of them are discussed at http://www.ch.embnet.org/MD_tutorial/pages/MD.Part1.html.
 
In our project, we use the Velocity Verlet algorithm:
rt+dt = rt + vtdt + (1/2)atdt2
vt+dt/2 = vt + (1/2)atdt
at+dt = (1/m) Ft+dt
vt+dt = vt+dt/2 + (1/2)at+dtdt
r is the position, v is the velocity, a is the acceleration, and dt is the length of one time step. You should strictly follow the formula order to update the state of an atom. You should also update the components in XYZ directions separately--therefore, each formula above will be expanded into 3 formula, for example:
rx(t+dt) = rx(t) + vx(t)dt + (1/2)ax(t)dt2
ry(t+dt) = ry(t) + vy(t)dt + (1/2)ay(t)dt2
rz(t+dt) = rz(t) + vz(t)dt + (1/2)az(t)dt2
Keeping track of energy

The total energy of the molecule is a good indicator of the stability of the simulation. It is also a good check on whether your code is working correctly - a proper molecular dynamics simulation with reasonable parameters should have essentially constant total energy (give or take 1 or 2 kiloJoules, in our case). Recall that once the atoms start to move, they will have kinetic energy in addition to the bonded and nonbonded energy. It is normal for KE and PE to trade off throughout the simulation, but if your total energy varies it means something is wrong with your implementation. That means that the total energy (Etot = KE + PEbonds + PEnonbonds) must remain constant - you will be calculating and monitoring all of the energies and the total energy during every time step as part of your project output. You should keep track of the energy to monitor how your simulation is going and output the energy values at EACH time step. You will need to include all three terms in your total energy calculation and write out all three terms in your output.

Kinetic energy is a function of velocity and is calculated using the formula: KE = (1/2)mv2. You should calculate the kinetic energy after updating the velocity at t+dt (not the half time velocity). The total kinetic energy is also the sum of the individual total kinetic energies computed from the velocities in each of the three dimensions. Note that force is a vector, but energy is a scalar. You do not need to consider kinetic energy when calculating the forces on each atom.

  • For each pair of interating atoms, you only need to compute the bond or nonbond energy ONCE.
  • Calculating velocities from temperature

    In order for you to understand the effect of the temperature parameter in simulations, we have created four total versions of the input file (50 K, 300 K, 1000 K, and 4500 K, corresponding to cool, normal, hot, and blazing) that will run the same default simulation, but with different initial velocities corresponding to the different temperatures.

    The temperature of the simulation primarily affects the initial distribution of atomic velocities. This is discussed at http://www.ch.embnet.org/MD_tutorial/pages/MD.Part1.html#Verlet (and other places).

    SUMMARY OF STEPS FOR CODING THE SIMULATION

    1. Parse an input file containing atom, coordinate, initial velocity, connectivity, and parameter information.
    2. Iterate through 1000 time steps, and for each time step do the following:
        a. Update the velocities on each atom (the forces should be initialized to zero when you do this in the first time step, so your velocities won't change yet)
        b. Update the positions of each atom
        c. Make sure forces and energy are zero (if you are keeping global variables for these)
        d. For each interaction pair (bonded and nonbonded):
          i. Calculate the potential energy and update the total potential energy
          ii. Calculate the force and update the total forces in each dimension for each atom
        e. Update the velocities on each atom and calculate the kinetic energy
    3. Write the appropriate output every 10 time steps.

    << back to top


    Data input and output

    Data Input

    Input to your program will consist of an input filename and a set of optional parameters to replace default values. Your program MUST be able to accept these options on the command line. The parameters options your program should accept are:
    -if (input filename)
      no default value, must be specified
    -kB (same as kb)
      default value = 40000.0
    -kN (same as kn)
      default value = 400.0
    -nbCutoff (distance within which atoms should be considered as having nonbonded interactions, if they are not a bonded pair)
      default value = 0.50
    -m (atom mass, a constant to be applied to all atoms)
      default value = 12.0
    -dt (length of time step)
      default value = 0.001

    An example of a call to your program replacing the default kb value with a value of 800.0 is:
    python MyMD.py -if ./input.rvc -kB 800.0

    The input files (RVC files) are of extension .rvc (r = positions, v = velocities, c = connectivity), specify the positions and velocities of every atom in the simulation, and contain one line per atom. The first line in the file is always a header line containing the parameters used to generate that file (all will be the default values, except for the temperature). Each line after the first consists of (in white-space-delimited format) the atom ID number (1 indexed), the atom's starting x coordinate, y coordinate, z coordinate, initial x, y, and z velocities, and then up to 4 atom ID numbers corresponding to the atoms "connected", or bonded, to that atom. See example below:

    # 6pti.pdb: kB=40000.0 kN=400.0 nbCutoff=0.50 dt=0.0010 mass=12.0 T=300.0
    atomNum X Y Z Vx Vy Vz bond1 bond2 bond3 bond4
    ...
    [etc]

    The input files essentially represent frame 0 of the simulation; all values are at time t = 0. You will calculate reference lengths and other reference values using the input file. You will use the connectivity information to determine the bonded pairs of atoms, and you will need to calculate inter-atom distances from the initial configuration and compare to nbCutoff to find all the nonbonded pairs of atoms. These nonbond reference distances (r0) are fixed through the simulation, and they do not change as the molecule moves, since they represent the "forces" that hold together the initial structural configuration. You should also calculate your reference bond lengths from the input file. Note that we are using a constant mass across all atom types (this is a simplification - in reality, every type of atom has a different atomic mass).

    For many of the simulations you will be doing, however, you will only be changing one or two parameters from the default, so it is best to hard code the default parameter values and override these if command line options are used. You should also run each simulation for 1000 time steps, so you can hard code this value in as well, though if you are inclined to make longer simulations (not required for this project) you may modify this.

    The input files you will be using are: 6pti_50_K, 6pti_300K.rvc, 6pti_4500K.rvc.

    These input files were generated using different temperatures (50K, 300K, and 4500K) and so have different initial velocities.

    Data Output

    The output of your program will be two different files, a .erg (ERG) file and a .rvc (RVC) file.

    The ERG file should contain your energy values for each time step, and should look exactly like this one (with different values, of course). The lines should be tab-delimited, with the time step number in the first column, the kinetic energy in the second column, the bond potential energy in the third column, the nonbond potential energy in the fourth column, and the total energy in the last column. All values should be written out to exactly 1 decimal place.

    The RVC output file will look very similar to the input RVC file. During simulation, you should output all the atoms' IDs, positions, velocities and connectivities every 10 time steps, concatenated together into one RVC output file, like this example file. You should include the original input RVC file and you should use 1000 as the default number of total steps (feel free to perform longer simulations if you have the time and inclination), so you should have about 100 frames in your output file.

    You should then use the following Python script to convert your RVC files into AMBER coordinate files (.crd), which will be needed to visualize your simulations: convertRVCtoCRD.py

    To run this script on a machine with Python installed, simply type:

    > python convertRVCtoCRD.py input.rvc output.crd

    AMBER stands for "Assisted Model Building with Energy Refinement", and refers both to a set of force fields commonly used in MD simulations of biomolecules, as well as to a popular package of MD simulation programs. The visualization program we are using is capable of loading the AMBER file format.

    For your debugging convenience, we have provided RVC and ERG files for 2 example simulations:

    Example 1: kB = 1234.0, all other parameters are default.
    (Output file: example1.rvc and example1.erg)
    Example 2: nbCutoff = 0.65, all other parameters are default.
    (Output file: example2.rvc and example2.erg)

    You will be running your simulation code on BPTI, bovine pancreatic trypsin inhibitor (PDB ID: 6PTI). Download the 3 input files (6pti_*.rvc) from the Project 4 Files directory. Also download our version of the 6PTI PDB file (which has all the hydrogen atoms removed to reduce the number of atoms you'll be simulating).

    Perform 9 simulations of BPTI with the following parameters (assume default values for unspecified parameters, and use 6pti_300k unless using a different temperature):*

    1. Default parameters, Temperature = 300 (6pti_300K.rvc, other parameter values are in the header)
    2. Temperature = 50. (use 6pti_50.rvc)
    3. Temperature = 4500.0 (use 6pti_4500.rvc)
    4. kB=1000.0, kN=100.0
    5. nbCutoff = 0.25
    6. nbCutoff = 0.75
    7. time step dt = 0.01
    8. dt = 0.05
    9. dt = 0.1
    * Be aware that some of these parameter sets (certainly while varying dt) will result in very large values and may cause overflow errors. You might want to catch those errors when they occur, and should understand that this is one manifestation of the simulation becoming "unstable" - i.e. the molecule is behaving outside its physical bounds. We will be asking you to determine the time step value at which this seems to occur.

    These 9 separate simulations should result in 9 RVC and 9 energy output files. Convert each RVC file to a CRD file using the provided Python script. You MUST use the following file-naming scheme for each set of simulations (you will lose points if you don't!):
    1. default.rvc, default.crd, default.erg
    2. T50.rvc, T50.crd, T50.erg
    3. T4500.rvc, etc...
    4. kB1000kN100.rvc, etc...
    5. nb25.rvc, etc...
    6. nb75.rvc, etc...
    7. t01.rvc, etc...
    8. t05.rvc, etc...
    9. t1.rvc, etc...

    COMMON BUGS AND FIXES

  • Applying force in the wrong direction for each pair of atoms. This means that you are adding force to an atom when you should be subtracting it, and vice versa. A result of this is increasing energy. To fix, just switch the operation.

  • Counting your bonded pairs as nonbonded pairs. You should only consider a pair of atoms to be nonbonded if the initial distance between them is below the nbCutoff AND if the atoms are NOT BONDED. This will also result in increasing energy. Just make sure that you check whether two atoms are bondeds (using the connectivity information in the input RVC files) before labeling them as a nonbonded pair.

  • Forgetting to zero force and energy before updating them. You must set all forces and energies to zero before calculating them during each time step. If you don't, your energies will obviously increase.

  • Poorly optimized code... If your code takes forever (> 3 minutes) to complete one 1000 time step simulation, you should think carefully about your data structures and implementation, and perhaps your choice of programming language. See the section on code optimization.

  • << back to top


    Visualizing MD simulations

    The most interesting part of MD is arguably the visualization. Humans are very visual, and being able to see how the molecule moves and behaves allows us to understand and reason about the mechanisms underlying the molecule's behavior, whether it be as complicated as a protein recognition and binding event, or as simple as a molecule oscillating around a stable conformation (i.e. "breathing") under normal conditions. Typically, a simulation is saved as a number of frames, a frame being a recording of all the atom positions at a particular time point. Visualization software then "plays" each frame of the molecule in quick succession, just like an animation. What you can then see is how different properties of the molecule change over time - bond lengths, motion of amino acid side chains, speed and displacement of atoms from each other, overall molecule structure, etc. We will be asking you to view your simulation movies and answer questions about them in the Project 4 quiz.

    There are a number of visualization packages available for viewing MD simulations; one of the most popular is the free software VMD. VMD works on almost all platforms, including Windows XP, MacOS X (10.3.5 or later), and Linux. The brief guide to VMD that follows is based on the Mac version of the software.

    Download VMD (registration required), and follow the installation instructions.



    Figure 3.1. VMD Display window Figure 3.2. VMD Main and Molecule Browser windows Figure 3.3. Graphical Representations window

    To view one simulation:
    1. Open VMD
        - You should see a "Display" window and a "Main" window.
    2. Load the PDB file for the molecule
        - In the "Main" window, go to "File" and then "New Molecule..." This brings up a Molecule File Browser. Select "Browse..." select your PDB file and click "Load"
    3. Load the simulation file
        - In the Molecule File Browser, click "Browse..." again and select your simulation file (extension .crd, more about this format in the data output section)
    4. Play the simulation
        - You can adjust the speed of the simulation using the speed slider bar at the bottom right of the "Main" window
        - You can also rotate and zoom the simulation using your mouse
    5. Try different ways of drawing the molecule
        - In the "Main" window, go to "Graphics" and then "Representations..." This will bring up a "Representations" window. Choose drawing methods from the "Drawing Method" drop down menu (default is "Lines"). DON'T CHOOSE 'SURF' OR 'ISOSURFACE', AS THESE WILL TAKE TOO LONG TO RENDER. Good ones to choose are "Cartoon" or "New Cartoon" (to see secondary structure only), or "CPK" (atoms drawn as spheres, bonds as lines). "CPK" is probably the best representation for seeing details like bond length changes and side chain behavior.
    To load a new simulation, you must first clear the previous one. Select the molecule in the "Main" window and either right click on it or go to the "Molecule" menu. Choose "Delete Frames..." and click OK. Then load a simulation file as before.

    You can also view multiple simulations (for comparing/contrasting) by opening multiple instances of VMD.

    << back to top


    Deliverables
    1. Commented, working code. (30%)
    2. README.txt file containing, at the very least, instructions for running your code and full pseudocode(20%)
    3. 9 .rvc, .crd, and .erg files corresponding to the simulations we specified above. (20%)
    4. Completed Project 4 quiz: download PDF file here (25%)
      - Please submit your answers ONLY in a file named exactly "YourSunetId_quiz4.txt".
    5. Completed project survey: download PDF file here (5%)
      - Please submit your answers ONLY in a file named exactly "YourSunetId_survey.txt". This project is very new, so the survey is for us to get your detailed feedback on it. Please take the time to think about the questions and provide reasonably thoughtful responses (you will get the full 5% if you make an effort at this). Thanks in advance!

    << back to top


    Submission instructions

    Log in to an elaine machine:

    ssh sunetid@elaine.stanford.edu

    Place all of your files from 1-5 above into one directory, e.g.

    ~/biomedin214/p4/submit

    Please only include those listed above. Delete executable binaries if you have them (e.g. if you used Java/C/C++); we will compile your source code.

    cd into your submit folder:

    cd ~/biomedin214/p4/submit

    Run the class submission script:

    /usr/class/biomedin214/bin/submit p4

    Note the space before p4. Be sure to run the script from within your submission folder (where all your files to be submitted are)!

    You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submissions. Your submission date will be the final submission received, and late days will be charged accordingly. Don't forget the quiz and the survey!

    << back to top


    Code Optimization

    Due to the large number of operations required in a simulation, sub-optimal implementation may result in extremely long running times. There are a number of ways to increase efficiency and improve running time:
    1. Use C++ or Java
        These languages have proven to be significantly faster (1 simulation = 5 seconds) than scripting languages (a minute or more). You can still use Python, but you will probably need to incorporate all of the recommendations below.
    2. Avoid unnecessary looping
        There are many loops over all atoms required during each time step, but you can minimize the number of loops made
    3. Use smart data structures
        There are ways of storing the data that can improve the efficiency of look up and even the efficiency of looping. You might not have to loop over all atoms to compute energy and force, for example.

    << back to top