MIS 214 / CS 274 - ASSIGNMENT NO. 1

Surfing the Web for Biological Data

DUE MONDAY, APRIL 3 2000

Objectives:

• Learn how to use World Wide Web browsers to access biological databases.
• Begin to understand the contents and semantics of these databases.
• Learn how to extract DNA/protein sequences based on keywords or sequence fragments.
• Learn the differences between the 20 amino acids and why these distinctions are important in structure.
• Understand the relationship between primary amino acid sequence, secondary structure and tertiary structure.
• Learn to use an interactive protein manipulation program called MAGE to study the three-dimensional structural characteristics of proteins.

Don't forget to submit your answers to the problems!

Database Surfing:

There are many biological databases on the web. For this assignment you are going to use GENBANK, the databank of DNA sequences that are known; SWISS-PROT, the databank of protein sequences; and PDB, the databank of protein 3D structures. There are also many online servers, such as the BLAST server for rapid sequence matching. A good place to start looking on the web is the web page for the course.

You may want to bookmark this page for future reference. Under "Assignments and Projects", there are some links to online bioinformatics resources. For example, clicking on "The National Center for Biotechnology Information (NCBI)" will lead to the homepage of NCBI, which is home to many online biological services such as GENBANK and BLAST.

GENBANK

The DNA databank, GENBANK, can be accessed from the NCBI homepage. Click on the GENBANK link located on the left side of the NCBI homepage. More information about GENBANK, including the number of bases and sequences currently available, can be found by following the "overview of the database" pointer under the "GenBank" heading on this page. There are many ways to search GENBANK. Click on "Search GenBank" (right below the "Overview" pointer), and you'll get a list of different search methods: Entrez allows you to retrieve protein/DNA sequences and 3D structures, BLAST is a sequence similarity search, and dbEST and dbSTS are databases of expressed sequence tags and sequence tagged sites. Go to Entrez and try a nucleotide search, then open the GenBank report of any hit to see a GenBank file. There are many fields in each GenBank file, including the definition of the molecule, the accession number, the DNA sequence (at the bottom of the file), the translation of the DNA code into the protein code (in the FEATURES field), references to the literature, etc. A detailed description of the GenBank file format can be found at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
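
To make the field layout more concrete, the following is a minimal Python sketch that pulls a few of the fields named above out of a GenBank flat file. The record text is a made-up toy example (not a real GenBank entry), the function name parse_genbank is hypothetical, and the sketch assumes that each field fits on a single line, which real entries often do not.

```python
import re

# Toy record invented for illustration; it is NOT a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001      24 bp    mRNA
DEFINITION  Hypothetical toy sequence for illustration.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     CDS             1..24
                     /translation="MVLSEGEW"
ORIGIN
        1 atggtgctgt ctgaaggtga atgg
//
"""


def parse_genbank(text):
    """Return the definition, accession, translation and DNA sequence."""
    record = {"definition": "", "accession": "", "translation": "", "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines look like "   1 atggtgctgt ..."; keep letters only.
            record["sequence"] += re.sub(r"[^a-zA-Z]", "", line)
        elif line.startswith("DEFINITION"):
            record["definition"] = line[12:].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line[12:].strip()
        elif line.strip().startswith("/translation="):
            record["translation"] = line.split("=", 1)[1].strip('"')
        elif line.startswith("ORIGIN"):    # the DNA sequence follows this line
            in_origin = True
    return record


record = parse_genbank(TOY_RECORD)
for field, value in record.items():
    print(field, ":", value)
```

For real entries, where definitions and translations wrap over several lines, a maintained parser is likely a better choice than this sketch.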

SWISS-PROT

SWISS-PROT, the protein sequence databank, is located at Geneva University Hospital. Go to the SWISS-PROT home page and try searching for the keywords "sperm whale myoglobin". Click on any hit to look at a SWISS-PROT file. SWISS-PROT entries have a totally different format from GENBANK entries. Each entry has many records, including the ID, the accession number (AC), the protein sequence (at the bottom of the file), as well as cross-references to other databases. To learn more about SWISS-PROT, go to its release notes and user manual.
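
As a rough illustration of how different the SWISS-PROT layout is, here is a minimal Python sketch that reads the ID, AC and SQ records of an entry. It relies on the two-letter line codes at the start of each line. The entry below is a made-up toy (not a real SWISS-PROT record), and parse_swissprot is a hypothetical helper name.

```python
# Toy entry invented for illustration; it is NOT a real SWISS-PROT record.
TOY_ENTRY = """\
ID   TOY_EXAMPLE     STANDARD;      PRT;     8 AA.
AC   TOY001;
DE   HYPOTHETICAL TOY PROTEIN.
SQ   SEQUENCE   8 AA;
     MVLSEGEW
//
"""


def parse_swissprot(text):
    """Return the entry name, accession numbers and protein sequence."""
    entry = {"id": "", "accessions": [], "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        code, rest = line[:2], line[5:]    # line code, then the data
        if code == "//":                   # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines are indented and may contain spaces between blocks.
            entry["sequence"] += rest.replace(" ", "")
        elif code == "ID":
            entry["id"] = rest.split()[0]
        elif code == "AC":
            entry["accessions"] += [a.strip() for a in rest.split(";") if a.strip()]
        elif code == "SQ":                 # the sequence follows this line
            in_sequence = True
    return entry


entry = parse_swissprot(TOY_ENTRY)
print(entry)
```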

PDB

You can get 3D protein structure files from the Protein Data Bank, PDB. Go back to Russ's hotlist webpage and follow the "PDB WWW Server" link. It will lead you to the PDB homepage. To search the PDB, follow the link "SearchFields". Type your keyword(s) into the appropriate field(s) and click "Search". You may get a list of matching entries. Select the entries you are interested in, choose a desired action from the pull-down menu at the top, and hit "Go". For example, you can choose "Download Structures and Sequences" and click "Go" to download a PDB formatted file. After unzipping the file, take a look at the PDB file with your favorite text editor (emacs, vi, Word, etc). You can use the MAGE program (described below) to view the structure encoded in the file. Each PDB entry also has many records, such as COMPND, SOURCE, SEQRES (the amino acid sequence), etc. The xyz coordinates of the atoms in a protein are given in the ATOM records. The coordinates are the first three decimal numbers in each ATOM record.
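
The ATOM records can also be read programmatically. The sketch below is a minimal Python example that assumes the standard fixed-column layout of the PDB format (atom name in columns 13-16, residue name in 18-20, residue number in 23-26, and x, y, z in columns 31-38, 39-46 and 47-54). The three ATOM lines are toy data invented for illustration, and parse_atom_line is a hypothetical helper name.

```python
# Toy ATOM records invented for illustration; coordinates are not from a real entry.
TOY_PDB_LINES = [
    "ATOM      1  N   VAL A   1      -1.200   0.800   0.000  1.00  0.00",
    "ATOM      2  CA  VAL A   1       0.000   0.000   0.000  1.00  0.00",
    "ATOM      3  CA  LEU A   2       3.000   2.000   1.500  1.00  0.00",
    "TER",
]


def parse_atom_line(line):
    """Parse one ATOM record using the fixed column layout of the PDB format."""
    return {
        "atom": line[12:16].strip(),       # atom name, e.g. "CA"
        "residue": line[17:20].strip(),    # three-letter residue name
        "number": int(line[22:26]),        # residue sequence number
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Keep only ATOM records, then only the alpha carbons (atom name "CA").
atoms = [parse_atom_line(line) for line in TOY_PDB_LINES if line.startswith("ATOM")]
alpha_carbons = [atom for atom in atoms if atom["atom"] == "CA"]

for atom in alpha_carbons:
    print(atom["residue"], atom["number"], atom["xyz"])
```

Slicing by column rather than splitting on whitespace matters in real files, where neighboring fields can run together without a space between them.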

BLAST

You can search sequence databases for other sequences that are similar to the one you have using the online BLAST server. BLAST can be accessed from the NCBI homepage. Click "BLAST" in the bar at the top of the page. Choose "Basic BLAST search" for now. You may need to change the default Program and default Database: if you are searching with a DNA sequence, you can choose Program "blastn" and Database "nr"; if you are searching with a protein sequence, you should choose Program "blastp" and Database "swissprot". Type or cut and paste your query sequence into the window, and press "Submit Query". The search may take a few minutes.
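
The choice between the two settings can be written down as a small rule. The Python sketch below guesses whether a query looks like DNA or protein and returns the Program and Database named above. The 90% threshold and the function name choose_blast_settings are assumptions made for illustration; they are not part of the BLAST server.

```python
DNA_LETTERS = set("ACGTN")


def choose_blast_settings(sequence):
    """Guess the query type and return the settings suggested in the text."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty query sequence")
    dna_fraction = sum(c in DNA_LETTERS for c in letters) / len(letters)
    # Heuristic assumption: a query that is at least 90% A/C/G/T/N is DNA.
    if dna_fraction >= 0.9:
        return {"program": "blastn", "database": "nr"}
    return {"program": "blastp", "database": "swissprot"}


print(choose_blast_settings("atggtgctgtctgaaggtgaatgg"))
print(choose_blast_settings("MVLSEGEWQLVLHVWAKVEAD"))
```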

Amino Acids and Protein Structure:

Proteins are essential to all forms of life, helping them carry out some of their most basic biological processes. Their functions are diverse, particularly in humans where they can act as hormones, enzymes, oxygen carriers, structural components, messenger molecules, and antibodies — just to name a few. This diversity in function is reinforced by the estimated 50,000-100,000 different protein molecules coded for by human DNA.

Because of their fundamental role in biological processes, elucidating protein function is a subject of intense interest within the scientific community. Many human diseases can be traced back to a defective protein (sickle cell anemia, cystic fibrosis, alpha and beta thalassemias, etc.), and hence there is much hope that greater understanding of protein function will aid in treatment of these diseases.

How does one go about studying protein function? One way is through exhaustive biochemical experimentation, where biological processes are mapped out in detail and linked to other processes. Another way is by studying the three-dimensional structure, or conformation, of a protein. This structure-to-function relationship is an important one, and it is the approach we will explore in this assignment.

BASIC BUILDING BLOCKS: AMINO ACIDS

Amino acids (= amino acid residues) are the basic building blocks of proteins. There are twenty different amino acids coded for by DNA, and these amino acids are usually referred to by their three-letter or one-letter codes (see table below). The amino acid Tyrosine, for example, can be referred to as Tyr or Y. The amino acids share a common core structure which comprises the "backbone" of the protein, but each amino acid has a unique "side chain" group that comes off the main backbone. These side chains are the critical components for distinguishing between the different amino acids biochemically, electrostatically, and structurally.

The Genetic Code
 
One-letter code  Amino acid     Three-letter code  Genetic code (* = any base)
A                Alanine        Ala                GC*
C                Cysteine       Cys                UGU, UGC
D                Aspartic Acid  Asp                GAU, GAC
E                Glutamic Acid  Glu                GAA, GAG
F                Phenylalanine  Phe                UUU, UUC
G                Glycine        Gly                GG*
H                Histidine      His                CAU, CAC
I                Isoleucine     Ile                AUU, AUC, AUA
K                Lysine         Lys                AAA, AAG
L                Leucine        Leu                UUA, UUG, CU*
M                Methionine     Met                AUG
N                Asparagine     Asn                AAU, AAC
P                Proline        Pro                CC*
Q                Glutamine      Gln                CAA, CAG
R                Arginine       Arg                CG*, AGA, AGG
S                Serine         Ser                UC*, AGU, AGC
T                Threonine      Thr                AC*
V                Valine         Val                GU*
W                Tryptophan     Trp                UGG
Y                Tyrosine       Tyr                UAU, UAC
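
The table can be turned directly into a lookup for translating RNA into protein. The Python sketch below expands the compact notation (reading "*" as any of the four bases U, C, A, G) into a full codon dictionary. The three stop codons UAA, UAG and UGA are not listed in the table and are added here; the function names and the example RNA string are made up for illustration.

```python
# The table above, keyed by one-letter code, in its compact notation.
COMPACT_CODE = {
    "A": "GC*", "C": "UGU UGC", "D": "GAU GAC", "E": "GAA GAG",
    "F": "UUU UUC", "G": "GG*", "H": "CAU CAC", "I": "AUU AUC AUA",
    "K": "AAA AAG", "L": "UUA UUG CU*", "M": "AUG", "N": "AAU AAC",
    "P": "CC*", "Q": "CAA CAG", "R": "CG* AGA AGG", "S": "UC* AGU AGC",
    "T": "AC*", "V": "GU*", "W": "UGG", "Y": "UAU UAC",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}  # not in the table; they end translation


def expand(pattern):
    """Expand a pattern such as 'GC*' into its four concrete codons."""
    if pattern.endswith("*"):
        return [pattern[:2] + base for base in "UCAG"]
    return [pattern]


CODON_TO_AA = {
    codon: amino_acid
    for amino_acid, patterns in COMPACT_CODE.items()
    for pattern in patterns.split()
    for codon in expand(pattern)
}


def translate(rna):
    """Translate an RNA (or DNA coding-strand) string, stopping at a stop codon."""
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TO_AA[codon])
    return "".join(protein)


print(len(CODON_TO_AA))                          # 61 sense codons
print(translate("AUGGUGCUGUCUGAAGGUGAAUGGUAA"))  # MVLSEGEW
```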

One important property of amino acids is the degree to which they like to be near water (since proteins often exist in aqueous environments). Amino acids that have polar or charged side chains tend to be categorized as hydrophilic ("water loving"), while those that have non-polar side chains tend to be classified as hydrophobic ("water avoiding"). The KYTE-DOOLITTLE table below is a popular classification of the amino acids according to this characteristic:

Kyte-Doolittle hydrophobicity values of the 20 amino acids
 
Category     Kyte-Doolittle value  One-letter code  Amino acid     Three-letter code
Hydrophobic  +4.5                  I                Isoleucine     Ile
Hydrophobic  +4.2                  V                Valine         Val
Hydrophobic  +3.8                  L                Leucine        Leu
Hydrophobic  +2.8                  F                Phenylalanine  Phe
Hydrophobic  +2.5                  C                Cysteine       Cys
Hydrophobic  +1.9                  M                Methionine     Met
Hydrophobic  +1.8                  A                Alanine        Ala
Neutral      –0.4                  G                Glycine        Gly
Neutral      –0.7                  T                Threonine      Thr
Neutral      –0.8                  S                Serine         Ser
Neutral      –0.9                  W                Tryptophan     Trp
Neutral      –1.3                  Y                Tyrosine       Tyr
Neutral      –1.6                  P                Proline        Pro
Hydrophilic  –3.2                  H                Histidine      His
Hydrophilic  –3.5                  Q                Glutamine      Gln
Hydrophilic  –3.5                  N                Asparagine     Asn
Hydrophilic  –3.5                  E                Glutamic Acid  Glu
Hydrophilic  –3.5                  D                Aspartic Acid  Asp
Hydrophilic  –3.9                  K                Lysine         Lys
Hydrophilic  –4.5                  R                Arginine       Arg
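
A common use of these values is a hydropathy profile: the values are averaged over a sliding window of consecutive residues, so that stretches of hydrophobic or hydrophilic residues stand out. The Python sketch below uses the published Kyte-Doolittle scale (in which arginine is –4.5). The window size of 7 and the example sequence are assumptions chosen for illustration.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "Q": -3.5, "N": -3.5, "E": -3.5, "D": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(sequence, window=7):
    """Average Kyte-Doolittle value over each window of consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Example protein fragment used only for illustration.
profile = hydropathy_profile("MVLSEGEWQLVLHVWAKVEAD")
for position, score in enumerate(profile, start=1):
    label = "hydrophobic" if score > 0 else "hydrophilic"
    print(f"window starting at residue {position}: {score:+.2f} ({label})")
```

Positive window averages suggest a stretch that prefers to be buried away from water; negative averages suggest a stretch that is likely to be exposed.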

A SEQUENCE OF AMINO ACIDS: THE PRIMARY STRUCTURE

To build a protein, the cell needs information on the sequence of amino acids that need to be linked. The DNA code (three DNA nucleotides code for one amino acid) is transcribed to messenger RNA, which then travels in the cell to the ribosome, where translation of the RNA sequence into an amino acid sequence occurs. The result is a string of amino acids that will quickly fold into its native three-dimensional shape. Hence the three-dimensional structure of a protein is uniquely determined by its amino acid sequence (or its primary structure). Changes to this amino acid sequence can greatly alter the protein's structure and corrupt its original function. Sickle cell anemia, for example, is caused by a single amino acid substitution (glutamic acid to valine) in hemoglobin, the oxygen-carrying protein of red blood cells.

LOCALIZED STRUCTURES: SECONDARY STRUCTURE ELEMENTS

When examining different protein structures, one can find commonly repeated, stable substructures that make up these proteins. These substructures usually involve relatively localized interactions between amino acids. The most common ones are called the alpha helix and the beta sheet. In this assignment we will be focusing on α-helices. α-helices are spiral-staircase-like structures first predicted by Linus Pauling in 1951. The carbonyl group (CO) of each amino acid in an α-helix is hydrogen bonded to the NH group of the amino acid 4 positions ahead in the sequence. This results in a right-handed, spiral staircase structure with each amino acid translated 1.5 Å along the helix axis. Myoglobin, for example, is composed of eight α-helices in total, with single strands of amino acids connecting them. Certain amino acids have proclivities for being in alpha helices, and others, like the amino acid proline, have the ability to "break" them. Comparing and aligning proteins in terms of these substructures often makes more sense than comparing on the basis of sequence alone.
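
The helix geometry described above can be sketched numerically. The Python example below places the alpha carbons of an ideal helix using the 1.5 Å rise from the text plus two textbook values that are assumptions here, not part of the assignment: a rotation of 100 degrees per residue (3.6 residues per turn) and an alpha-carbon radius of about 2.3 Å. It also lists the CO(i) to NH(i+4) hydrogen bond pairs.

```python
import math

RISE_PER_RESIDUE = 1.5     # Angstroms along the helix axis (from the text)
TURN_PER_RESIDUE = 100.0   # degrees; assumption: 3.6 residues per turn
CA_RADIUS = 2.3            # Angstroms; assumed radius of the alpha-carbon trace


def ideal_helix_ca(n_residues):
    """Alpha-carbon coordinates of an ideal right-handed helix along the z axis."""
    coords = []
    for i in range(n_residues):
        angle = math.radians(i * TURN_PER_RESIDUE)
        coords.append((
            CA_RADIUS * math.cos(angle),
            CA_RADIUS * math.sin(angle),
            i * RISE_PER_RESIDUE,
        ))
    return coords


def hydrogen_bond_pairs(n_residues):
    """CO of residue i pairs with NH of residue i + 4 (1-based numbering)."""
    return [(i, i + 4) for i in range(1, n_residues - 3)]


helix = ideal_helix_ca(8)
pitch = 360.0 / TURN_PER_RESIDUE * RISE_PER_RESIDUE   # rise per full turn
neighbor_distance = math.dist(helix[0], helix[1])     # consecutive alpha carbons

print("pitch per turn (Angstroms):", pitch)
print("distance between consecutive alpha carbons:", round(neighbor_distance, 2))
print("hydrogen bond pairs:", hydrogen_bond_pairs(8))
```

Under these assumptions, consecutive alpha carbons come out about 3.8 Å apart, which is the kind of distance reported when clicking on neighboring residues in a structure viewer.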

THE TERTIARY STRUCTURE

The tertiary structure refers to the final three dimensional structure of a protein that presumably is the result of localized interactions creating secondary structure elements, relatively long distance bonds and interactions between distant parts of the protein, as well as interactions the protein has with heteroatoms and its environment.

It appears that the sequence of amino acids is all the information we need to predict the three dimensional structure of the resulting protein, but this problem (the protein folding problem) has turned out to be VERY difficult. We have discussed methods that people have developed for predicting tertiary structure (3-D structure) from primary sequence data.

Using MAGE

MAGE is a program for interactive 3-D visualization of a protein. Go to the MAGE website and download version 5.35 of the MAGE program files. There will be two main programs you will be dealing with:

a) PREKIN (or PKIN) - Program that outputs files readable by MAGE
b) MAGE - Lets you view and manipulate 3-D protein structures. Takes PREKIN files as input

PREPARE FILES USING PREKIN

• Start up the PREKIN ("Prepare Kinemage") application software. (The icon for PREKIN is a hand and a piece of blank paper.)
• In the 1st window that comes up in PREKIN, click on "Proceed." (Or, you can click on "Explanation" the first time).
• The 2nd window that comes up in PREKIN is the "open file" window. Here you should open the PDB file you want to view. For this assignment, you would open your PDB file for 1MUP. If the saved PDB file for 1MUP has a .txt extension, you can drag this file and drop it into PREKIN to open it.
• The 3rd window asks where you want to store the file that results from running PREKIN. Name and put the resulting file in an appropriate place (Remember to end the filename with a ".kin" suffix to indicate it is a PREKIN processed file).
• In the 4th window, either select "backbones" or click on "Selection of built-in scripts".
• If you selected built-in scripts, you have a choice to make. In this assignment, just use the "aasc - individual amino acids in separate sets and ca-ca".
• In the 6th window, click on "OK." The resulting output ("kinemage") file will take anywhere from a few seconds to a few minutes to prepare (depending on whether you asked for side chains or not). When done, pull down the "file" menu of the application and then "Quit" the PREKIN application. Your Kinemage should now be on the desktop (or wherever you designated).

VIEW FILES USING MAGE

Start up the MAGE application software (we used version 5.70 for these instructions). The icon for MAGE consists of several overlaid squares and a finger. It is desirable to use a color monitor for viewing kinemages, if available.

• In the 1st window that comes up in MAGE, click on "Proceed."
• Pull down the "File" menu, click on "open." Open your newly-created kinemage. Your kinemage should now appear. (Click in the large black area if it doesn't immediately come up).
• You can rotate, move, and manipulate the protein in MAGE using the mouse. Try the "zoom" feature: mouse on the arrow at the top (–) end of the slider at the top of the right edge of the Kinemage window to "zoom out" a bit.
• You can find a particular amino acid type by going to the "other" menu under "find" and typing in the amino acid you want. Similarly, you can find the amino acid at the position you want by typing the position number under "find."
• Calculating distances between atoms: Locate a particular residue by pulling down the "Find" menu on "Tools", typing the residue number in the "primary character string" window, and clicking on "search from beginning." If you search for residue 1, the first amino acid residue should now be marked on your kinemage with an "X" marker. Now follow along the white backbone of the protein until you find residue 2 (located where the backbone bends) and click on it. In the lower middle of the screen will be a number (like 3.907). This is the distance between residues 1 and 2 in Angstrom units. For example, in myoglobin the first residue is valine. Once you have located it, "ca val 1" appears in the lower left of the kinemage screen, indicating that you have located the alpha carbon (Cα) of the amino acid residue valine (called "val" in the 3-letter code and "V" in the 1-letter code) at position 1 of the protein. After you click on residue 2, "ca leu 2" in the lower left of the kinemage screen indicates that you have located the alpha carbon (Cα) of the amino acid residue leucine (called "leu" in the 3-letter code and "L" in the 1-letter code) at position 2. The 3.907 in the lower middle of the screen indicates that the alpha carbon of "Val 1" is 3.907 Angstroms from the alpha carbon of "Leu 2." These distances are automatically calculated between every consecutive pair of items on which you click.

Now that you know how to use the MAGE program, you might want to look at some of the demonstration kinemages provided on the course web page under lecture 2. You are also ready to answer the problems for Assignment 1.
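
The distance reported when clicking on two atoms is the ordinary Euclidean distance between their xyz coordinates. The Python sketch below reproduces that calculation for consecutive items in a list. The coordinates are toy values invented for illustration (they are not taken from a real myoglobin file), and the labels only mimic the style of label shown in the kinemage window.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points, in the same units."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Toy alpha-carbon coordinates in Angstroms, invented for illustration.
clicked = [
    ("ca val 1", (0.0, 0.0, 0.0)),
    ("ca leu 2", (3.0, 2.0, 1.5)),
    ("ca ser 3", (6.5, 1.0, 2.5)),
]

# Report the distance between every consecutive pair of clicked items.
for (label_a, xyz_a), (label_b, xyz_b) in zip(clicked, clicked[1:]):
    print(f"{label_a} -> {label_b}: {distance(xyz_a, xyz_b):.3f}")
```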

Troubleshooting

If you have trouble installing MAGE or PREKIN on your favorite machine, then log into a leland sun machine (e.g. myth1 through myth9), and copy both the mage and prekin executables that we precompiled (i.e. "right-click" and choose "save link as..." on the two hyperlinks in this sentence). Make sure you put the mage and prekin executables in the same directory. Once you've copied the files, set their permissions with the command:

>chmod 755 mage prekin

You can now convert 1MUP.pdb into a kinemage file with the command:

>./prekin 1MUP.pdb


and view the resulting kinemage structure with:

>./mage 1MUP.kin


For questions, contact Josh Stuart.