Structure prediction of globular and membrane proteins

 


For the course in "Protein structure prediction of membrane and globular proteins" the main task is to do a project and a presentation on the same subject. In the project you will use your experience from earlier courses to develop a predictor for a "feature" of a globular or membrane protein. Please note that it is perfectly acceptable to talk about the project with other students, but that you are expected to write your code and work individually.



The following parts need to be fulfilled to pass the course:

  1. Read chapters 1-3 of the SVM book and write a two-page summary. This should be included in the final report.
  2. Prepare and give one seminar about a scientific paper, see separate list.
  3. Participate in all seminars. If you cannot participate due to illness, you need to write a longer report that discusses and compares the papers presented at the seminar you missed.
  4. Write a half-page summary of each scientific paper and include it in the final report (see below).
  5. Develop a functional predictor for the "feature" assigned to you.
  6. Write a report where you review the field for your type of predictor and compare your predictor with others.
  7. For top grades develop a web-server that accepts a sequence as input and predicts "your" feature.
  8. Also for top grades you need to do this during the period of the course.

 


 

Seminars and lectures

Many of the lectures in the protein physics and molecular modeling courses are relevant for this course. We therefore strongly recommend that all students take these courses before starting this one. At each seminar, one or two students present one scientific paper each. All students should include a half-page description of each paper in their final report. A separate schedule for the seminars will be given later.

                               

  • Lecture - Intro - March 21
    Machine learning - general concepts (SVM chapter 1)
    * What is the difference between a) supervised and unsupervised learning, and b) binary classification and regression?
    * What is meant by generalization in the context of machine learning?
  • Seminar 1 - April 1 - Secondary structure prediction of globular proteins
  • Seminar 2 - April 8 - Prediction of membrane proteins
  • Seminar 3 - April 15 - Prediction of protein structure

                                     

The project

Each project step should result in a (small) program, written in Python or any other language, that performs the intended task. The data needed for your project as well as individual assignments will be given to you at the beginning of the course. The assistants will be available to help you with the assignments scheduled for the current week. There will be very limited possibilities to get support with assignments from previous weeks, and there will be no support after the end of the course.

                                      

Teachers

Please send a short (five-line) summary to Kristoffer before each Thursday afternoon. A schedule for your project follows below.

 

Support is given at specific times and by email via the bioinfo_master@sbc.su.se mailing list.

Mar 21 10:00 in Green Room - Start of project

Here, the individual projects will be given out. A short description of the project will be given.

Workplan Week A
  • There are two different datasets, one for globular proteins and one for membrane proteins. Which dataset to use is given by the project description.
  • The data for each protein contains the amino acid sequence and also a number of "features" for each amino acid residue.
  • Your first task is to write a program that for each protein (file) can extract the amino acids and the associated feature you have been assigned to.
  • The second step is to homology reduce the dataset in a proper way, using blastclust.
  • The third part is to write a program that takes all your homology reduced sequences and splits the data into 5 groups so that it can be used for cross-validated training.
  • The program should format the output in such a way that it can be used with SVM light as input (your future predictor) using sparse encoding.
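
The last two steps above can be sketched as follows. This is only a minimal illustration, not the required solution: the function names, the choice to split by whole protein (so that residues from one protein never end up in both training and test data), and the extra feature number 21 for non-standard residues are all assumptions. The line layout (`label index:value`) follows the sparse SVM light input format, where feature numbers start at 1.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# SVM light feature numbers start at 1, so alanine becomes feature 1.
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
# Assumption: any non-standard residue (e.g. X) shares one extra feature.
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1


def split_into_folds(protein_ids, n_folds=5, seed=0):
    """Shuffle whole proteins and deal them into n_folds groups."""
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def svmlight_line(residue, is_positive):
    """Encode a single residue as one sparse one-hot SVM light example."""
    label = "+1" if is_positive else "-1"
    index = AA_INDEX.get(residue.upper(), UNKNOWN_INDEX)
    return f"{label} {index}:1"
```

For cross-validation, each fold is used once as the test set while the remaining four folds are written to the training file.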
     

                                       

Workplan Week B
  • Perform cross-validated training, run SVM light and write a program that evaluates your first training round. To use SVM light, type "module add svmlight" in a terminal window (without the quotation marks).
  • Compare the predictions against the true assignments. Define the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Calculate an MCC value for the whole cross-validated set. You can find information about the MCC value here: http://en.wikipedia.org/wiki/Matthew...on_Coefficient
  • Before next lab you should send a short email to the assistant about which predictor you are making, and the results of your initial predictor with cross-validated data. Send the results for the number of TP, TN, FP, FN and the resulting MCC-value. Try different kernels.
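
As a rough sketch of the evaluation step, the code below counts TP, TN, FP and FN and computes the Matthews correlation coefficient, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names are hypothetical. It assumes that true labels are +1/-1 and that a prediction greater than zero counts as positive, which matches taking the sign of a real-valued classifier output. Returning 0 when the denominator is zero is a common convention, not something prescribed by the course.

```python
import math


def confusion_counts(true_labels, predictions):
    """Count TP, TN, FP, FN; values > 0 are treated as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predictions):
        if pred > 0:
            if truth > 0:
                tp += 1
            else:
                fp += 1
        else:
            if truth > 0:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

To get one MCC value for the whole cross-validated set, sum the four counts over all five test folds before calling `mcc`.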
Workplan Week C
  • Your predictor has so far been based only on single amino acids. Your task is to improve the predictor by including A) a window of position-specific information and B) evolutionarily related sequences by using output from PSI-BLAST.

  • First, include the local sequence environment by using different window sizes: to predict the feature for residue i with a window of 2k+1 residues, include the sequence information for all residues i-k to i+k.

  • Secondly, include information from evolutionarily related sequences. This should be done by including information from PSI-BLAST sequence alignments. To save time, the PSI-BLAST output files for the training data have been calculated for you (/afs/pdc.kth.se/home/b/bjornw/Public/profiles/). However, you should know how to do it yourself, since it is needed for generating input data for your final predictor (see below).
  • Report to the assistant before next time A) how you included the information from neighboring residues and B) how you included the PSI-Blast information and C) both. Please also report how your MCC-values have changed from last time in A), B) and C).                                          
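
One possible way to build the window input for part A) is sketched below. It is an illustration under stated assumptions: the function names are made up, positions are 0-based, each of the 2k+1 window slots gets its own block of 20 one-hot features, and slots that fall outside the sequence (or hold a non-standard residue) are simply left empty.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_AA = len(AMINO_ACIDS)
AA_OFFSET = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def window_features(sequence, i, k):
    """Sparse (feature number, value) pairs for residues i-k .. i+k."""
    features = []
    for slot, pos in enumerate(range(i - k, i + k + 1)):
        if pos < 0 or pos >= len(sequence):
            continue  # window extends past a terminus: slot stays all zero
        offset = AA_OFFSET.get(sequence[pos])
        if offset is None:
            continue  # non-standard residue: slot stays all zero
        # Each slot owns N_AA consecutive feature numbers, starting at 1.
        features.append((slot * N_AA + offset + 1, 1))
    return features


def format_example(label, features):
    """Join a label and sparse features into one SVM light input line."""
    body = " ".join(f"{index}:{value}" for index, value in features)
    return f"{label} {body}".rstrip()
```

For part B), the same slot layout could be reused by replacing the single 1 per slot with the 20 profile values that PSI-BLAST reports for that position; how those values are scaled is a parameter worth investigating.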

 

      How to run PSI-BLAST:

bash$ module add blast
#This will add the blast module which contains PSI-BLAST in the program called "blastpgp"

bash$ blastpgp -
#This will bring up the blast help page

#The profiles for the training data were generated by searching UNIREF90 with the following command
bash$ blastpgp -b 0 -j 3 -h 0.001 -d /afs/pdc.kth.se/home/b/bjornw/Public/DB/uniref90.fasta -F F -i <fasta_file> -Q <profile>
DEADLINE !! Apr 8 HALF TIME REPORT

Write one page that (1) describes your progress and (2) compares your method with previous work. Send it in an email to Kristoffer.

                 

Workplan: Week D
  • Continue improving the predictor, gather and analyze results. Your grades will depend on how carefully you have investigated the effect of different parameters, individually and in combination, and how you use them in the final predictor.
  • Write a script that takes a protein sequence in fasta format as input and then predicts the presence of your type of local feature.
  • For higher grades: start working on the web-server by learning PHP by yourself.
  • Write the pre-report and final report
  • Make sure the program works
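
The input side of the prediction script could start from a small fasta reader like the one below. This is a sketch only: `read_fasta` is a hypothetical helper, it accepts any iterable of lines (for example an open file), and it upper-cases the sequence so that later encoding steps see one consistent alphabet.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) tuples from fasta-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Typical use would be `with open(path) as handle: records = read_fasta(handle)`, followed by encoding each sequence in the same way as the training data.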
DEADLINE !! Apr 12 - PRE-REPORT
  • Optional submission of "pre-report" for feedback. Late pre-reports will not receive any feedback. Feedback will be provided Apr 15 at the latest.
    • It is a good idea to submit your prediction program already at this point, because a non-working prediction program at the end of the course will result in a failed course.
  • For a top grade (A or B):
    • Set up a web-server using the provided PHP templates below. The web-server should accept a submitted sequence and return a prediction for it.
    • Provide nice graphical output.
    • Extend the basic predictions. If your basic task is to predict the presence of coils, extend it so that your final predictor can assign a residue to one of three states: coil, helix or sheet. Similar extensions can be made if your task is to predict whether a residue is located within a specific distance from the membrane center or has a specific accessibility level.
    • A good analysis of which parameters you have optimized.
  • If you do not build a web-server, the report should include a functional program that provides the same functionality, i.e. the ability to run a program and predict the requested feature.

                                       

Information about the webserver
The following steps should work.

1) Put your stuff in ~/public_html/
2) Make sure it has the right permissions (remember how to do AFS stuff)
3) Try it on the following URL http://bogart.cbr.su.se/~yournam/

PHP and MySQL should be installed; if you need something else, please contact the assistant(s).

 

 

Workplan Week E (Final week)
 
  • Finalize the server and report
 
DEADLINE !!! Apr 29 - FINAL REPORT
  • Submission of the final written report. Note that the report should include a link to the web-server and a description of it. The written report should be written in the form of a "scientific" paper, containing the following sections: Abstract, Introduction, Methods, Results, Discussion and References. It should be between 10 and 30 pages long (single-spaced, 12-point font) and contain between 2 and 6 figures/tables (of which at least one should be a sensitivity-specificity plot).
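
The data points for a sensitivity-specificity plot can be produced by sweeping a decision threshold over the real-valued classifier outputs, as in the sketch below. The function name is hypothetical, and the definitions used here are sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). Some papers in this field instead use "specificity" for TP/(TP+FP), so state in the report which definition you use.

```python
def sensitivity_specificity_curve(true_labels, scores, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold."""
    points = []
    for threshold in thresholds:
        tp = tn = fp = fn = 0
        for truth, score in zip(true_labels, scores):
            predicted_positive = score > threshold
            if predicted_positive and truth > 0:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif truth > 0:
                fn += 1
            else:
                tn += 1
        # Guard against classes that are absent from the data.
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        points.append((threshold, sensitivity, specificity))
    return points
```

The resulting list can be passed to any plotting tool, with one curve per parameter setting you want to compare.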

                                       

Deadlines

The report, the program and, if applicable, the web-server all have to be submitted by Apr 29.

The final grade will be based on the quality of the report and of the web-server. If the project is not finished in time, the program does not work correctly or your report is not good enough, you will fail. There will be a possibility to submit it again by Aug 20.

The grading will be based on the ability to do all assignments in time, the quality of the program and the written report.

                                       

Projects:

Globular proteins

 

  1. Coil predictor, identifying BCTS in DSSP
  2. Helix predictor, identifying HGI in DSSP
  3. Sheet predictor, identifying E in DSSP
  4. Buried residue predictor, identifying (<=25) in NACCESS
 
Membrane proteins
  1. Coil predictor, identifying BCTS in DSSP
  2. Helix predictor, identifying HGI in DSSP
  3. Buried residue predictor, identifying (<=25) in NACCESS
  4. Interface region predictor, identifying 10 < |Z| <= 22
  5. Transmembrane predictor, identifying M in the topology
  6. Predict |Z| (absolute value of the Z-coordinate), regression problem.
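
The class definitions in the two project lists can be turned into +1/-1 training labels along the following lines. This is a sketch with hypothetical function names. It assumes that the per-residue DSSP states are available as a string, and that the accessibility values and Z-coordinates are available as lists of numbers in the same units as the thresholds above.

```python
# DSSP states counted as the positive class for each project type.
POSITIVE_STATES = {
    "coil": set("BCTS"),
    "helix": set("HGI"),
    "sheet": set("E"),
}


def binary_labels(dssp_string, project):
    """Map per-residue DSSP states to +1/-1 for one project type."""
    positives = POSITIVE_STATES[project]
    return [1 if state in positives else -1 for state in dssp_string]


def buried_labels(accessibilities, cutoff=25.0):
    """A residue is buried (+1) if its accessibility is <= cutoff."""
    return [1 if value <= cutoff else -1 for value in accessibilities]


def interface_labels(z_coords, inner=10.0, outer=22.0):
    """A residue is in the interface region (+1) if inner < |Z| <= outer."""
    return [1 if inner < abs(z) <= outer else -1 for z in z_coords]
```

For the |Z| regression project no thresholding is needed; the target value for each residue is simply `abs(z)`.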

 

Students:

  1. Linus Östberg
  2. Ino de Bruijn
  3. Ganapathi Varma
  4. Biprodas Biswas
  5. Nair Satish
  6. Bjørn Sponberg
  7. Simon Kebede
