
Day 1: Gene modeling (extrinsic)

by schoof last modified 2009-11-02 23:58
Nowadays, sequences related to a gene of interest are usually already present in the databases. When available, they may be the fastest way to get a good gene prediction. Such a prediction is often more reliable than one based on coding bias, though one should be aware of the possibility of sequence errors, differential splicing, etc. Of course, finding the coding exons is not a complete gene prediction: UTRs, the transcription start site, the polyA site, etc. also belong to a gene but are not usually considered in gene prediction.

For alignment, we can use either proteins or mRNA-derived sequences. mRNA-derived sequences like ESTs or cDNAs do not contain introns. If they correspond exactly to the gene we are analyzing, they can serve as proof that it is indeed transcribed and that the gene model is correct.
Protein sequences can help to identify coding regions, even if they come from only remotely related species. For determining gene models, the GeneSeqer program has an exhaustive (slow) algorithm that aligns a protein to a DNA sequence while allowing for splice site recognition. In a real situation, BLAST programs would be useful for first picking up the matches in a database search.
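Because an mRNA-derived sequence lacks introns, aligning it to the genomic sequence yields a series of exon blocks, and the gaps between them are the implied introns. The following is a minimal sketch of that bookkeeping, not what GeneSeqer or Spidey do internally. The function name `introns_from_exons` and the example coordinates are hypothetical and chosen only for illustration.

```python
def introns_from_exons(exons):
    """Return the intron intervals implied by a list of exon blocks.

    exons: list of (start, end) genomic coordinates, 1-based and inclusive,
    as a spliced-alignment program might report them.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("exon blocks overlap")
        # Directly adjacent blocks leave no room for an intron
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon blocks (not taken from the HSAK1 sequence)
example_exons = [(101, 250), (400, 520), (900, 1100)]
print(introns_from_exons(example_exons))  # [(251, 399), (521, 899)]
```

The intron intervals obtained this way can then be inspected for splice site signals in the genomic sequence.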

You can use Artemis to assemble your predictions. Use "Open" to load the hsak.fa sequence file. You can download the reformatted output of HMMgene here and load the predicted exons using "Read entry..." in the file menu.
See the Artemis tutorial to learn how to create your own annotation.

Step 6. Gene prediction with expressed sequences using Gene2EST and GeneSeqer (these no longer seem to work, so we use Spidey)

Instead of Gene2EST, BLAST at NCBI can be used to pick up EST sequences that match your query. Remember to select human ESTs as the database.

    Use this EST database for GeneSeqer or Spidey. Remember to set the species to human.
    • In the output, do the suggested gene models correspond to your model from Step 5? What are differences, and how can you explain these?
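One way to make the comparison systematic is to list the exons of both models as coordinate pairs and sort them into shared exons, exons found in only one model, and exons that overlap but have different boundaries. The sketch below assumes you have copied the exon coordinates by hand from the program outputs. The function name `compare_models` and all coordinates are hypothetical.

```python
def compare_models(predicted, evidence):
    """Compare two gene models given as lists of (start, end) exon coordinates."""
    pred, evid = set(predicted), set(evidence)
    only_pred = sorted(pred - evid)
    only_evid = sorted(evid - pred)
    # Pairs of exons that overlap but do not share both boundaries
    boundary_differs = [
        (p, e)
        for p in only_pred
        for e in only_evid
        if p[0] <= e[1] and e[0] <= p[1]
    ]
    return {
        "shared": sorted(pred & evid),
        "only_predicted": only_pred,
        "only_evidence": only_evid,
        "boundary_differs": boundary_differs,
    }


# Hypothetical coordinates: a Step 5 model versus an EST-based model
step5_model = [(101, 250), (400, 520), (900, 1100)]
est_model = [(101, 250), (410, 520), (900, 1100), (1300, 1400)]
for category, exons in compare_models(step5_model, est_model).items():
    print(category, exons)
```

An exon pair listed under `boundary_differs` points to a disagreement about a splice site, while an exon seen only in the EST-based model may be an untranslated or weakly coding exon that the coding-bias prediction missed.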

Step 7. Comparison with annotation in a genome database

    Look at the annotated gene for our test sequence, HSAK1.
    Note that no cDNA has been sequenced for this gene: the gene structure was inferred from some transcript mapping and from protein homology.

    • Most of the elements of the gene are listed in the feature table.
      • Did you get the promoter?
      • Did you get the starting methionine? Does it obey Kozak's rules?
      • How many amino acids are in the first coding exon?
      • If you made any errors in the prediction, can you see where you went wrong?
      • There is a problem with the annotation of the first intron's acceptor:
        • Do you think this is:
          • an unusual splice site?
          • an annotation error made by the authors?
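Two of the questions above can be checked mechanically. The sketch below assumes the common reading of Kozak's rules, in which the most important positions are a purine at -3 and a G at +4 (counting the A of the ATG as +1), and the canonical GT...AG intron boundaries. The function names and the toy sequence are hypothetical. Apply the functions to hsak.fa with the coordinates from the feature table.

```python
def kozak_strength(seq, atg_index):
    """Classify the context of the ATG that starts at 0-based atg_index."""
    if seq[atg_index:atg_index + 3].upper() != "ATG":
        raise ValueError("no ATG at the given position")
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("not enough flanking sequence")
    minus3 = seq[atg_index - 3].upper()  # position -3 relative to the A of ATG
    plus4 = seq[atg_index + 3].upper()   # position +4, directly after the ATG
    hits = (minus3 in "AG") + (plus4 == "G")
    return ["weak", "adequate", "strong"][hits]


def intron_is_canonical(genomic, intron_start, intron_end):
    """True if the intron (1-based, inclusive) starts with GT and ends with AG."""
    intron = genomic[intron_start - 1:intron_end].upper()
    return intron.startswith("GT") and intron.endswith("AG")


# Toy sequence: Kozak-like start, a short first exon, a small intron, next exon
toy = "GCCACCATGGCG" + "GTAAGTTTTTCAG" + "GAT"
print(kozak_strength(toy, 6))            # strong
print(intron_is_canonical(toy, 13, 25))  # True
```

If the annotated first intron does not end in AG, that alone does not settle the question: rare non-canonical splice sites exist, so the annotated boundary should also be compared with the EST or protein alignment.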
    Now look up this gene in the ENSEMBL genome database. Use BLAST with the test sequence to find the correct genome region.
    • What information is displayed about this region in ENSEMBL? Which things that were used for gene prediction in this exercise can you find in the ENSEMBL display?
    Hope you enjoyed this exercise. In case the servers don't work, here are the results.

