Day 1: Gene modeling (extrinsic)
For alignment, we can use either proteins or mRNA-derived sequences. mRNA-derived
sequences like ESTs or cDNAs do not contain introns. If they correspond exactly
to the gene we are analyzing, they can serve as proof that it is indeed transcribed
and that the gene model is correct.
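The idea that an intron-free cDNA can confirm a gene model can be sketched in a few lines of Python. This is only an illustration: the sequences and exon coordinates are invented, and `splice_exons` is a hypothetical helper, not part of any tool used in this exercise. It joins the exons of a model and checks whether the result matches the cDNA exactly.

```python
def splice_exons(genomic, exons):
    """Join exon segments of a forward-strand gene model.

    exons: list of (start, end) pairs, 1-based and inclusive.
    """
    return "".join(genomic[start - 1:end] for start, end in exons)


# Toy data (invented): two exons separated by an intron
# that starts with GT and ends with AG.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
model = [(1, 6), (19, 24)]   # hypothetical gene model
cdna = "ATGGCCGAATAA"        # hypothetical cDNA sequence

spliced = splice_exons(genomic, model)
model_supported = spliced == cdna
print(spliced, model_supported)
```

Real ESTs contain sequencing errors and are often partial, so in practice a spliced-alignment program is used rather than an exact string comparison.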
Protein sequences can help to identify coding regions, even if they
come from only remotely related species. For determining gene models, the
GeneSeqer program has an exhaustive (slow) algorithm that aligns a protein to
a DNA sequence while allowing for splice site recognition. In a real situation,
BLAST programs would be used first to pick up candidate matches in a database search.
You can use Artemis to assemble your predictions. Use "Open" to load the hsak.fa sequence file. You can download the reformatted output of HMMgene here and load the predicted exons using "Read entry..." in the File menu.
See the Artemis tutorial to learn how to create your own annotation.
Step 6. Gene prediction with expressed sequences using Gene2EST and GeneSeqer
(these no longer seem to work, so we use Spidey instead)
Instead of Gene2EST, BLAST at NCBI can be used to pick up EST
sequences that match your query. Remember to select human ESTs as
the database.
- In the output, do the suggested gene models correspond to your model from Step 5? What are the differences, and how can you explain them?
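One way to make the comparison of two gene models systematic is to treat each model as a set of exon coordinates. The sketch below assumes you have already copied the exon boundaries out of the program outputs by hand; the coordinates shown are invented, and `compare_models` is a hypothetical helper.

```python
def compare_models(predicted, est_based):
    """Return (shared exons, exons only in predicted, exons only in est_based).

    Each model is a list of (start, end) exon coordinates.
    """
    pred, est = set(predicted), set(est_based)
    return sorted(pred & est), sorted(pred - est), sorted(est - pred)


# Invented coordinates, for illustration only.
step5_model = [(100, 250), (400, 520), (700, 910)]
est_model = [(100, 250), (400, 529), (700, 910)]

shared, only_step5, only_est = compare_models(step5_model, est_model)
print("shared:", shared)
print("only in Step 5 model:", only_step5)
print("only in EST-based model:", only_est)
```

Exons that differ at only one boundary, as in the second exon here, usually point to a disagreement about a single splice site rather than a missing exon.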
Step 7. Comparison with annotation in a genome database
Look at the annotated gene for our test sequence, HSAK1.
Note that no cDNA has been sequenced for this gene: the gene structure was inferred from transcript mapping and protein homology.
- Most of the elements of the gene are listed in the feature table.
- Did you get the promoter?
- Did you get the starting methionine? Does it obey Kozak's rules?
- How many amino acids are in the first coding exon?
- If you made any errors in the prediction, can you see where you went wrong?
- There is a problem with the annotation of the first intron's acceptor. Do you think this is:
  - an unusual splice site?
  - an annotation error made by the authors?
- What information is displayed about this region in ENSEMBL? Which of the data used for gene prediction in this exercise can you find in the ENSEMBL display?
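The Kozak and splice-site questions above can be checked by eye, but a small script makes the criteria explicit. The sketch below uses a common simplification of Kozak's rules (a purine at position -3 and a G at position +4 relative to the A of the ATG) and the canonical GT...AG intron boundaries. The function names and test sequences are invented for illustration and are not taken from HSAK1.

```python
def kozak_strength(seq, atg_index):
    """Classify the context of the ATG starting at 0-based atg_index.

    Simplified rule: purine (A or G) at -3 and G at +4.
    Both present -> "strong", one -> "adequate", none -> "weak".
    """
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    hits = (minus3 in ("A", "G")) + (plus4 == "G")
    return ["weak", "adequate", "strong"][hits]


def is_canonical_intron(genomic, intron_start, intron_end):
    """Check for GT at the donor and AG at the acceptor.

    Coordinates are 1-based and inclusive, forward strand.
    """
    intron = genomic[intron_start - 1:intron_end]
    return intron.startswith("GT") and intron.endswith("AG")


# Invented examples.
print(kozak_strength("GCCACCATGGCG", 6))
toy_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
print(is_canonical_intron(toy_genomic, 7, 18))
```

An intron that fails the GT...AG test is not necessarily wrong: a small fraction of real introns use other boundaries such as GC...AG, which is why the question distinguishes an unusual splice site from an annotation error.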