BIO828 Assignment
The following tasks should be completed and submitted before midnight on Friday in Week 11. Your answers/report to the task questions
should be in Word or PDF format, with your Python code as a separate *.py file and your Python script output file(s) as specified in Task 5. Please
pack all your assignment work files into a single zip file and submit it using the submission dropbox provided on BBL. Please get in touch if you
have any queries.
Task 1 – Pubmed search (10%)
Using the key information from the statement below, perform a PubMed search and then answer the questions below:
Although traditional sweeteners such as sugar are carbohydrates, most current research instead is focusing on proteins that have an intrinsically
sweet taste. Because these sweet-tasting proteins are much sweeter than their carbohydrate counterparts, they are, in essence, calorie free,
because so little is used to achieve a sweet taste in food. The most successful example of such a protein is aspartame; however, aspartame is
synthetic and does not occur in nature. Alternate natural protein sources are being investigated, including a sweet-tasting protein called monellin.
a. According to Ogata et al., how much sweeter than ordinary sugar is monellin on both a molar and weight basis?
b. Based on the UniProt entry for monellin chain B from serendipity berry (P02882), what residue (amino acid position), when blocked,
eliminates monellin's sweetness? Please also provide the amino acid if possible.
Task 2 – RNA prediction (10%)
Using the Mfold method provided in week 3, for the following homologous sequences, predict the secondary structure for each of the sequences
and determine the consensus secondary structure for the four sequences.
Sequence A –
UUAAGGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCACCUGCGUUCUGGUCAGUACUGGAGU
GAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCCCU
Sequence B –
GUUACGGCGGUCAAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCAAUGAUACUACCCUU
CCGGGUGGAAAAGUAGGACACCGCCGAACAU
Sequence C –
AUCUGCGGCCAUACCGCGCUGAACGUUCCGCGUCUCGUCCGAUCCGCGCAGACAAGCAUCGCAGGGGCCAGAGAGUAUUGACGUGGG
UGACCAGUCGAGAACACUGUGCUGCCGCAGGU
Sequence D –
AUGUGCGACCAUACCAAGCUGAAAAUACUGCAUCCCGUCUGAUCUGCACAGUCAAGCAGCUUAGGGCCCAGUCAGUAGUGCGGUGGG
GGACCAUGCGCGAACAUUGUGGUGUUGCACUU
Task 3 – Functional Prediction of variants (10%)
Use the variants and programs provided in the table below (websites for the programs were provided in the Week 6 lecture) to complete the table. All these
variants have been associated with Primary Congenital Glaucoma. You will need to perform the PolyPhen analysis first, as it provides a protein
accession number which can be clicked on to link to further information on COL1A1; this information will be needed for some of the other
programs. Please briefly comment on the similarity/difference between the results for each mutation.
Gene/protein    Variant         Polyphen    SIFT    FATHMM (provide score)
COL1A1          p.Met264Leu
COL1A1          p.Ala1083Thr
COL1A1          p.Gly767Ser
COL1A1          p.Gly154Val
Task 4 – Protein sequence alignment (10%)
Search the Pfam database (this can be accessed using the following link: https://pfam.xfam.org/) with the following query sequence. Describe
the matches found, their significance, and the corresponding alignment positions. Discuss the likely function(s) of this query protein as predicted
by this Pfam search.
example.seq
MADLEAVLADVSYLMAMEKSKATPAARASKKILLPEPSIRSVMQKYLEDRGEVTFEKIFSQKLGYLLFRDFCLNHLEEARPLVEFYEEIKKYEKLET
EEERVARSREIFDSYIMKELLACSHPFSKSATEHVQGHLGKKQVPPDLFQPYIEEICQNLRGDVFQKFIESDKFTRFCQWKNVELNIHLTMNDFSV
HRIIGRGGFGEVYGCRKRDTGKMYAMKCLDKKRIKMKQGETLALNERIMLSLVSTGDCPFIVCMSYAFHTPDKLSFILDLMNGGDLHYHLSQHGV
FSEADMRFYAAEIILGLEHMHNRFVVYRDLKPANILLDEHGHVRISDLGLACDFSKKKPHASVGTHGYMAPEVLQKGVAYDSSADWFSLGCMLFK
LLRGHSPFRQHKTKDKHEIDRMTLTMAVELPDSFSPELHSLLEGLLQRDVNRRLGCLGRGAQEVKESPFFRSLDWQMVFLQRYPPPLIPPRGEV
NAADAFDIGSFDEEDTKGIKLLDSDQELYRNFPLTISERWQQEVAETVFDTINAETDRLEARKKAKNKQLGHEEDYALGKDCIMHGYMSKMGNPF
LTQWQRRYFYLFPNRLEWRGEGEAPQSLLTMEEIQSVEETQIKERKCLLLKIRGGKQFILQCDSDPELVQWKKELRDAYREAQQLVQRVPKMKN
KPRSPVVELSKVPLVQRGSANGL
Task 5 – Python coding for data analysis (10%)
The file “DNA-Seqs.fasta.tab” is a tab delimited text file that contains DNA sequences in fasta format. Here are a few lines from the file.
SeqName seqstr
seqsA0001 GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCC
seqsA0002 ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGA
seqsB0003 GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGG
Write a Python program to read in the data from this input file. For each sequence in the file, count the numbers of nucleotides A, C, G and T, calculate
the fraction (proportion) of each nucleotide and the GC content (see the definition at https://en.wikipedia.org/wiki/GC-content) per sequence, and
finally output your summary of the analysis in the following format, saving the result to a disk file as a tab-delimited text file, a CSV (comma-separated
values) file, or an Excel workbook. Please attach your Python code (the .py file) so that it can be tested.
seqName n nA nC nG nT fracA fracC fracG fracT GCcontent
seqsA XXXXXXXXXX ? ? ? 0.18 ? ? ? ?
seqsA XXXXXXXXXX ? ? ? 0.22 ? ? ? ?