09-April-2018
Project (Step 1 out of 2)
BLAST Algorithm Steps:
1. The user enters the DNA sequence that he/she wants to search for in the database (the
database for this project is a .txt file).
Example Output of the program:
Please enter the DNA Sequence that you would like to match:
User entry: ATGCCCGTCATTCC
2. The program then searches the Database for exact or close matches.
3. The first step of the search is to break the user entry into 3-letter words:
Example: ATGCCCGTCATTCC
The program breaks this up into overlapping 3-letter words: ATG, TGC, GCC, CCC, ...
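The word-splitting in step 3 can be sketched as follows. This is a minimal Python illustration, not the project's actual code; the function name `split_into_words` and the `word_size` parameter are assumptions made for the example.

```python
def split_into_words(sequence, word_size=3):
    """Return every overlapping word of length word_size, left to right.

    Hypothetical helper: the name and the word_size parameter are
    illustrative assumptions, not part of the original program.
    """
    last_start = len(sequence) - word_size
    return [sequence[i:i + word_size] for i in range(last_start + 1)]


words = split_into_words("ATGCCCGTCATTCC")
print(words[:4])   # ['ATG', 'TGC', 'GCC', 'CCC']
print(len(words))  # 12
```

A sequence of length n yields n - 2 words of 3 letters, so the 14-letter example entry produces 12 words.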
4. The program then searches the Database (The txt file) for sequences that match the 3
letter words.
DNA Sequence Database:
GTCATGCCCGTCATTCC
GGGGATGCCCGGGGG
TTTTATGCCCGTCGAAG
TAATGCCCGTTTTTTTT
GCCATGCCCGTTACCCC
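A minimal Python sketch of the word search in step 4 is shown here. It assumes the database has already been read into a list of strings (for example, one sequence per line of the .txt file); the function name `find_hooks` and the tuple layout are hypothetical choices for the illustration.

```python
def find_hooks(query, database, word_size=3):
    """Return (sequence_index, query_offset, db_offset) for every word hit.

    Assumption: `database` is a list of DNA strings already loaded from
    the .txt file. The names here are illustrative only.
    """
    hits = []
    for q_off in range(len(query) - word_size + 1):
        word = query[q_off:q_off + word_size]
        for s_idx, seq in enumerate(database):
            d_off = seq.find(word)
            while d_off != -1:  # collect every occurrence, not just the first
                hits.append((s_idx, q_off, d_off))
                d_off = seq.find(word, d_off + 1)
    return hits


database = [
    "GTCATGCCCGTCATTCC",
    "GGGGATGCCCGGGGG",
    "TTTTATGCCCGTCGAAG",
    "TAATGCCCGTTTTTTTT",
    "GCCATGCCCGTTACCCC",
]
hits = find_hooks("ATGCCCGTCATTCC", database)

# Where does the first word (ATG, query offset 0) occur in each sequence?
atg_hits = [(s_idx, d_off) for s_idx, q_off, d_off in hits if q_off == 0]
print(atg_hits)  # [(0, 3), (1, 4), (2, 4), (3, 2), (4, 3)]
```

Every sequence in the example database contains ATG, so each one is a candidate for the extension step.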
5. After finding the “Hook” by matching the 3 letter words to the database, the algorithm
then moves left and right along the DNA sequence of interest.
For example: In the above instance, the first Sequence was of interest since we were
able to find the 3 Letter Word: ATG in the sequence.
GTCATGCCCGTCATTCC
The 3 Letter Words (User Entry)
ATG
TGC
GCC
CCC
6. As it moves left and right, it keeps a score while trying to match the user entry with the
DNA sequence of interest from the database.
For every match, the algorithm adds 1 to the overall score; for every mismatch, it subtracts 1.
In the above example:
User Entry: XXXXXXXXXXATGCCCGTCATTCC
Database (Sequence of Interest): TCATGCCCGTCATTCC
User Entry    Database    Score
-             T           -1
-             C           -1
A             A            1
T             T            1
G             G            1
C             C            1
...           ...         ...
Overall Score: XXXX
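The scoring rule in step 6 could be sketched in Python as below. This is one interpretation, under the assumption (taken from the table above) that a database letter with no user-entry letter opposite it counts as a mismatch; the function name `score_alignment` and its parameters are hypothetical.

```python
def score_alignment(query, db_seq, q_off, d_off):
    """Score query against db_seq with query[q_off] lined up on db_seq[d_off].

    +1 for each matching column, -1 for each mismatch. A column where one
    side has no letter is treated as a mismatch (an assumption based on
    the '-' rows in the table). Names are illustrative only.
    """
    shift = d_off - q_off                    # db index = query index + shift
    start = min(0, -shift)                   # leftmost column, in query coordinates
    end = max(len(query), len(db_seq) - shift)
    score = 0
    for i in range(start, end):
        j = i + shift
        q = query[i] if 0 <= i < len(query) else "-"
        d = db_seq[j] if 0 <= j < len(db_seq) else "-"
        score += 1 if (q == d and q != "-") else -1
    return score


# The hook ATG starts at offset 0 in the user entry and offset 2 in the database sequence.
print(score_alignment("ATGCCCGTCATTCC", "TCATGCCCGTCATTCC", 0, 2))  # 12
```

With these inputs there are 14 matching columns and 2 unmatched database letters, giving 14 - 2 = 12 under this interpretation.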
7. After the algorithm matches the user entry with the sequence of interest it provides an
output to the user that looks something like this.
User Entry: XXXXXXXXXXATGCCCGTCATTCC
||| ||||||||||||
Database (Sequence of Interest): TCATGCCCGTCATTCC
(The program will show lines where the letters match, and won’t show any lines where
there is no match).
8. It will also provide the user with the score calculated in step 6 and the
percentage of letters that matched.
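The output described in steps 7 and 8 might be produced along these lines. This Python sketch assumes the percentage is taken over all aligned columns (including columns where only one sequence has a letter); the document does not fix that definition, and the function name `format_alignment` and the `shift` parameter are hypothetical.

```python
def format_alignment(query, db_seq, shift):
    """Build the three display lines and the match percentage.

    shift is the database index that lines up with query[0].
    Assumption: percentage = matching columns / all columns * 100.
    """
    start = min(0, -shift)
    end = max(len(query), len(db_seq) - shift)
    top, bars, bottom = [], [], []
    matches = 0
    for i in range(start, end):
        j = i + shift
        q = query[i] if 0 <= i < len(query) else " "
        d = db_seq[j] if 0 <= j < len(db_seq) else " "
        same = q == d and q != " "
        matches += same
        top.append(q)
        bars.append("|" if same else " ")  # a line only where the letters match
        bottom.append(d)
    percent = 100.0 * matches / (end - start)
    return "".join(top), "".join(bars), "".join(bottom), percent


top, bars, bottom, percent = format_alignment("ATGCCCGTCATTCC", "TCATGCCCGTCATTCC", 2)
print(top)
print(bars)
print(bottom)
print(f"Match: {percent:.1f}%")  # Match: 87.5%
```

If the percentage should instead be relative to the length of the user entry, only the denominator in the `percent` line would change.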
14-April-2018
Updates: Ashish created the program (Step 1)
Run Sample:
Updates to be made:
1. SCORE VARIABLE: The “maximum count” variable should be changed to “Score”.
2. OUTPUT %: The output should also display the percentage of total match. For example: if 8
letters out of 9 matched, the percentage should be displayed as 88.9% or 89%.
3. USER INPUT: The program should allow the user to input a DNA sequence (of a fixed
length). It's up to you to decide how long a user input the program can accept, but
about 8-10 letters should be ideal.
4. TXT FILE: The program should have a database (.txt file) of stored DNA sequences of
fixed length.
5. COMMENTS: Ideally I would prefer detailed comments so that I can follow through the
code. This is not necessary since we can always discuss the code over Skype. But if you
have the time, I would appreciate it.
6. RUN TIME: The output should display a run time. This is necessary since in the next step,
when we implement parallel computing techniques, we need to compare the speed of
both programs.
7. LINES: The output should display lines where the letters match and no lines where they
don't.
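For the run-time requirement in item 6, one possible approach in Python is to wrap the search in a timer. This is only a sketch: `timed` and `count_word_hits` are hypothetical names, and the stand-in search function is a placeholder for whatever the real program does.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock intended for measuring intervals
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def count_word_hits(word, database):
    """Stand-in for the real search: count sequences containing `word`."""
    return sum(1 for seq in database if word in seq)


database = ["GTCATGCCCGTCATTCC", "GGGGATGCCCGGGGG", "TTTTATGCCCGTCGAAG"]
hits, seconds = timed(count_word_hits, "ATG", database)
print(f"Hits: {hits}")
print(f"Run time: {seconds:.6f} s")
```

Timing the serial and the parallel version through the same wrapper keeps the comparison between the two programs consistent.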
A sample output for reference: Note that some of this wouldn't apply to our program; it is just
to give you an idea of what the output should look like.