Great Deal! Get Instant $10 FREE in Account on First Order + 10% Cashback on Every Order Order Now

Microsoft Word - BLAST Algorithm Steps - Part 2 .docx 09-April-2018 Project (Step 1 out 2) BLAST Algorithm Steps: 1. User enters the DNA Sequence that he/she wants to search in the database (The...

1 answer below »
Microsoft Word - BLAST Algorithm Steps - Part 2 .docx

Project (Step 1 out 2)
BLAST Algorithm Steps:

1. User enters the DNA Sequence that he/she wants to search in the database (The
database for this project would be a txt file).

Example Output of the program:
Please enter the DNA Sequence that you would like to match:

2. The program then searches the Database for exact or close matches.

3. The first step for search is to
eak the user entry into 3 letters:

The program
eaks this up into 3 letter words: ATG, TGC, GCC, CCC,…………………

4. The program then searches the Database (The txt file) for sequences that match the 3
letter words.

DNA Sequence Database:






5. After finding the “Hook” by matching the 3 letter words to the database, the algorithm
then moves left and right along the DNA sequence of interest.

For example: In the above instance, the first Sequence was of interest since we were
able to find the 3 Letter Word: ATG in the sequence.

The 3 Letter Words (User Entry)


6. Once it moves left and right it will keep a score, and try to match the user entry with the
DNA sequence of interest from the database.
With every match, the algorithm adds +1 to the overall score, and -1 for any mismatch.

In the above example:

Database (Sequence of Interest): TCATGCCCGTCATTCC

User Entry Database Score
- T -1
- C -1
A A 1
T T 1
G G 1
C C 1
…….. ……. …………….
Overall Score: XXXX

7. After the algorithm matches the user entry with the sequence of interest it provides an
output to the user that looks something like this.

||| ||||||||||||
Database (Sequence of Interest): TCATGCCCGTCATTCC
(The program will show lines where the letters match, and won’t show any lines where
there is no match).

8. It will also provide the user with the score, that it calculated in the step 6 and a
percentage value of how much matched.


Updates: Ashish created the program (Step 1)

Run Sample:

Updates to be made:

1. SCORE VARIABLE: The “maximum count” variable should be changed to “Score”.
2. OUTPUT %: The output should also display percentage of total match. For example: if 8
letters out of 9 matched, the percentage should be displayed as 88.8% or 89.0%
3. USER INPUT: The program should allow the user to input a DNA sequence. (Of a fixed
length). Its upto you to decide how long of a user input the program can accept. But
about 8-10 letters should be ideal.
4. TXT FILE: The program should have a database (.txt file) with fixed length stored DNA
5. COMMENTS: Ideally I would prefer detailed comments so that I can follow through the
code. This is not necessary since we can always discuss the code over skype. But if you
have the time, I would appreciate it.
6. RUN TIME: The output should display a run time. This is necessary since in the next step
when we implement parallel computing techniques, we need to compare the speed of
oth programs.
7. LINES: The output should display lines where the letters match and no-lines where it

A sample output for reference: Note some of this wouldn’t apply to our program but just want
you to get an idea of how it should look like.
Answered Same Day May 06, 2020


Abr Writing answered on May 12 2020
134 Votes
* BLAST - Search two DNA sequences for locally maximal segment pairs. The basic
* command syntax is
*     BLAST sequence1 sequence2
* where sequence1 and sequence2 name files containing DNA sequences. Lines
* at the beginnings of the files that don't start with 'A', 'C', 'T' or 'G'
* are discarded. Thus a typical sequence file might begin:
*    >BOVHBPP3I Bovine beta-globin psi-3 pseudogene, 5' end.
* If sequence1 and sequence2 are identical file names, then matches are
* computed without "mi
or image" pairs and without the trivial long match.
* The BLAST command permits optional additional arguments, such as X=50,
* that reset certain internal parameters. The available parameters are:
*    M or m gives the score for a match.
*    I or i gives the score for a transition (A <--> G, C <--> T).
*    V or v gives the score for a tranversion (non-transition change).
*    W or w gives the word size.
*    X or x gives the value for terminating word extensions.
*    K or k give the cutoff.
*    (defaults: M = 1, I = -1, V = -1, W = 8, X = 10*M,
*     and K = 5% significance level)
* If W is set to 0, then all locally maximum segment pairs (LMSPs) are
* computed; in this case the value of X is i
* In addition, the word "noalign" requests that no alignments be printed
* (summary statistics for each locally maximal segment pair are reported)
* and the word "stats" requests that statistics concerning the performance
* of BLAST be printed. Thus a typical command is
*    BLAST SEQ1 SEQ2 M=1 I=0 V=-1 noalign
#include #include #include #define max(x,y) ((x > y) ? (x) : (y))
#define min(x,y) ((x < y) ? (x) : (y))
#define SUBSTITUTION_SCORE(x,y) S[x][y]
#define DEFAULT_M    1    /* default score for match *
#define DEFAULT_I    -1    /* default score for transition *
#define DEFAULT_V    -1    /* default score for transversion *
#define DEFAULT_W    8    /* default word size *
#define DEFAULT_SIG    0.05    /* default significance level for setting K *
    *s1, *seq1, *s2, *seq2;
    K,        /* cutoff score *
    M,        /* score for a match *
    I,        /* score for a transition *
    V,        /* score for a transversion *
    W,        /* word length *
    X,        /* cutoff for hit extensions *
    sig99, sig90, sig50,    /* number of MPSs above given significance *
    numMSPs,    /* total number of MSPs recorded *
    numhits,    /* number of word hits *
    param_K, param_lambda;
    *diag_lev,    /* extent of discovered matches on a given diagonal *
    len1, len2;    /* sequence lengths *
typedef struct msp {
    long len, pos1, pos2;
    int score;
    struct msp *next_msp;
    } Msp, *Msp_ptr;
Msp_ptr msp_list, *msp;
main(argc, argv)
int argc;
char *argv[];
    int i, v;
    double x, log(), sqrt(), significance();
    char *ckalloc();
    if (argc < 3)
        fatalf("%s seq1 seq2 [M=] [I=] [V=] [K=] [W=] [X=] [noalign] [stats]",
    M = DEFAULT_M;
    I = DEFAULT_I;
    V = DEFAULT_V;
    W = DEFAULT_W;
    X = -1;
    alignments = 1;
    print_stats = 0;
    while (--argc > 2) {
        if (strcmp(argv[argc],"noalign") == 0) {
            alignments = 0;
        if (strcmp(argv[argc],"stats") == 0) {
            print_stats = 1;
        if (argv[argc][1] != '=')
            fatalf("argument %d has improper form", argc);
        v = atoi(argv[argc] + 2);
        switch (argv[argc][0]) {
            case 'M':
            case 'm':
                M = v;
            case 'I':
            case 'i':
                I = v;
            case 'V':
            case 'v':
                V = v;
            case 'K':
            case 'k':
                K = v;
            case 'W':
            case 'w':
                W = v;
            case 'X':
            case 'x':
                X = v;
                fatal("Options are M, I, V, K, W, X.");
    if (X == -1)
        X = 10*M;
    if (V > I)
        fatal("transversions score higher than transitions?");
    len1 = get_seq(argv[1], &seq1);
    if (strcmp(argv[1],argv[2]) == 0) {
        if (W == 0)
            fatal("file1 = file2 not implemented when W=0");
        seq2 = seq1;
        len2 = len1;
    } else
        len2 = get_seq(argv[2], &seq2);
    s1 = seq1 - 1;    /* so subscripts start with 1 *
    s2 = seq2 - 1;
    diag_lev = (long *)ckalloc((len1+len2+1)*sizeof(long)) + len1;
    /* set up scoring matrix *
    S['A']['A'] = S['C']['C'] = S['G']['G'] = S['T']['T'] = M;
    S['A']['G'] = S['G']['A'] = S['C']['T'] = S['T']['C'] = I;
    S['A']['C'] = S['A']['T'] = S['C']['A'] = S['C']['G'] =
        S['G']['C'] = S['G']['T'] = S['T']['A'] = S['T']['G'] = V;
    if (K == 0) {
        x = log(DEFAULT_SIG)/(-param_K*len1*len2);
        K = 2*log(sqrt(x))/(-param_lambda);
        while (significance(K-1) >= DEFAULT_SIG)
        while (significance(K) < DEFAULT_SIG)
    if (alignments) {
        printf("d {\n\"BLAST output with parameters:\n");
        printf("M = %d, I = %d, V = %d\n", M, I, V);
        if (W > 0)
            printf("W = %d, X = %d\"", W, X);
            printf("W = 0 (computes all LMSPs)\"");
        printf("s {\n \"%s\" 1 %d\n \"%s\" 1 %d\n}\n",
            argv[1], len1, argv[2], len2);
        printf("k {\n \"significance\"\n}\n");
    if (W > 0) {
    } else
    if (!alignments)
        printf("\n\n Seq. 1 Seq. 2 len score signif W X\n");
    /* print the matches for each sequence *
    for (i = 0; i < numMSPs; ++i)
    if (W == 0 && print_stats) {
        printf("\n%3d MSPs above significance level 0.99\n", sig99);
        printf("%3d MSPs above significance level 0.90\n", sig90);
        printf("%3d MSPs above significance level 0.50\n", sig50);
        printf("\nPercent of MSPs missed by BLAST at various");
        printf(" W and significance levels:\n");
        printf(" W: 0.99 0.90 0.50\n");
        for (i = 4; i <= 12; ++i)
            printf("%2d: %4.1f%% %4.1f%% %4.1f%%\n",
        printf("\nPercent of MSPs missed by BLAST at various");
        printf(" X and significance levels:\n");
        printf(" X: 0.99 0.90 0.50\n");
        for (i = 1; i <= 20; ++i)
            printf("%2d: %4.1f%% %4.1f%% %4.1f%%\n",
    if (W > 0 && print_stats) {
        printf(" %s: M = %d, I = %d, V = %d, K = %d, W = %d, X = %d\n",
            argv[0], M, I, V, K, W, X);
        printf(" # of word hits = %d\n", numhits);
        printf(" # of matches found = %d\n", numMSPs);

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here