Microsoft Word - BLAST Algorithm Steps - Part 2 .docx09-April-2018 Project (Step 1 out 2) BLAST...

Question

Microsoft Word - BLAST Algorithm Steps - Part 2 .docx
09-April-2018

Project (Step 1 out 2)
BLAST Algorithm Steps:

1. User enters the DNA Sequence that he/she wants to search in the database (The
database for this project would be a txt file).

Example Output of the program:
Please enter the DNA Sequence that you would like to match:
User entry: ATGCCCGTCATTCC

2. The program then searches the Database for exact or close matches.

3. The first step for search is to
eak the user entry into 3 letters:

Example: ATGCCCGTCATTCC
The program
eaks this up into 3 letter words: ATG, TGC, GCC, CCC,…………………

4. The program then searches the Database (The txt file) for sequences that match the 3
letter words.

DNA Sequence Database:

GTCATGCCCGTCATTCC

GGGGATGCCCGGGGG

TTTTATGCCCGTCGAAG

TAATGCCCGTTTTTTTT

GCCATGCCCGTTACCCC

5. After finding the “Hook” by matching the 3 letter words to the database, the algorithm
then moves left and right along the DNA sequence of interest.

For example: In the above instance, the first Sequence was of interest since we were
able to find the 3 Letter Word: ATG in the sequence.

GTCATGCCCGTCATTCC
The 3 Letter Words (User Entry)

ATG
TGC
GCC
CCC

6. Once it moves left and right it will keep a score, and try to match the user entry with the
DNA sequence of interest from the database.
With every match, the algorithm adds +1 to the overall score, and -1 for any mismatch.

In the above example:

User Entry: XXXXXXXXXXATGCCCGTCATTCC
Database (Sequence of Interest): TCATGCCCGTCATTCC

User Entry Database Score
- T -1
- C -1
A A 1
T T 1
G G 1
C C 1
…….. ……. …………….
Overall Score: XXXX

7. After the algorithm matches the user entry with the sequence of interest it provides an
output to the user that looks something like this.

User Entry: XXXXXXXXXXATGCCCGTCATTCC
||| ||||||||||||
Database (Sequence of Interest): TCATGCCCGTCATTCC
(The program will show lines where the letters match, and won’t show any lines where
there is no match).

8. It will also provide the user with the score, that it calculated in the step 6 and a
percentage value of how much matched.

14-Apri-2018

Updates: Ashish created the program (Step 1)

Run Sample:

Updates to be made:

1. SCORE VARIABLE: The “maximum count” variable should be changed to “Score”.
2. OUTPUT %: The output should also display percentage of total match. For example: if 8
letters out of 9 matched, the percentage should be displayed as 88.8% or 89.0%
3. USER INPUT: The program should allow the user to input a DNA sequence. (Of a fixed
length). Its upto you to decide how long of a user input the program can accept. But
about 8-10 letters should be ideal.
4. TXT FILE: The program should have a database (.txt file) with fixed length stored DNA
sequences.
5. COMMENTS: Ideally I would prefer detailed comments so that I can follow through the
code. This is not necessary since we can always discuss the code over skype. But if you
have the time, I would appreciate it.
6. RUN TIME: The output should display a run time. This is necessary since in the next step
when we implement parallel computing techniques, we need to compare the speed of
oth programs.
7. LINES: The output should display lines where the letters match and no-lines where it
doesn’t.

A sample output for reference: Note some of this wouldn’t apply to our program but just want
you to get an idea of how it should look like.

tae_636612327819332889_117905_1.pdf

Abr Writing · Accepted Answer

/*
* BLAST - Search two DNA sequences for locally maximal segment pairs. The basic
* command syntax is
*
* 	BLAST sequence1 sequence2
*
* where sequence1 and sequence2 name files containing DNA sequences.  Lines
* at the beginnings of the files that don't start with 'A', 'C', 'T' or 'G'
* are discarded.  Thus a typical sequence file might begin:
*
*	>BOVHBPP3I  Bovine beta-globin psi-3 pseudogene, 5' end.
*	GGAGAATAAAGTTTCTGAGTCTAGACACACTGGATCAGCCAATCACAGATGAAGGGCACT
*	GAGGAACAGGAGTGCATCTTACATTCCCCCAAACCAATGAACTTGTATTATGCCCTGGGC
*
* If sequence1 and sequence2 are identical file names, then matches are
* computed without "mirror image" pairs and without the trivial long match.
*
* The BLAST command permits optional additional arguments, such as X=50,
* that reset certain internal parameters.  The available parameters are:
*
*	M or m gives the score for a match.
*	I or i gives the score for a transition (A  G, C  T).
*	V or v gives the score for a tranversion (non-transition change).
*	W or w gives the word size.
*	X or x gives the value for terminating word extensions.
*	K or k give the cutoff.
*	(defaults: M = 1, I = -1, V = -1, W = 8, X = 10*M,
*	 and  K = 5% significance level)
*
* If W is set to 0, then all locally maximum segment pairs (LMSPs) are
* computed; in this case the value of X is irrelevant.
*
* In addition, the word "noalign" requests that no alignments be printed
* (summary statistics for each locally maximal segment pair are reported)
* and the word "stats" requests that statistics concerning the performance
* of BLAST be printed.  Thus a typical command is
*
*	BLAST SEQ1 SEQ2 M=1 I=0 V=-1 noalign
*/
#include 
#include 
#include 
#define max(x,y) ((x > y) ? (x) : (y))
#define min(x,y) ((x  2) {
		if (strcmp(argv[argc],"noalign") == 0) {
			alignments = 0;
			continue;
		}
		if (strcmp(argv[argc],"stats") == 0) {
			print_stats = 1;
			continue;
		}
		if (argv[argc][1] != '=')
			fatalf("argument %d has improper form", argc);
		v = atoi(argv[argc] + 2);
		switch (argv[argc][0]) {
			case 'M':
			case 'm':
				M = v;
				break;
			case 'I':
			case 'i':
				I = v;
				break;
			case 'V':
			case 'v':
				V = v;
				break;
			case 'K':
			case 'k':
				K = v;
				break;
			case 'W':
			case 'w':
				W = v;
				break;
			case 'X':
			case 'x':
				X = v;
				break;
			default:
				fatal("Options are M, I, V, K, W, X.");
		}
	}
	if (X == -1)
		X = 10*M;
	if (V > I)
		fatal("transversions score higher than transitions?");
	len1 = get_seq(argv[1], &seq1);
	if (strcmp(argv[1],argv[2]) == 0) {
		if (W == 0)
			fatal("file1 = file2 not implemented when W=0");
		seq2 = seq1;
		len2 = len1;
	} else
		len2 = get_seq(argv[2], &seq2);
	s1 = seq1 - 1;	/* so subscripts start with 1 */
	s2 = seq2 - 1;
	diag_lev = (long *)ckalloc((len1+len2+1)*sizeof(long)) + len1;
	/* set up scoring matrix */
	S['A']['A'] = S['C']['C'] = S['G']['G'] = S['T']['T'] = M;
	S['A']['G'] = S['G']['A'] = S['C']['T'] = S['T']['C'] = I;
	S['A']['C'] = S['A']['T'] = S['C']['A'] = S['C']['G'] =
		S['G']['C'] = S['G']['T'] = S['T']['A'] = S['T']['G'] = V;
	get_params();
	if (K == 0) {
		x = log(DEFAULT_SIG)/(-param_K*len1*len2);
		K = 2*log(sqrt(x))/(-param_lambda);
		while (significance(K-1) >= DEFAULT_SIG)
			--K;
		while (significance(K)  0)
			printf("W = %d, X = %d"", W, X);
		else
			printf("W = 0 (computes all LMSPs)"");
		printf("
}
");
		printf("s {
  "%s" 1 %d
  "%s" 1 %d
}
",
			argv[1], len1, argv[2], len2);
		printf("k {
  "significance"
}
");
	}
	if (W > 0) {
		bld_table();
		search();
	} else
		get_msps();
	sort_msps();
	if (!alignments)
		printf("

Seq. 1  Seq. 2   len  score  signif    W    X
");
	/* print the matches for each sequence */
	for (i = 0;

Microsoft Word - BLAST Algorithm Steps - Part 2 .docx 09-April-2018 Project (Step 1 out 2) BLAST Algorithm Steps: 1. User enters the DNA Sequence that he/she wants to search in the database (The...

Solution

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment