Comp 364 - Homework 3

Assigned Mar. 7
Due Mar. 14
Late Mar. 17
Turn in electronically to perkins@mcb.mcgill.ca

Please turn in by email to the professor, preferably by attaching a single .rtf, .doc, or .txt file containing all of your answers. Please include your name and HW3 as part of the file name (for example: TedPerkinsHW3.doc).

Note: This homework introduces several new concepts, particularly reading input from a file and writing output to a file, that were not covered in class. Question 2 explains these concepts and asks you to do a simple task employing them. Question 3 also requires you to use these concepts, so you should do Question 2 before Question 3. You may also wish to refer to Chapter 6 of the text, which discusses file I/O (input and output) at greater length.

Counting occurrences and reporting the most-frequent items (10 points)
Suppose you have an array @A of strings. This array may have duplicate elements, or it may not. Write a program, or if you prefer, two separate programs, to do the following tasks:
(a) Print out a table of the unique strings appearing in @A, along with the number of times each appears.
(b) Create an array that includes only the string or strings that occur the greatest number of times, and print out that array.
For example, if @A=(ted,ann,george,frank,ann,george,tom,george), then a sample run of your program might look like:
```
[~/HW3] perl Q1.pl
Here are the counts of each string:
ted	1
ann	2
george	3
frank	1
tom	1
The most frequent string(s): george
[~/HW3]
```
Or, if @A=(apple,banana,apple,grape,banana), a sample run of your program might look like:
```
[~/HW3] perl Q1.pl
Here are the counts of each string:
apple	2
banana	2
grape	1
The most frequent string(s): apple banana
[~/HW3]
```
You do not need to worry about the order in which the table or the most common words are printed out. Include in your answer to this question your Perl code, and sample runs on a few different arrays of your choice. (You could use the ones demonstrated above, or different ones.)
Sorting the strings in a disk-file and saving them (10 points)
We have seen the use of "<>" to read input from the keyboard while a Perl program is running, and we have seen the use of "print" to print to the screen. These operations, reading and writing (or input and output), can also be performed on disk-files. To read from a disk-file, you must first "open" that file, like this:
```
open INFILE, "Input.txt";
```
The "open" command is opening the file for input. The file it tries to open is called "Input.txt". The "INFILE" part of the line above is called a "file handle", and it is how you refer to the file after it is open. You can use any name you want; it doesn't have to be INFILE. However, it is traditional to use all uppercase letters. For example, consider the program below.
```
open INFILE, "Input.txt";
$Line1 = <INFILE>;
$Line2 = <INFILE>;
```
This program opens the file "Input.txt" and reads the first two lines from that file. Notice that we have put the file handle, "INFILE", between the "<>" that we used for taking input from the keyboard. The file "Input.txt" should be in the same directory as you are in when running the program. If not, then the open command will not be successful. If you are running perl with the "-w" switch, as below, then Perl will warn you if you try to read from a file that wasn't opened correctly:
```
[~/HW3] perl -w ReadTwoLines.pl
Name "main::Line2" used only once: possible typo at ReadTwoLines.pl line 3.
Name "main::Line1" used only once: possible typo at ReadTwoLines.pl line 2.
readline() on closed filehandle INFILE at ReadTwoLines.pl line 2.
readline() on closed filehandle INFILE at ReadTwoLines.pl line 3.
```
The two "readline()" messages above report the error of trying to read from a file that did not open successfully. A better way, however, is for your program itself to check that the file opened. You can do this because the "open" command returns true or false, depending on whether it succeeded in opening the file or not.
```
$Result = open INFILE, "Input.txt";
```
If $Result is true, then the file was opened, and if $Result is false, then the file was not opened.
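As a minimal sketch of such a check, the program below tries to open a file and reports what happened. The file name "NoSuchFile.txt" and the variable $Status are made up for this illustration, and the sketch assumes no file of that name exists in your directory. The built-in variable $! is not discussed above; Perl sets it to the system's explanation of why the last operation failed.

```perl
# "NoSuchFile.txt" is a hypothetical name; we assume it does not exist.
$Result = open INFILE, "NoSuchFile.txt";

if ($Result) {
    # The open succeeded, so it is safe to read from INFILE.
    $Line1 = <INFILE>;
    print "First line: $Line1";
    close INFILE;
    $Status = "opened";
} else {
    # The open failed; $! says why (for example, that the file was not found).
    print "Could not open NoSuchFile.txt: $!\n";
    $Status = "failed";
}
```

With a check like this, the program never tries to read from a file handle that was not opened.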
While you can read the contents of a file line by line, it suffices for this homework to read all the lines in at once. That can be done like this:
```
open INFILE, "Input.txt";
@Lines = <INFILE>;
```
This program opens the file "Input.txt" and reads all the lines of the file into an array called @Lines. Each line of the file will be one element of the array. Because Perl arrays are indexed starting from 0, the first line is in $Lines[0], the second line in $Lines[1], and so on.
You can also open a file and write to it using the print command. To do this, the open command looks only slightly different -- the name of the file should be preceded by the ">" character.
```
open OUTFILE, ">Output.txt";
```
This opens the file "Output.txt" for writing. If the file does not already exist, it is created. If it does exist, its previous contents are erased. As an aside, if you do:
```
open OUTFILE, ">>Output.txt";
```
then the file is also opened for writing. However, if the file already exists, anything you write to the file is added to the end of what is already there; the old contents are not erased. (You will not need this feature for this homework, but it is good to know about.)
To write to an output file, just use the print command as you would for making output to the screen. However, list the file handle right after the "print" as follows.
```
open OUTFILE, ">Output.txt";
print OUTFILE "Hello world!\n";
@A = ("ted","tom","ann");
print OUTFILE "@A\n";
```
This program opens the file "Output.txt", writes "Hello world!" on one line, and writes out an array of names on the next line. It is generally good practice to close files after you are done with them, as follows:
```
close INFILE;
close OUTFILE;
```
However, nothing terrible will happen if you do not close the files. They will be closed automatically when your program ends, if you do not do it yourself.
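The pieces above can be put together in one short program, sketched below. It writes a file, appends to it, and then reads it back. The file name "Demo.txt" and its contents are invented for this illustration. Note that in this particular sketch the close commands do matter: output is typically held in memory until the file is closed, so the file should be closed before the same program opens it again.

```perl
# "Demo.txt" is a hypothetical file name used only for this illustration.

# ">" creates the file, or erases it if it already exists.
open OUTFILE, ">Demo.txt";
print OUTFILE "alpha\n";
print OUTFILE "beta\n";
close OUTFILE;

# ">>" keeps the two lines above and adds to the end.
open OUTFILE, ">>Demo.txt";
print OUTFILE "gamma\n";
close OUTFILE;

# Read every line of the file back into an array.
open INFILE, "Demo.txt";
@Lines = <INFILE>;
close INFILE;

# scalar(@Lines) gives the number of elements; indexing starts at 0.
# Each element still ends with the newline that was in the file.
print "The file has ", scalar(@Lines), " lines\n";
print "First line: $Lines[0]";
print "Last line: $Lines[2]";
```

If the second open had used ">" instead of ">>", the file would contain only the line "gamma".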
Now, what I actually want you to do for this question! Create a file that has a bunch of strings on separate lines. You could list the names of all your friends, or cities in Canada, or favorite songs... whatever. Write a program that reads in all the lines of that file, and sorts the strings ASCII-betically. Then, open an output file, and write the strings in sorted order on a single line, separated by commas. For example, given an input file with the contents below:
```
jerry seinfeld
elaine benes
george costanza
cosmo kramer
```
Your program should produce an output file with the contents:
```
cosmo kramer,elaine benes,george costanza,jerry seinfeld
```
Note that when you take input from the file, the lines will have the newline "\n" at the end. You will need to "chomp" these off before outputting your results. Turn in your code, and show the input file and resulting output file.
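To see what "chomp" does, here is a small sketch that does not read a file at all. The strings below are made up and simply stand in for lines as they would look just after being read, each with its newline still attached.

```perl
# A single string that ends with a newline, as a line read from a file would.
$Line = "Montreal\n";
$Before = length($Line);   # counts the eight letters plus the newline
chomp($Line);              # removes the trailing newline, if there is one
$After = length($Line);    # now counts only the eight letters
print "Length before chomp: $Before, after chomp: $After\n";

# chomp also accepts a whole array and strips the newline from every element.
@Cities = ("Halifax\n", "Toronto\n", "Regina\n");
chomp(@Cities);
print "The cities are: @Cities\n";
```

Calling chomp on a string that has no trailing newline leaves it unchanged, so it is safe to apply even if you are unsure.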
Integrating two sources of data (10 points)
Professor X has two students doing lab work to detect genes related to yeast metabolism. Each student generates a list of genes, which can be found as HW3Q3List1.txt and HW3Q3List2.txt. The two lists are different, and to prioritize his future research, Professor X wants to generate two lists: a list of genes that occurred in both students' lists, and a list of genes that occurred in the list of one student or the other but not both. Write a Perl program that reads the files above as input and creates two output files containing the two lists that Professor X wants. In the output files, please print the genes in alphabetically sorted order (for easy correction). Turn in your code and the two output files you produce. You may assume (and it is true) that each input file has no duplicate gene names within it.
GC / AT content of a gene (10 points)
Professor Y has a theory that the exons of genes tend to be "GC-rich". That is, they have significantly more G's or C's than A's or T's. Conversely, Professor Y believes that introns are "AT-rich". Consider the coding region for the gene given in the file DNA.txt. In this file, capital A's, C's, G's or T's refer to exonic DNA and lower case a's, c's, g's or t's refer to intronic DNA. Write a Perl program that reads in the "DNA.txt" file and then computes and prints out four numbers: (1) the number of G's and C's in exons, (2) the number of A's and T's in exons, (3) the number of g's and c's in introns, (4) the number of a's and t's in introns. Turn in your code, the output of your program when run, and comment on whether this gene conforms to Professor Y's theory.
Looking for an amino acid pattern (10 points)
Professor Z is interested in proteins that contain a particular amino acid pattern: an Aspartic acid, followed by a Glutamic acid, followed by three to five Glutamines, followed by a Proline. Write a Perl program that (1) reads in the "DNA.txt" file, (2) extracts the exonic sequences from the DNA sequence, and (3) tests whether or not Professor Z's amino acid pattern occurs in any of the exons. For each exon, the program should print out whether or not the pattern occurs, and if it does occur, the program should print out the exact DNA sequence that makes the match. You may find it useful to know that Aspartic acid corresponds to the DNA codons GAT or GAC, Glutamic acid corresponds to GAA or GAG, Glutamine corresponds to CAA or CAG, and Proline corresponds to CCT, CCC, CCA or CCG. Turn in your code and the output of your program when run.