Comp 364 - Winter 2008 - Homework 3 Sample Solutions

======================================================================

1. As with any programming problem, many different solutions are
possible. One is below:

----------------------------------------------------------------------
# Here are the names in our array.
@A=(ted,ann,george,frank,ann,george,tom,george);

# First, we will count how many times each name occurs,
# and store the results in a hash.
%H = ();
foreach $Name (@A) {
    # Find out if $Name is already in the hash.
    $Count = $H{ $Name };

    # If so, increase the count by one, and update the hash.
    if ($Count) {
        $H{ $Name } = $Count+1;
    }
    else {
        # Otherwise, we add $Name to the hash, with a count of 1.
        $H{ $Name } = 1;
    }
}

# Now, we can print out the entries of the hash, along with
# their counts.
print "Here are the counts of each string:\n";
@Keys = keys %H;
foreach $Key (@Keys) {
    print $Key, "\t", $H{ $Key }, "\n";
}

# Next, we need to find out which entry or entries occur most often.
# We will do this in two steps. First, we find out what the highest
# number of occurrences was.
$MaxOccur = 0;
@Values = values %H;
for $Val (@Values) {
    if ($Val > $MaxOccur) {
        $MaxOccur = $Val;
    }
}

# Then, we find the strings that occurred $MaxOccur number of times.
print "The most frequent string(s):";
foreach $Key (@Keys) {
    if ($H{ $Key } == $MaxOccur) {
        print " ", $Key;
    }
}
print "\n";
----------------------------------------------------------------------

Here is a sample run:

[perkins] perl HW3Q1.pl
Here are the counts of each string:
tom     1
frank   1
ann     2
ted     1
george  3
The most frequent string(s): george
[perkins]

======================================================================

2. The program below is one straightforward solution.

----------------------------------------------------------------------
# First, we open the input file, and read in all the lines.
open INFILE, "Input.txt";
@Lines = <INFILE>;
close INFILE;

# Then, we sort the lines.
@Lines = sort @Lines;

# Next, we chomp the lines, to get rid of the newline characters.
# It turns out that one can chomp all the strings in an array
# simply by chomping the array.
chomp @Lines;

# We join the chomped lines together with commas.
$Str = join(',',@Lines);

# Open the output file, print the output, and close the file.
open OUTFILE, ">Output.txt";
print OUTFILE $Str;
close OUTFILE;
----------------------------------------------------------------------

When run, this program doesn't write any output to the screen, but it
writes the correct output to the file "Output.txt". For example, when
run on an input file containing:

jerry seinfeld
elaine benes
george costanza
cosmo kramer

it writes to the output file:

cosmo kramer,elaine benes,george costanza,jerry seinfeld

======================================================================

3. I expect a lot of variety in the solutions to this problem. One
answer is below.

----------------------------------------------------------------------
# First, we read in the two files.
open IN1, "HW3Q3List1.txt";
open IN2, "HW3Q3List2.txt";
@List1 = <IN1>;
@List2 = <IN2>;

# The lists below will be used to store the gene names that
# occur on only one list (Occ1) or that occur on both lists (Occ2).
@Occ1 = ();
@Occ2 = ();

# First, we look through the genes on @List1. For each gene, if it
# also occurs on list @List2, we should add it to @Occ2. Otherwise,
# we should add it to @Occ1.
foreach $G1 (@List1) {
    $InList2 = 0;
    foreach $G2 (@List2) {
        if ($G1 eq $G2) {
            $InList2 = 1;
        }
    }
    if ($InList2) {
        push @Occ2, $G1;
    }
    else {
        push @Occ1, $G1;
    }
}

# Next, we look at genes in list @List2. Each one that does not
# occur on @List1 should be added to @Occ1. If it does occur
# on list @List1, then we should have already added it to @Occ2,
# so we need do nothing now.
foreach $G2 (@List2) {
    $InList1 = 0;
    foreach $G1 (@List1) {
        if ($G1 eq $G2) {
            $InList1 = 1;
        }
    }
    if (!
        $InList1) {
        push @Occ1, $G2;
    }
}

# Sort the lists
@Occ1 = sort @Occ1;
@Occ2 = sort @Occ2;

# Print them to output files.
open OUT1, ">HW3Q3Occ1.txt";
open OUT2, ">HW3Q3Occ2.txt";
print OUT1 @Occ1;
print OUT2 @Occ2;
----------------------------------------------------------------------

This program produces no output on the screen when run, but it sends
the correct output to each output file.

======================================================================

4. Here is one Perl program that does the job.

----------------------------------------------------------------------
# First, we need to read in the DNA
open INFILE, "DNA.txt";
$DNA = <INFILE>;

# A simple way to count the number of times different
# types of letters occur is to perform a global match
# and check the length of the list.
@CGMatches = $DNA =~ /[CG]/g;
$CGCount = $#CGMatches + 1;
print "Exonic C's and G's: $CGCount\n";

@ATMatches = $DNA =~ /[AT]/g;
$ATCount = $#ATMatches + 1;
print "Exonic A's and T's: $ATCount\n";

@cgMatches = $DNA =~ /[cg]/g;
$cgCount = $#cgMatches + 1;
print "Intronic c's and g's: $cgCount\n";

@atMatches = $DNA =~ /[at]/g;
$atCount = $#atMatches + 1;
print "Intronic a's and t's: $atCount\n";
----------------------------------------------------------------------

The output of this program is shown below.

[perkins] perl HW3Q4.pl
Exonic C's and G's: 1791
Exonic A's and T's: 1707
Intronic c's and g's: 1188
Intronic a's and t's: 1816
[perkins]

Professor Y's theory is partly true for this gene. The exons are not
particularly rich in C's and G's, having only slightly more of those
(1791) than of A's and T's (1707). However, the introns are heavy with
a's and t's (1816, against 1188 c's and g's).

======================================================================

5. Here is one sample solution.

----------------------------------------------------------------------
# First, we need to read in the DNA
open INFILE, "DNA.txt";
$DNA = <INFILE>;

# We separate out the exons
@Exons = split(/[acgt]+/,$DNA);

# We check each exon for the pattern.
$NExons = @Exons;
for ($Count=0; $Count<$NExons; $Count++) {
    $Exon = $Exons[ $Count ];

    # First, we insert a # symbol between every codon, to remember where
    # the frame is.
    @Codons = $Exon =~ /.../g;
    $ExonWithFrame = join('#',@Codons);

    # Then, we search for the desired pattern.
    if ($ExonWithFrame =~ /GA[CT]#GA[AG]#(CA[AG]#){3,5}CC[ACGT]/) {
        # If found, report it, after removing the '#' signs.
        $Match = $&;
        @MatchCodons = split('#',$Match);
        $Match = join('',@MatchCodons);
        print "Match found in Exon $Count: $Match.\n";
    }
    else {
        # Otherwise, report that no match was found.
        print "No match found in Exon $Count.\n";
    }
}
----------------------------------------------------------------------

When run, this program outputs:

[perkins] perl HW3Q5.pl
No match found in Exon 0.
No match found in Exon 1.
Match found in Exon 2: GATGAGCAGCAGCAACAACCG.
[perkins]

Your output need not look like this. In particular, it's okay to
leave a frameshift-marking character (like #) between the codons.
It's also okay (though not preferable) if your match includes the
whole exon up to the part that matches the pattern of interest to
Professor Z. This may happen if you used (...)* in a search pattern
to keep the match in frame.

======================================================================
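
A note on the (...)* approach mentioned for question 5: the sketch
below shows one possible way to keep the search in frame with a
(...)* style prefix while still reporting only the pattern of
interest. It is an illustration, not the solution used above. The
subroutine name find_in_frame and both example exons are made up for
this sketch and are not taken from DNA.txt. The idea is to anchor the
search at the start of the exon, skip whole codons with a
non-capturing, non-greedy (?:...)*?, and put capturing parentheses
around the pattern itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the sample solution): returns the
# in-frame match of the question 5 pattern in $exon, or undef if none.
sub find_in_frame {
    my ($exon) = @_;

    # ^(?:...)*? skips zero or more whole codons from the start of the
    # exon, so the captured part always begins on a codon boundary.
    # Only the parenthesized pattern ends up in $1.
    if ($exon =~ /^(?:...)*?(GA[CT]GA[AG](?:CA[AG]){3,5}CC[ACGT])/) {
        return $1;
    }
    return;
}

# Made-up exons for illustration. The second is the first shifted
# by one base, so the same letters are no longer in frame.
my $InFrame  = "ATGGCCGATGAGCAGCAGCAACAACCGTAA";
my $OutFrame = "C" . $InFrame;

foreach my $Exon ($InFrame, $OutFrame) {
    my $Match = find_in_frame($Exon);
    if (defined $Match) {
        print "Match found: $Match.\n";
    }
    else {
        print "No match found.\n";
    }
}
```

With these two made-up exons, the first reports the match
GATGAGCAGCAGCAACAACCG and the second reports no match, because the
shifted copy of the pattern no longer starts on a codon boundary.
Using $& instead of $1 here would return the skipped codons as well,
which is the "whole exon up to the match" behaviour described above.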