Updated July 19, 2008
NAME
BLExtractSubset.py
- Extract a subset of sequences from a GDE flatfile
SYNOPSIS
python BLExtractSubset.py namefile infile outfile
DESCRIPTION
This script reads infile, which contains sequences in GDE flatfile
format, and writes the subset of those sequences listed in namefile
to outfile.
NAMEFILE
The namefile consists of a list of
names for sequences in a GDE flatfile, with one name per line.
Example:
A30238
A31075
A34313
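As an illustration only, a namefile in this layout could be read with a few lines of Python. This is a minimal sketch, not the code in BLExtractSubset.py: the function name read_names is hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions that the description above does not state.

```python
import io


def read_names(handle):
    """Return the names found in a namefile-like stream, one per line, in order."""
    names = []
    for line in handle:
        name = line.strip()
        if name:  # assumption: blank lines are ignored
            names.append(name)
    return names


# Demonstration with an in-memory stand-in for the namefile shown above.
example = io.StringIO("A30238\nA31075\nA34313\n")
print(read_names(example))
```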
INPUT
The input file is a GDE flatfile. Each entry consists of a name line
followed by one or more lines holding the sequence or text itself. The
type of each entry is indicated by the GDE flag character that begins
its name line: # for DNA or RNA, % for protein, or " for text.
Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A30839
NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM
VKGYPPA
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
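The entry layout described above can be sketched as a small parser. This is a hedged illustration rather than the script's actual code: the name parse_gde_flatfile and the FLAGS table are hypothetical, and the sketch assumes that any line beginning with #, % or " starts a new entry, which may not hold for text entries whose body lines begin with one of those characters.

```python
import io

# Flag characters as listed in the INPUT section.
FLAGS = {"#": "DNA/RNA", "%": "protein", '"': "text"}


def parse_gde_flatfile(handle):
    """Yield (flag, name, lines) for each entry in a GDE flatfile stream."""
    flag = name = None
    lines = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line[:1] and line[0] in FLAGS:
            # A flag character in column 1 starts a new entry (assumption).
            if name is not None:
                yield flag, name, lines
            flag, name, lines = line[0], line[1:].strip(), []
        elif name is not None:
            lines.append(line)
    if name is not None:
        yield flag, name, lines


# Demonstration with a shortened, made-up two-entry flatfile.
demo = io.StringIO("%A30238\nMKSAILTG\nIVGKQVNR\n%A30839\nNQASVVAN\n")
for flag, name, body in parse_gde_flatfile(demo):
    print(FLAGS[flag], name, len("".join(body)))
```

Keeping the body as a list of lines, instead of joining it, preserves the original line breaks in case the entry is written back out.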
OUTPUT
The output is a GDE flatfile containing only the sequences listed in
namefile. If two or more sequences in the input have the same name,
only the first is written, and an error message is written to the
standard output.
Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
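The selection rule, including the handling of repeated names, can be sketched as a single pass over the input. This is a sketch under stated assumptions and not the author's implementation: the function name extract_subset, the wording of the message, and the choice to emit entries in input-file order are all hypothetical, since the description does not specify them.

```python
import io
import sys

FLAG_CHARS = ("#", "%", '"')  # GDE flag characters that begin a name line


def extract_subset(names, infile, outfile, log=sys.stdout):
    """Copy entries whose names are in `names` from infile to outfile.

    Only the first entry with a given name is copied; later entries with
    the same name are skipped and reported on `log`.
    """
    wanted = set(names)
    written = set()
    copying = False
    for line in infile:
        if line.startswith(FLAG_CHARS):
            name = line[1:].strip()
            if name in wanted and name not in written:
                written.add(name)
                copying = True
            else:
                if name in written:
                    # Hypothetical message text; the real script's wording is unknown.
                    print(f"Duplicate name skipped: {name}", file=log)
                copying = False
        if copying:
            outfile.write(line)
    return written


# Demonstration: the second A30238 entry is skipped and reported.
src = io.StringIO("%A30238\nMKSA\n%A30839\nNQAS\n%A30238\nXXXX\n%A31075\nMKSV\n")
dst = io.StringIO()
extract_subset(["A30238", "A31075"], src, dst)
print(dst.getvalue(), end="")
```

Streaming line by line avoids holding the whole flatfile in memory, at the cost of writing entries in the order they occur in the input rather than the order of the namefile.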
NOTES
1. This script is used by GDE for Edit --> Extract subset.
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist