                                  seqret

Function

   Reads and writes (returns) sequences

Description

   The simplicity of the above description of this program greatly
   understates the rich functionality of this program.

   Because EMBOSS programs can take a wide range of qualifiers that
   slightly change the behaviour of the program when reading or writing a
   sequence, this program can do many more things than simply "read and
   write a sequence".

   seqret can read a sequence or many sequences from databases, files,
   files of sequence names, the command-line or the output of other
   programs and then can write them to files, the screen or pass them to
   other programs. Because it can read in a sequence from a database and
   write it to a file, seqret is a program for extracting sequences from
   databases. Because it can write the sequence to the screen, seqret is
   a program for displaying sequences.

   seqret can read sequences in any of a wide range of standard sequence
   formats. You can specify the input and output formats being used. If
   you don't specify the input format, seqret will try a set of possible
   formats until it reads it in successfully. Because you can specify the
   output sequence format, seqret is a program to reformat a sequence.

   seqret can read in the reverse complement of a nucleic acid sequence.
   It therefore is a program for producing the reverse complement of a
   sequence.

   seqret can read in a sequence whose begin and end positions you have
   specified and write out that fragment. It is therefore a utility for
   doing simple extraction of a region of a sequence.

   seqret can change the case of the sequence being read in to upper or
   to lower case. It is therefore a simple sequence beautification
   utility.

   seqret can do any combination of the above functions.

   The sequence input and output specification of this (and many other
   EMBOSS programs) is described as being a Uniform Sequence Address.
   The Uniform Sequence Address, or USA, is a somewhat tongue-in-cheek
   reference to a URL-style sequence naming used by all EMBOSS
   applications.

   The USA is a very flexible way of specifying one or more sequences
   from a variety of sources and includes sequence files, database
   queries and external applications.

   See the full specification of USA syntax at:
   http://emboss.sourceforge.net/docs/themes/UniformSequenceAddress.html

   The basic USA syntax is one of:
     * "file"
     * "file:entry"
     * "format::file"
     * "format::file:entry"
     * "database:entry"
     * "database"
     * "@file"

   Note that ':' separates the name of a file containing many possible
   entries from the specific name of a sequence entry in that file. It
   also separates the name of a database from an entry in that database.

   Note also that '::' separates the specified format of a file from the
   name of the file. Normally the format can be omitted, in which case
   the program will attempt to identify the correct format when reading
   the sequence in and will default to using FASTA format when writing
   the sequence out.

   Valid names of the databases set up in your local implementation of
   EMBOSS can be seen by using the program 'showdb'.

   Database queries, and individual entries in files that have more than
   one sequence entry, use wildcards of "?" for any character and "*" for
   any string of characters. There are some problems with the Unix shell
   catching these characters so they do need to be hidden in quotes or
   preceded by a backslash on the Unix command line (for example
   "embl:hs\*").

   The output USA name 'stdout' is special. It makes the output go to the
   device 'standard output'. This is the screen, by default.
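
   The syntax rules above can be sketched as a small parser. This is an
   illustration only, not how EMBOSS itself resolves a USA: the USA class,
   the parse_usa function and the 'databases' argument are hypothetical
   names invented here. Because "file:entry" and "database:entry" look
   the same, the sketch assumes the caller passes in the local database
   names (the ones 'showdb' would list) and otherwise treats the name as
   a file. The 'command |' and 'dbname-field:entry' forms shown under
   "Example USAs" are handled in the simplest possible way.

```python
from dataclasses import dataclass
from typing import Collection, Optional


@dataclass
class USA:
    """One parsed Uniform Sequence Address (hypothetical helper type)."""
    kind: str                     # "file", "database", "list", "asis" or "program"
    source: str                   # file name, database name, sequence or command
    entry: Optional[str] = None   # entry name or wildcard pattern, if given
    fmt: Optional[str] = None     # explicit format given before '::', if any
    field: Optional[str] = None   # search field such as "acc" or "id", if given


def parse_usa(usa: str, databases: Collection[str] = ()) -> USA:
    """Split a USA string into its parts.

    `databases` holds lower-case names of locally defined databases
    (an assumption of this sketch); without it "name:entry" is read
    as a file name followed by an entry name.
    """
    text = usa.strip()
    if text.endswith("|"):                  # 'command |' runs an external program
        return USA("program", text[:-1].strip())
    if text.startswith("@"):                # '@file' is a list file of USAs
        return USA("list", text[1:])

    fmt = None
    if "::" in text:                        # '::' separates format from file name
        fmt, text = text.split("::", 1)
        if fmt == "list":                   # 'list::file' is the same as '@file'
            return USA("list", text)
        if fmt == "asis":                   # the "name" is the sequence itself
            return USA("asis", text, fmt=fmt)

    source, _, entry = text.partition(":")  # ':' separates file/database from entry
    base, _, field = source.partition("-")  # e.g. 'embl-acc' -> 'embl', 'acc'
    if fmt is None and base.lower() in databases:
        return USA("database", base, entry or None, field=field or None)
    return USA("file", source, entry or None, fmt)


if __name__ == "__main__":
    for example in ("fasta::xxx.seq", "embl-acc:X13776", "@mylist"):
        print(parse_usa(example, databases={"embl"}))
```

   Passing the database names in explicitly keeps the sketch free of any
   dependency on a local EMBOSS installation. Wildcards are left
   untouched in the 'entry' field, since expanding them is the job of
   whatever reads the file or database.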

Example USAs

   The following are valid USAs for sequences:

   USA                           Description
   xxx.seq                       A sequence file "xxx.seq" in any format
   fasta::xxx.seq                A sequence file "xxx.seq" in fasta format
   gcg::egmsmg.gcg               A sequence file "egmsmg.gcg" in GCG 9 format
   egmsmg.gcg -sformat=gcg       A sequence file "egmsmg.gcg" in GCG 9 format
   embl::paamir.em               A sequence file "paamir.em" in EMBL format
   embl:paamir                   EMBL entry PAAMIR, using whatever access
                                 method is defined locally for the EMBL
                                 database
   embl:X13776                   EMBL entry X13776, using whatever access
                                 method is defined locally for the EMBL
                                 database and searching by accession number
                                 and entry name (X13776 is the accession
                                 number in this case)
   embl-acc:X13776               EMBL entry X13776, using whatever access
                                 method is defined locally for the EMBL
                                 database and searching by accession number
                                 only
   embl-id:paamir                EMBL entry PAAMIR, using whatever access
                                 method is defined locally for the EMBL
                                 database, and searching by ID only
   embl:paami*                   EMBL entries PAAMIB, PAAMIE and so on,
                                 usually in alphabetical order, using
                                 whatever access method is defined locally
                                 for the EMBL database
   embl or EMBL:*                All sequences in the EMBL database
   @mylist                       Reads file mylist and uses each line as a
                                 separate USA. This is standard VMS list
                                 file syntax, also used in SRS 4.0 but
                                 missing in SRS 5.0. The list file is a list
                                 of USAs (one per line). List files can
                                 contain references to other list files or
                                 any other standard USA.
   list::mylist                  Same as "@mylist" above
   'getz -e [embl-id:paamir] |'  The pipe character "|" causes EMBOSS to
                                 fire up getz (SRS 5.1) to extract entry
                                 PAAMIR from EMBL in EMBL format. Any
                                 application or script which writes one or
                                 more sequences to stdout can be used in
                                 this way.
   asis::atacgcagttatctgaccat    So far the shortest USA we could invent. In
                                 'asis' format the name is the sequence so
                                 no file needs to be opened. This is a
                                 special case. It was intended as a joke,
                                 but could be quite useful for generating
                                 command lines.

Input sequence formats

   To date, the following sequence formats are accepted as input.

   By default (i.e. if no format is explicitly specified), EMBOSS tries
   each format in turn until one succeeds.

   Input Format          Comments
   gcg                   GCG 9.x and 10.x format with the format and sequence
                         type identified on the first line of the file
   gcg8                  GCG 8.x format where anything up to the first line
                         containing ".." is considered as heading, and the
                         remainder is sequence data. This format is
                         complicated by the header appearing to be in other
                         formats such as EMBL, and by the possibility of
                         reading a large amount of data in the wrong format
                         before discovering that there is no ".." line
                         because it is not GCG format after all.
   embl, em              EMBL entry format, or at least a minimal subset of
                         the fields. The Staden package and others use EMBL
                         or similar formats for sequence data.
   swiss, sw             SWISSPROT entry format, or at least a minimal subset
                         of the fields.
   fasta, pearson        FASTA format with an optional accession number after
                         the sequence identifier, e.g.:
                           >name description
                         or
                           >name accession description
                         and with an optional database name in GCG style
                         fasta format included as part of the sequence
                         identifier, e.g.:
                           >database:name accession description
   ncbi                  FASTA format with optional accession number and
                         database name in NCBI style included as part of the
                         sequence identifier, e.g.:
                           >database|accession|id description
                         (and other variants on this theme!)
   genbank, gb           GENBANK entry format, or at least a minimal subset
                         of the fields.
   nbrf, pir             NBRF (PIR) format, as used in the PIR database
                         sequence files.
   codata                CODATA format.
   strider               DNA strider format
   clustal, aln          ClustalW ALN (multiple alignment) format.
   phylip                PHYLIP interleaved multiple alignment format.
   acedb                 ACeDB format
   msf                   Wisconsin Package GCG's MSF multiple sequence format.
   hennig86              Hennig86 format
   jackknifer            Jackknifer format
   jackknifernon         Jackknifernon format
   nexus, paup           Nexus/PAUP format
   nexusnon, paupnon     Nexusnon/PAUPnon format
   treecon               Treecon format
   mega                  Mega format
   meganon               Meganon format
   ig                    IntelliGenetics format.
   staden, experiment    The experiment file format used by the "gap" program
                         in the Staden package, where the sequence identifier
                         is optional and the remainder is plain text. Some
                         alternative nucleotide ambiguity codes are used and
                         must be converted.
   unknown, text, plain  Plain text. This is the format with no format. The
                         whole of the file is read in as a sequence. No
                         attempt is made to parse the file contents in any
                         way. Anything is acceptable in this format.
   raw                   Like unknown/text/plain format except that it
                         accepts only alphanumeric and whitespace characters
                         and rejects anything else.
   asis                  This is not so much a sequence format as a quick way
                         of entering a sequence on the command line, but it
                         is included here for completeness. Where a filename
                         would normally be given, in asis format there is the
                         sequence itself. An example would be:
                           asis::atacgcagttatctgaccat
                         In 'asis' format the name is the sequence so no file
                         needs to be opened. This is a special case. It was
                         intended as a joke, but could be quite useful for
                         generating command lines.

Output sequence formats

   To date, the following sequence formats are available as output.

   Some sequence formats can hold multiple sequences in one file; these
   are marked as multiple in the following table. The details of how many
   sequences are held in one file differ between formats, but they
   either allow many sequences to be concatenated one after the other, or
   they hold the sequences together in some sort of aligned set of
   sequences.

   Other formats, such as GCG, plain and staden formats, can only hold one
   sequence per file; these are marked as single.
   An attempt to concatenate several sequences in one file leaves the
   results as a mess that makes it impossible to decide where the
   sequences start and end or what is annotation and what is sequence.

   These single formats therefore cause problems when there are multiple
   sequences to write out because a single file containing multiple
   sequences in that format is invalid. When these formats are specified
   for output, an EMBOSS program will allow you to write many sequences
   to one file, but EMBOSS programs will not be able to reliably read in
   the resulting mess.

   N.B. This behaviour changed in EMBOSS version 1.7.0 (31 Oct 2000).
   Previously, EMBOSS programs that were asked to write multiple
   sequences in a single format would ignore the requested output file
   name and would write each sequence into a separate file whose name was
   constructed from the sequence name and the name of the format. This
   resulted in output to files whose names could not be reliably
   controlled. A decision was taken that EMBOSS users were intelligent
   people who could live with the consequences of their actions and who
   could learn not to write out multiple sequences to a file in formats
   that could not cope with multiple sequences.

   If you really wish to write multiple sequences out in formats that
   cannot cope with multiple sequences, you are advised to add the global
   qualifier -ossingle on the command line. This will force the EMBOSS
   program to ignore the given output file name and will generate its own
   file names. One sequence will be written to each such file. These file
   names are made from the sequence ID name, with the name of the format
   as the extension (e.g. hsfau.gcg).

   This is not ideal. Preferably, you should stay away from formats that
   can't cope with multiple sequences in a file.

   Output Format         Single/   Comments
                         Multiple
   gcg                   single    Wisconsin Package GCG 9.x and 10.x format
                                   with the sequence type on the first line
                                   of the file.
   gcg8                  single    GCG 8.x format where anything up to the
                                   first line containing ".." is considered
                                   as heading, and the remainder is sequence
                                   data.
   embl, em              multiple  EMBL entry format with available fields
                                   filled in and others with no information
                                   omitted. The EMBOSS command line allows
                                   missing data such as accession numbers to
                                   be provided if they are not obtainable
                                   from the input sequence.
   swiss, sw             multiple  SWISSPROT entry format with available
                                   fields filled in and others with no
                                   information omitted. The EMBOSS command
                                   line allows missing data such as
                                   accession numbers to be provided if they
                                   are not obtainable from the input
                                   sequence.
   fasta                 multiple  Standard Pearson FASTA format, but with
                                   the accession number included after the
                                   identifier if available.
   pearson               multiple  Simple Pearson FASTA format, an alias for
                                   "fasta" format.
   ncbi                  multiple  NCBI style FASTA format with the database
                                   name, entry name and accession number
                                   separated by pipe ("|") characters.
   nbrf, pir             multiple  NBRF (PIR) format, as used in the PIR
                                   database sequence files.
   genbank, gb           multiple  GENBANK entry format with available
                                   fields filled in and others with no
                                   information omitted. The EMBOSS command
                                   line allows missing data such as
                                   accession numbers to be provided if they
                                   are not obtainable from the input
                                   sequence.
   ig                    multiple  Intelligenetics format, as used by the
                                   Intelligenetics package
   codata                multiple  CODATA format.
   strider               multiple  DNA strider format
   acedb                 multiple  ACeDB format
   staden, experiment    single    The experiment file format used by the
                                   "gap" program in the Staden package. Some
                                   alternative nucleotide ambiguity codes
                                   are used and are converted.
   text, plain, raw      single    Plain sequence, no annotation or heading.
   fitch                 multiple  Fitch format
   msf                   multiple  Wisconsin Package GCG's MSF multiple
                                   sequence format.
   clustal, aln          multiple  Clustal multiple sequence format.
   phylip                multiple  PHYLIP non-interleaved format.
   phylip3               multiple  PHYLIP interleaved format.
   asn1                  multiple  A subset of ASN.1 containing entry name,
                                   accession number, description and
                                   sequence, similar to the current ASN.1
                                   output of readseq
   hennig86              multiple  Hennig86 format
   mega                  multiple  Mega format
   meganon               multiple  Meganon format
   nexus, paup           multiple  Nexus/PAUP format
   nexusnon, paupnon     multiple  Nexusnon/PAUPnon format
   jackknifer            multiple  Jackknifer format
   jackknifernon         multiple  Jackknifernon format
   treecon               multiple  Treecon format
   debug                 multiple  EMBOSS sequence object report for
                                   debugging showing all available fields.
                                   Not all fields will contain data - this
                                   depends very much on the input format
                                   used.

Future directions

   More formats, both for input and for output, can be easily added, so
   suggestions are always welcome.

Associated qualifiers

   As noted previously there are many 'associated' qualifiers that alter
   the behaviour of seqret when it reads in or writes out a sequence. As
   these are used in all EMBOSS programs that read in or write out
   sequences, they are not reported by the '-help' qualifier.
   They are however reported by the pair of qualifiers '-help -verbose'.

   Some of the more useful associated qualifiers are:

   Qualifier   Description
   -sbegin     The first position to be used in the sequence
   -send       The last position to be used in the sequence
   -sreverse   Use the reverse complement of a nucleic acid sequence
   -sask       Ask the user for begin/end/reverse information
   -slower     Convert the sequence to lower case
   -supper     Convert the sequence to upper case
   -sformat    Specify the input sequence format
   -osformat   Specify the output sequence format
   -ossingle   Write each entry into a separate file
   -auto       Turn off prompts and don't report the one-line description
   -stdout     Write the results to 'standard output' (the screen)
   -filter     Read input from another program, write to the screen
   -options    Prompt for optional qualifiers
   -help       Display a table of the command-line options

   The set of associated qualifiers for sequences behave in different