📄 emma.txt
      Dayhoff one (above) but are much more up to date and are based on a
      far larger data set. They appear to be more sensitive than the
      Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350
      matrices. We also supply an identity matrix which gives a score of
      1.0 to two identical amino acids and a score of zero otherwise. This
      matrix is not very useful. (Values: b (blosum); p (pam); g (gonnet);
      i (id); o (own))
*  -pwdnamatrix        menu       [i]
      The scoring table which describes the scores assigned to matches and
      mismatches (including IUB ambiguity codes). (Values: i (iub); c
      (clustalw); o (own))
*  -pairwisedatafile   infile
      Comparison matrix file (optional)
*  -matrix             menu       [b]
      This gives a menu where you are offered a choice of weight matrices.
      The default for proteins is the PAM series derived by Gonnet and
      colleagues. Note, a series is used! The actual matrix that is used
      depends on how similar the sequences to be aligned at this alignment
      step are. Different matrices work differently at each evolutionary
      distance.
      There are three 'in-built' series of weight matrices offered. Each
      consists of several matrices which work differently at different
      evolutionary distances. To see the exact details, read the
      documentation. Crudely, we store several matrices in memory,
      spanning the full range of amino acid distance (from almost
      identical sequences to highly divergent ones). For very similar
      sequences, it is best to use a strict weight matrix which only gives
      a high score to identities and the most favoured conservative
      substitutions. For more divergent sequences, it is appropriate to
      use 'softer' matrices which give a high score to many other frequent
      substitutions.
      1) BLOSUM (Henikoff). These matrices appear to be the best available
         for carrying out data base similarity (homology searches). The
         matrices used are: Blosum80, 62, 45 and 30.
      2) PAM (Dayhoff). These have been extremely widely used since the
         late '70s. We use the PAM 120, 160, 250 and 350 matrices.
      3) GONNET.
         These matrices were derived using almost the same procedure as
         the Dayhoff one (above) but are much more up to date and are
         based on a far larger data set. They appear to be more sensitive
         than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250
         and 350 matrices.
      We also supply an identity matrix which gives a score of 1.0 to two
      identical amino acids and a score of zero otherwise. This matrix is
      not very useful. Alternatively, you can read in your own (just one
      matrix, not a series). (Values: b (blosum); p (pam); g (gonnet); i
      (id); o (own))
*  -dnamatrix          menu       [i]
      This gives a menu where a single matrix (not a series) can be
      selected. (Values: i (iub); c (clustalw); o (own))
*  -mamatrixfile       infile
      Comparison matrix file (optional)
   -[no]slow           toggle     [Y]
      A distance is calculated between every pair of sequences and these
      are used to construct the dendrogram which guides the final multiple
      alignment. The scores are calculated from separate pairwise
      alignments. These can be calculated using 2 methods: dynamic
      programming (slow but accurate) or by the method of Wilbur and
      Lipman (extremely fast but approximate). The slow-accurate method is
      fine for short sequences but will be VERY SLOW for many (e.g. >100)
      long (e.g. >1000 residue) sequences.
*  -pwgapopen          float      [10.0]
      The penalty for opening a gap in the pairwise alignments. (Number
      0.000 or more)
*  -pwgapextend        float      [0.1]
      The penalty for extending a gap by 1 residue in the pairwise
      alignments. (Number 0.000 or more)
*  -ktup               integer    [1 for protein, 2 for nucleic]
      This is the size of exactly matching fragment that is used. INCREASE
      for speed (max= 2 for proteins; 4 for DNA), DECREASE for
      sensitivity. For longer sequences (e.g. >1000 residues) you may need
      to increase the default. (Integer from 0 to 4)
*  -gapw               integer    [3 for protein, 5 for nucleic]
      This is a penalty for each gap in the fast alignments. It has little
      effect on the speed or sensitivity except for extreme values.
      (Positive integer)
*  -topdiags           integer    [5 for protein, 4 for nucleic]
      The number of k-tuple matches on each diagonal (in an imaginary
      dot-matrix plot) is calculated. Only the best ones (with most
      matches) are used in the alignment. This parameter specifies how
      many. Decrease for speed; increase for sensitivity. (Positive
      integer)
*  -window             integer    [5 for protein, 4 for nucleic]
      This is the number of diagonals around each of the 'best' diagonals
      that will be used. Decrease for speed; increase for sensitivity.
      (Positive integer)
*  -nopercent          boolean    [N]
      Fast pairwise alignment: similarity scores: suppresses percentage
      score
   -gapopen            float      [10.0]
      The penalty for opening a gap in the alignment. Increasing the gap
      opening penalty will make gaps less frequent. (Positive floating
      point number)
   -gapextend          float      [5.0]
      The penalty for extending a gap by 1 residue. Increasing the gap
      extension penalty will make gaps shorter. Terminal gaps are not
      penalised. (Positive floating point number)
   -[no]endgaps        boolean    [Y]
      End gap separation: treats end gaps just like internal gaps for the
      purposes of avoiding gaps that are too close (set by 'gap separation
      distance'). If you turn this off, end gaps will be ignored for this
      purpose. This is useful when you wish to align fragments where the
      end gaps are not biologically meaningful.
   -gapdist            integer    [8]
      Gap separation distance: tries to decrease the chances of gaps being
      too close to each other. Gaps that are less than this distance apart
      are penalised more than other gaps. This does not prevent close
      gaps; it makes them less frequent, promoting a block-like appearance
      of the alignment. (Positive integer)
*  -norgap             boolean    [N]
      Residue specific penalties: amino acid specific gap penalties that
      reduce or increase the gap opening penalties at each position in the
      alignment or sequence.
      As an example, positions that are rich in glycine are more likely to
      have an adjacent gap than positions that are rich in valine.
*  -hgapres            string     [GPSNDQEKR]
      This is a set of the residues 'considered' to be hydrophilic. It is
      used when introducing Hydrophilic gap penalties. (Any string is
      accepted)
*  -nohgap             boolean    [N]
      Hydrophilic gap penalties: used to increase the chances of a gap
      within a run (5 or more residues) of hydrophilic amino acids; these
      are likely to be loop or random coil regions where gaps are more
      common. The residues that are 'considered' to be hydrophilic are set
      by '-hgapres'.
   -maxdiv             integer    [30]
      This switch delays the alignment of the most distantly related
      sequences until after the most closely related sequences have been
      aligned. The setting shows the percent identity level required to
      delay the addition of a sequence; sequences that are less identical
      than this level to any other sequences will be aligned later.
      (Integer from 0 to 100)

   Advanced (Unprompted) qualifiers: (none)

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-dendoutfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

The input is two or more sequences.

Input files for usage example

File: globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

EMBOSS programs do not allow you to simply type the names of two or more
files or database entries - they try to interpret this as all one file-name
and complain that a file of that name does not exist.

In order to enter the sequences that you wish to align, you must group them
in one of three ways: either make a 'list file' or place several sequences
in a single sequence file or specify the sequences using wildcards.

Making a List file

A list file is a text file that holds the names of database entries and/or
sequence files. You should use a text editor such as pico or nedit to edit
a file to contain the names of the sequence files or database entries.
There must be one sequence per line.

An example is the file 'fred' which contains:

__________________________________________________________________________
opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list
__________________________________________________________________________

This List file contains:

  * opsd_abyko.fasta - this is the name of a sequence file. The file is
    read in from the current directory.
  * sw:opsd_xenla - this is a reference to a specific sequence in the
    SwissProt database
  * sw:opsd_c* - this represents all the sequences in SwissProt whose
    identifiers start with 'opsd_c'
  * another_list - this is the name of a second list file. List files can
    be nested!

Notice the @ in front of the last entry. This is the way you tell EMBOSS
that this file is a List file, not a regular sequence file. That last line
was put there both as an indication of the way you tell EMBOSS that a file
is a List file and to emphasise that List files can contain other List
files.

When emma asks for the sequences to align, you should type '@fred'. The '@'
character tells EMBOSS that this is the name of a List file.
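
The list-file rules described above (one entry per line, with a leading '@'
marking a nested List file) can be sketched in a few lines of Python. This
is an illustration only and not how EMBOSS itself reads List files: the
function name expand_list_file, its load parameter, and the contents given
for 'another_list' (sw:opsd_human) are assumptions made for the example.
Wildcard entries such as sw:opsd_c* are passed through unchanged, since
resolving them needs the database itself.

```python
from pathlib import Path
from typing import Callable, List, Optional, Set


def read_from_disk(name: str) -> str:
    """Default loader: read the text of a list file from the current directory."""
    return Path(name).read_text()


def expand_list_file(
    name: str,
    load: Callable[[str], str] = read_from_disk,
    _active: Optional[Set[str]] = None,
) -> List[str]:
    """Flatten a list file into its entries, following nested '@' references.

    Hypothetical helper for illustration; EMBOSS does this internally.
    """
    active = set() if _active is None else _active
    if name in active:
        # A list file that (directly or indirectly) names itself never ends.
        raise ValueError(f"list file {name!r} includes itself")
    active.add(name)

    entries: List[str] = []
    for raw in load(name).splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            # '@' marks another list file: expand it in place.
            entries.extend(expand_list_file(line[1:], load, active))
        else:
            # A sequence file name or database reference; wildcards untouched.
            entries.append(line)

    active.discard(name)
    return entries


if __name__ == "__main__":
    # In-memory stand-ins for the files; 'another_list' content is made up.
    demo_files = {
        "fred": "opsd_abyko.fasta\nsw:opsd_xenla\nsw:opsd_c*\n@another_list\n",
        "another_list": "sw:opsd_human\n",
    }
    for entry in expand_list_file("fred", demo_files.__getitem__):
        print(entry)
```

Passing a loader function keeps the sketch testable without touching the
file system; with the default loader it reads real files from the current
directory, which matches the way 'opsd_abyko.fasta' is located in the
example above.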

An alternative to editing a file and laboriously typing in all of the names
you require is to make a list of a directory containing the sequence files
and then to edit the list file to remove the names of the sequence files
that you do not require. To make a list of all the files in the current
directory that end in