'.pep', type:

    ls *.pep > listfile

Several sequences in one file

EMBOSS can read in a single file which contains many sequences. Each of the sequences in the file must be in the same format: if the first sequence is in EMBL format, then all the others must be in EMBL format.

There are some sequence formats that cannot be used when placing many sequences in the same file. These are sequence formats that have no clear indication of where the sequence ends and the annotation of the next sequence starts. These formats include plain or text format (no real format, just the sequence), staden and gcg.

If your sequences are not already in a single file, you can place them in one using seqret. The following example takes all the files ending in '.pep' and places them in the file 'mystuff' in Fasta format.

    seqret "*.pep" mystuff

When emma asks for the sequences to align, you should type 'mystuff'.

Using wildcards

'Wildcard' characters are characters that are expanded to match all possible matching files or entries in a database.

By far the most commonly used wildcard character is '*', which matches any number (or zero) of possible characters at that position in the name. A less commonly used wildcard character is '?', which matches any one character at that position.

For example, when emma asks for sequences to align, you could answer:

    abc*.pep

This would select any files whose name starts with 'abc' and ends in '.pep'; the centre of the name, where there is a '*', can be anything.

Both file names and database entry names can be wildcarded.

There is a slightly irritating problem that occurs when wildcards are used on the Unix command line. (This is the line that you type against the 'Unix' prompt together with the program name.) In this case the Unix session gets the command line first, expands the wildcards, and only then runs the program, passing it the expanded parameters. When Unix expands the wildcards, one of two things can go wrong.
You may have specified wildcarded database entries: the Unix system tries to find files that match that specification, fails, and refuses to run the program. Alternatively, you may have specified wildcarded files: Unix finds them and gives the name of each of them to the program as a separate parameter, so emma gets the wrong number of parameters and refuses to run.

You get round this by quoting the wildcard. You can either put the whole wildcarded name in quotes:

    "abc*.pep"

or you can quote just the '*' using a '\':

    abc\*.pep

This problem does not occur when you reply to the prompt from the program for the input sequences, or when you are typing the wildcarded file name in a web browser or GUI (such as Jemboss or SPIN) field.

Output file format

Output files for usage example

File: hbb_human.aln

>HBB_HUMAN
--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP----ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
>HBB_HORSE
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDP----ENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
>HBA_HUMAN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP----VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
>HBA_HORSE
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF------DLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDP----VNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
>MYG_PHYCA
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPI----KYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQ----VDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY-------------
>LGB2_LUPLU
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKG--TSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEA
ILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

File: hbb_human.dnd

((((HBB_HUMAN:0.08080,HBB_HORSE:0.08359):0.21952,(HBA_HUMAN:0.05452,HBA_HORSE:0.06605):0.21070):0.06034,MYG_PHYCA:0.39882):0.01490,GLB5_PETMA:0.38267,LGB2_LUPLU:0.50324);

Sequences

emma writes the aligned sequences and a dendrogram file showing how the sequences were clustered during the progressive alignments.

The clustalw output sequences are reformatted into the default EMBOSS output format instead of being left as Clustal-format '.aln' files.

Trees

Believe it or not, we now use the New Hampshire (nested parentheses) format as default for our trees. This format is compatible with e.g. the PHYLIP package. If you want to view a tree, you can use the RETREE or DRAWGRAM/DRAWTREE programs of PHYLIP. This format is used for all our trees, even the initial guide trees for deciding the order of multiple alignment.

The output trees from the phylogenetic tree menu can also be requested in our old verbose/cryptic format. This may be more useful if, for example, you wish to see the bootstrap figures. The bootstrap trees in the default New Hampshire format give the bootstrap figures as extra labels which can be viewed very easily using TREETOOL which is available as part of the GDE package. TREETOOL is available from the RDP project by ftp from rdp.life.uiuc.edu.

The New Hampshire format is only useful if you have software to display or manipulate the trees. The PHYLIP package is highly recommended if you intend to do much work with trees and includes programs for doing this. WE DO NOT PROVIDE ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY.

Data files

The comparison matrices available for clustalw are not EMBOSS matrix files, as they are defined in the clustalw code.
The matrices available for carrying out a protein sequence alignment are:

  * blosum
  * pam
  * gonnet
  * id
  * user defined

The comparison matrices available in clustalw for carrying out a nucleotide sequence alignment are:

  * iub
  * clustalw
  * user defined

Notes

None

References

The main reference for ClustalW is Thompson et al. (reference 1 below).

1. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Research, 22:4673-4680.
2. Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360.
3. Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.
4. Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352, NBRF, Washington.
5. Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915-10919.
6. Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci. USA 86, 4412-4415.
7. Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337.
8. Gotoh, O. (1993). CABIOS 9, 361-370.
9. Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309.
10. Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20, 2511-2516.
11. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993). Science, 262, 208-214.
12. Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12.
13. Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471.
14. Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed., chapter 13, pp. 323-358.
15. Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777-8781.
16. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29.
17. Lüthy, R., Xenarios, I. and Bucher, P.
    (1994). Protein Science, 3, 139-146.
18. Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244.
19. Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153.
20. Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191.
21. Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H. Freeman, San Francisco.
22. Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425.
23. Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726-730.
24. Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett. 307, 55-61.
25. Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992). Nature, 359, 851-855.
26. Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216.
27. Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17.
28. Thompson, J.D. (1994). CABIOS, (Submitted).
29. Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46.
30. Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444-2448.
31. Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12, 387-395.
32. Felsenstein, J. (1989). Cladistics 5, 164-166.
33. Kimura, M. (1980). J. Mol. Evol. 16, 111-120.
34. Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.
35. Felsenstein, J. (1985). Evolution 39, 783-791.
36. Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41.
37. Krogh, A., Brown, M., Mian, S., Sjölander, K. and Haussler, D. (1994) J. Mol. Biol. 235, 1501-1531.
38. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994). FEBS Lett. 339, 269-275.
39. Bairoch, A. and Boeckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022.
40. Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and Wierenga, R.K. (1993) EMBO J. 12, 2617-2624.
41. Kabsch, W. and Sander, C.
    (1983) Biopolymers, 22, 2577-2637.

Warnings

None.

Diagnostic Error Messages

"cannot find program 'clustalw'" means that the ClustalW program has not been set up on your site or is not in your environment (i.e. is not on your path). The solutions are to (1) install clustalw in the path so that emma can find it with the command "clustalw", or (2) define a variable called EMBOSS_CLUSTALW (either an environment variable, or a variable in emboss.defaults or your .embossrc file) containing the command (program name or full path) to run clustalw if you have it elsewhere on your system.

Exit status

It exits with status 0 unless an error is reported.

Known bugs

None.

See also

Program name   Description
edialign       Local multiple alignment of sequences
infoalign      Information on a multiple sequence alignment
plotcon        Plot quality of conservation of a sequence alignment
prettyplot     Displays aligned sequences, with colouring and boxing
showalign      Displays a multiple sequence alignment
tranalign      Align nucleic coding regions given the aligned proteins

Author(s)

Mark Faller (current e-mail address unknown)
while he was with:
HGMP-RC, Genome Campus, Hinxton, Cambridge CB10 1SB, UK

History

Completed 18 February 1999

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None
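
As a small illustration of the New Hampshire (nested parentheses) format described under Trees, the following Python sketch pulls the leaf names and branch lengths out of the hbb_human.dnd guide tree shown under Output file format. It is not part of EMBOSS or clustalw; the names NEWICK, LEAF_PATTERN, leaf_branch_lengths and parentheses_balanced are invented for this example. It relies on a simple regular expression rather than a full tree parser, so it assumes plain unquoted labels like the ones in the example output.

```python
import re

# Guide tree copied from the hbb_human.dnd example output in this document.
NEWICK = (
    "((((HBB_HUMAN:0.08080,HBB_HORSE:0.08359):0.21952,"
    "(HBA_HUMAN:0.05452,HBA_HORSE:0.06605):0.21070):0.06034,"
    "MYG_PHYCA:0.39882):0.01490,GLB5_PETMA:0.38267,LGB2_LUPLU:0.50324);"
)

# A leaf is written "name:length". Internal branch lengths follow a ")"
# and have no name, so requiring a name before the colon skips them.
# Assumption: labels are unquoted and start with a letter or underscore.
LEAF_PATTERN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*):([0-9]+\.?[0-9]*)")


def leaf_branch_lengths(newick):
    """Return (name, branch_length) pairs for the leaves, in file order."""
    return [(name, float(length)) for name, length in LEAF_PATTERN.findall(newick)]


def parentheses_balanced(newick):
    """Cheap sanity check: every '(' must be closed by a later ')'."""
    depth = 0
    for char in newick:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


leaves = leaf_branch_lengths(NEWICK)
for name, length in leaves:
    print(f"{name}\t{length:.5f}")
```

For anything beyond a quick check like this (rerooting, drawing, bootstrap labels), a dedicated tree package such as PHYLIP, recommended above, is the better tool.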