<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!-- ===================================================================
  File    : moss.html
  Contents: Description of molecular substructure miner
  Author  : Christian Borgelt
==================================================================== -->
<html>
<head>
<title>MoSS Documentation</title>
</head>
<!-- =============================================================== -->
<body bgcolor=white>
<h1><a name="top">MoSS</a></h1>
<h3>Molecular Substructure Miner</h3>
<p>(aka <b>MoFa</b> - <b>Mo</b>lecular <b>F</b>r<b>a</b>gment Miner)</p>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<p><a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/moss.html">Download page</a> with most recent version.</p>
<p>This documentation refers to version 4.9 (2006.08.13) of the MoSS program.<br>
Older versions of the program may differ in some aspects.</p>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3>Contents</h3>
<p><ul type=disc>
<li><a href="#intro">Introduction</a></li>
<li><a href="#langs">Molecule Description Languages</a>
<li><a href="#gui">Graphical User Interface</a>
<li><a href="#invoke">Command Line Program</a>
<li><a href="#options">Command Line Options</a>
<li><a href="#input">Format of the Input File</a>
<li><a href="#output">Formats of the Output Files</a>
<li><a href="#example1">Example 1: Artificial Data</a>
<li><a href="#example2">Example 2: Steroids Data</a>
<li><a href="#publ">Publications</a></li>
<li><a href="#download">Download</a></li>
<li><a href="#copying">Copying</a></li>
<li><a href="#contact">Contact</a></li>
<li><a href="#links">Useful Links</a></li>
</ul></p>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="intro">Introduction</a></h3>
<p>MoSS is a program to find frequent molecular
substructures and discriminative fragments in a database of molecule descriptions. It can be used in the context of drug discovery and synthesis prediction for the purpose of analyzing the outcome of screening tests.</p>
<p>Given a database of molecules, MoSS finds all closed frequent substructures, that is, all substructures that appear with a user-specified minimum frequency in the database and do not have superstructures that occur with the same frequency. Alternatively, it finds molecular fragments that are frequent in the focus part of the database, but rare in the complement part, that is, that appear with at least a user-specified minimum frequency in the focus part of the database (and do not have superstructures that occur with the same frequency), but with no more than a user-specified maximum frequency in the complement part of the database. Such molecular substructures discriminate between the two parts of the database and thus may be called <i>discriminative fragments</i>.</p>
<p>The algorithm underlying MoSS is inspired by the Eclat algorithm for frequent item set mining [Zaki et al. 1997] and was first published in <a href="#borgelt_and_berthold_2002">[Borgelt and Berthold 2002]</a>. Apart from the default MoSS/MoFa algorithm, this program also contains the gSpan algorithm [Yan and Han 2002] (or rather its extension CloseGraph [Yan and Han 2003]) as a special processing mode.</p>
<p>The first version of the MoSS program was developed in cooperation with <a href="http://www.tripos.com">Tripos, Inc.</a>, Data Analysis Research Lab, South San Francisco, CA, USA. I am very grateful to Tripos, Inc., for giving me the opportunity to carry out the research that led to the development of this program.</p>
<p>The MoSS algorithm is currently also studied at the <a href="http://www.inf.uni-konstanz.de/bioml/">ALTANA Chair of Applied Computer Science (M.R. 
Berthold)</a> of the <a href="http://www.uni-konstanz.de/">University of Konstanz</a>, where a <a href="http://www.inf.uni-konstanz.de/bioml/research/mofa/overview.html">sister page</a> can be found.</p>
<p>Enjoy,<br><a href="mailto:christian.borgelt@softcomputing.es">Christian Borgelt</a></p>
<table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right> <a href="#top">back to the top</a> </td> <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="langs">Molecule Description Languages</a></h3>
<p>The MoSS program works on textual descriptions of molecules. In order to describe molecules by simple texts, so that they can be fed into a computer program, one needs a linear notation for the structure of a molecule. MoSS supports two such notation languages for molecules:</p>
<ul>
<li><b>SMILES</b> - <b>S</b>implified <b>M</b>olecular <b>I</b>nput <b>L</b>ine <b>E</b>ntry <b>S</b>ystem<br> SMILES was developed in 1986 by David Weininger at USEP (US Environmental Research Laboratory). It is a commonly used exchange format for molecular structures. A very detailed <a href="http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html"> tutorial</a> that describes this language is available from <a href="http://www.daylight.com/">Daylight, Inc.</a>, which also provides a <a href="http://www.daylight.com/daycgi_tutorials/depict.cgi"> molecule renderer</a>.</li>
<li><b>SLN</b> - <b>S</b>YBYL<sup>®</sup> <b>L</b>ine <b>N</b>otation<br> SLN was developed by <a href="http://www.tripos.com/">Tripos, Inc.</a> for their product SYBYL<sup>®</sup> (a software package for molecular modelers). I have not yet found a good online tutorial for SLN, just some very brief descriptions. 
However, several books on (bio)chemistry contain descriptions of this language.</li>
</ul>
<p><table>
<tr><td valign="top"><b>Example:</b></td><td width=20> </td> <td><img src="steroid_a.png"></td></tr>
<tr><td valign="top"><b>SMILES</b>:</td><td></td> <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)(O)C#C)C)O</td></tr>
<tr><td valign="top"><b>SLN</b>:</td><td></td> <td>C[1]H:CH:C(:CH:C[8]:C:@1C[10]HCH(CH2CH2@8)C[20]HC(CH2CH2@10)(C(CH2CH2@20)(OH)C#CH)CH3)OH</td></tr>
</table></p>
<p>MoSS can read and write both of these description languages, but may make mistakes in interpreting certain codes in the SMILES format due to the fact that SMILES uses implicit single <i>and</i> aromatic bonds. An example is the code <tt>c1ccccc1c2ccccc2</tt> where the type of the bond between the two benzene rings is (wrongly) set to <i>aromatic</i> instead of <i>single</i>. Such misinterpretations can only be removed by including valence rules into the parser, which is a lot of tedious work I have not found time to do yet. To prevent such mistakes, it is therefore recommended to avoid implicit bonds, or at least to write molecules like the one above as <tt>c1ccccc1-c2ccccc2</tt> in SMILES. MoSS output <i>never</i> contains implicit bonds: all bonds are stated explicitly.</p>
<p>MoSS ignores the hydrogen atoms in the SLN description, since they are implicit in SMILES. 
The idea is that the result of the search should be the same, independent of which description language is used for the data.</p>
<table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right> <a href="#top">back to the top</a> </td> <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="gui">Graphical User Interface</a></h3>
<p>The MoSS program comes in two flavors: one has a simple graphical user interface, which allows all parameters to be set in a tabbed dialog box, while the other has to be invoked on the command line, providing the files to work on and the processing options as arguments to the program. This section describes the version with a graphical user interface, which may be easier to use for novice users. The command line version is described in the <a href="#invoke">next section</a>.</p>
<p>The MoSS version with a graphical user interface can be started with the command</p>
<pre>java -jar moss.jar [&lt;config&gt;]</pre>
for the Java archive (under Windows it may also be started by simply clicking on the file <tt>moss.jar</tt>) or
<pre>java moss.MoSS [&lt;config&gt;]</pre>
<p>for the compiled source code (assuming in the latter case that the current working directory is the parent directory of the <tt>moss</tt> source directory, the <tt>CLASSPATH</tt> environment variable has been set appropriately, or a proper class path is set with the command line option <tt>-classpath</tt> of the java command). 
The only parameter, which is optional, is the name of a configuration file, from which the different fields in the dialog can be preset.</p>
<p>The dialog window contains a set of tabs at the top, in which parameters and options can be set, and a button bar at the bottom, with which the search can be started, the program can be terminated, and the configuration of the dialog window can be saved and loaded.</p>
<p>Since it is inconvenient to set all needed parameters anew every time the MoSS program is started, it is possible to save the configuration of the dialog window into a file. To do so, simply press the <tt>Save</tt> button at the bottom of the window and select the file into which you want the configuration to be written. To load a configuration, press the <tt>Load</tt> button and select the configuration file to be loaded.</p>
<p>After all parameters have been specified, the search can be started by pressing the <tt>Run</tt> button. While the search is running, the number of found substructures is printed in the status line (this number is updated once per second). The <tt>Run</tt> button changes into an <tt>Abort</tt> button, with which the search can be aborted. Note that all substructures found up to this point have already been written to the output file, so a (partial) result is available even if the search is aborted. At the end of a search that was not aborted, a dialog box reports the number of found substructures and the total search time.</p>
<p>On the first tab the input and output formats and files can be selected and a seed substructure, from which the search is to be started, may be specified:</p>
<p><img src="files.png"></p>
<p>By default, the SMILES format is used for all input and output and the names <tt>moss.dat</tt> and <tt>moss.sub</tt> are used for the molecule input file and substructure output file. The name of the molecule identifier file is left empty, indicating that this file will not be written. 
It is written only if the corresponding field is filled. Information about the format of the input and output files can be found <a href="#input">here (input file)</a> and <a href="#output">here (output files)</a>.</p>
<p>By default, the seed is a star (it may also be left empty), meaning that the search is started from an empty substructure. Seed structures that are bigger than a single atom can slow down the search considerably (technically, because in the current state of the program canonical form pruning cannot be used in this case) and thus should be used with care.</p>
<p>On the second tab the basic parameters may be specified:</p>
<p><img src="params.png"></p>
<p>The input molecule data set is split into two subsets, which are called the <i>focus</i> and the <i>complement</i>. By default, all molecules which have an associated value no greater than 0.5 (the threshold for the split) are placed in the focus, all other molecules are placed into the complement. This division into focus and complement may be inverted by checking the "Invert split" box.</p>
<p>With the next fields it can be specified what minimum support a substructure must have in the focus and what maximum support it may have in the complement in order to be reported. Both values are given