<tr><td valign="top"><tt>-E</tt></td><td>&nbsp;</td>
    <td>bond-by-bond support-filtered ring extensions (includes -O)</td></tr>
<tr><td valign="top"><tt>-O</tt></td><td>&nbsp;</td>
    <td>do not record fragments with open rings of marked sizes</td></tr>
<tr><td valign="top"><tt>-K</tt></td><td>&nbsp;</td>
    <td>do not convert Kekul&eacute; representations to aromatic rings</td></tr>
</table></p>
<p>In a Kekul&eacute; representation an aromatic ring has alternating
single and double bonds. To avoid mismatches due to different
representations of aromatic rings, it is recommended to convert all
Kekul&eacute; representations to rings with aromatic bonds.</p>
<p><b>Carbon Chain Options</b> make it possible to find and match chains
of varying length that consist only of carbon atoms that are connected
by single bonds and do not have any branches.</p>
<p><table>
<tr><td valign="top"><tt>-C&nbsp;&nbsp;&nbsp;</tt></td><td>&nbsp;</td>
    <td>find and match variable length chains of carbon atoms</td></tr>
</table></p>
<p><b>Extension Options</b> switch between different restricted
extensions that are used in the search. Details about restricted
extensions can be found in <a href="#borgelt_2005">[Borgelt 2005]</a>.
<p><table>
<tr><td valign="top"><tt>-g&nbsp;&nbsp;&nbsp;</tt></td><td>&nbsp;</td>
    <td>use rightmost path extensions (default: max. source)</td></tr>
</table></p>
<p>Rightmost path extensions are the extension type used in the
gSpan algorithm [Yan and Han 2002] and its extension CloseGraph
[Yan and Han 2003]. Hence, by specifying <tt>-g</tt> one can switch
to these algorithms. The MoSS/MoFa algorithm uses maximum source
index extensions, see <a href="#borgelt_2005">[Borgelt 2005]</a>.</p>
<p><b>Pruning Options</b> control the pruning of the search tree.
The default is usually the best choice. For details about the
pruning methods, see <a href="#borgelt_2005">[Borgelt 2005]</a>
(canonical form pruning) and <a href="#borgelt_et_al_2004">[Borgelt et al.
2004]</a> (other pruning methods).</p>
<p><table>
<tr><td valign="top"><tt>+/-P&nbsp;</tt></td><td>&nbsp;</td>
    <td>partial perfect extension pruning (default: no/-)</td></tr>
<tr><td valign="top"><tt>+/-p</tt></td><td>&nbsp;</td>
    <td>full perfect extension pruning (default: yes/+)</td></tr>
<tr><td valign="top"><tt>+/-e</tt></td><td>&nbsp;</td>
    <td>equivalent sibling pruning (default: no/-)</td></tr>
<tr><td valign="top"><tt>+/-q</tt></td><td>&nbsp;</td>
    <td>canonical form pruning (default: yes/+)</td></tr>
</table></p>
<p>If canonical form pruning is not used, duplicate substructures are
found and eliminated with the help of a repository of already processed
substructures. This is usually considerably slower.</p>
<p><b>Memory Saving Options</b> control how the embeddings of
substructures are handled during the search.</p>
<p><table>
<tr><td valign="top"><tt>-M#&nbsp;&nbsp;</tt></td><td>&nbsp;</td>
    <td>maximal number of embeddings per molecule (to save memory)</td></tr>
</table></p>
<p>This option can reduce the amount of memory needed in the search,
but usually slows down the search process.</p>
<p><b>Debug Options</b> have been introduced for debugging purposes,
but may also be useful for testing the program and understanding how
the algorithms work.</p>
<p><table>
<tr><td valign="top"><tt>-N&nbsp;&nbsp;&nbsp;</tt></td><td>&nbsp;</td>
    <td>normalize fragment output form (for result comparisons)</td></tr>
<tr><td valign="top"><tt>-v</tt></td><td>&nbsp;</td>
    <td>verbose output during search (show search tree)</td></tr>
</table></p>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right>
    <a href="#top">back to the top</a>&nbsp;</td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr>
</table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="input">Format of the Input File</a></h3>
<p>The input file, which
contains the molecular database, is expected
to be a text file with one molecule description per line. Each molecule
is described by an identifier, a value that is used for classifying the
molecule into the focus or the complement part of the database, and a
description of the molecule's structure, using either the SMILES or the
SLN format (see <a href="#langs">Molecule Description Languages</a>).</p>
<p>Each line of the input file has the general format</p>
<pre>&lt;id&gt; , &lt;value&gt; , &lt;desc&gt;</pre>
<p>where</p>
<p><table border=0 cellpadding=0 cellspacing=0>
<tr><td valign="top"><tt>&lt;id&gt;</tt>&nbsp;&nbsp;</td>
    <td>is a molecule identifier that may be any string, provided
        it does not contain a comma, a space, or a tab character.
        The molecule identifier is used to refer to molecules in
        the output of the program, so it is recommended that it
        be unique within the database to analyze. However, the program
        neither requires nor checks uniqueness and will also work if
        the same identifier is used twice or even more often.</td></tr>
<tr><td valign="top"><tt>&lt;value&gt;</tt>&nbsp;&nbsp;</td>
    <td>is a real value that is compared to a user-specified
        threshold in order to split the database into a focus
        part and its complement. By default, all molecules with
        a value &le; 0.5 are placed into the focus; all other
        molecules are placed into the complement. This behavior
        can be changed with the options <tt>-t</tt> (threshold
        value) and <tt>-z</tt> (invert split), see
        <a href="#options">Options</a>.</td></tr>
<tr><td valign="top"><tt>&lt;desc&gt;</tt>&nbsp;&nbsp;</td>
    <td>is a description of the molecule in a linear notation
        language. Currently the SMILES format and the SLN format
        are supported (see <a href="#langs">Molecule Description
        Languages</a>).
        By default the SMILES format is used, but
        this may be changed using the option <tt>-i</tt> (see
        <a href="#options">Options</a>).</td></tr></td></tr>
</table></p>
<p>Instead of commas, spaces and tabs may also be used to separate
the three fields. This implies that the molecule identifier must not
contain spaces or tabs, since otherwise an input line cannot be
split correctly into the three fields.</p>
<p>Empty lines in the input file are simply ignored, as are lines
starting with a <tt>#</tt>, which may be used to put comments into an
input file.</p>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right>
    <a href="#top">back to the top</a>&nbsp;</td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr>
</table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="output">Formats of the Output Files</a></h3>
<p>MoSS writes one or two output files, depending on how many file
names were provided on the command line: a substructure file (always
written) and a molecule identifier file (optional).</p>
<h4>Substructure File</h4>
<p>The substructure output file contains the found substructures,
one per line, together with some additional information. Each found
substructure is described by eight fields.</p>
<p>The first line of this output file is always</p>
<pre>id,desc,atoms,bonds,s_abs,s_rel,c_abs,c_rel</pre>
<p>which indicates the meaning of the fields in the following lines.<br>
Consequently the following lines have the general form</p>
<pre>&lt;id&gt; , &lt;desc&gt; , &lt;atoms&gt; , &lt;bonds&gt; , &lt;s_abs&gt; , &lt;s_rel&gt; , &lt;c_abs&gt; , &lt;c_rel&gt;</pre>
<p><table border=0 cellpadding=0 cellspacing=0>
<tr><td valign="top"><tt>&lt;id&gt;</tt></td>
    <td>is an identifier for the substructure, which is a simple
        consecutive number, starting with 1 for the first substructure.
        Note that the order in which substructures are reported depends
        on the search process and may differ depending on the selected
        search mode.</td></tr>
<tr><td valign="top"><tt>&lt;desc&gt;</tt></td>
    <td>is a description of the substructure in either SMILES
        or SLN format (see <a href="#langs">Molecule Description
        Languages</a>). By default SMILES is used, but this may be
        changed using the option <tt>-o</tt>.</td></tr>
<tr><td valign="top"><tt>&lt;atoms&gt;</tt>&nbsp;&nbsp;</td>
    <td>is the number of atoms in the substructure.</td></tr>
<tr><td valign="top"><tt>&lt;bonds&gt;</tt>&nbsp;&nbsp;</td>
    <td>is the number of bonds in the substructure.</td></tr>
<tr><td valign="top"><tt>&lt;s_abs&gt;</tt></td>
    <td>is the absolute support of the substructure in the focus part
        of the database, that is, the number of molecules in the focus
        part that contain this substructure.</td></tr>
<tr><td valign="top"><tt>&lt;s_rel&gt;</tt></td>
    <td>is the relative support of the substructure in the focus part
        of the database, that is, the percentage of molecules in the
        focus part that contain this substructure.</td></tr>
<tr><td valign="top"><tt>&lt;c_abs&gt;</tt></td>
    <td>is the absolute support of the substructure in the complement
        part of the database, that is, the number of molecules in the
        complement part that contain this substructure.</td></tr>
<tr><td valign="top"><tt>&lt;c_rel&gt;</tt></td>
    <td>is the relative support of the substructure in the complement
        part of the database, that is, the percentage of molecules in
        the complement part that contain this substructure.</td></tr>
</table></p>
<h4>Molecule Identifier File</h4>
<p>The molecule identifier output file contains, for each found
substructure, a list of the molecules the substructure is contained in.
Each line corresponds to one substructure, which is referred to by
its identifier (see above).
The containing molecules are also referred
to by their identifiers as they were specified in the input file (see
<a href="#input">Input File</a>).</p>
<p>Each line of this output file has the general format</p>
<pre>&lt;subid&gt; , &lt;molid&gt; [ , &lt;molid&gt; ]<sup>*</sup></pre>
<p>where</p>
<p><table border=0 cellpadding=0 cellspacing=0>
<tr><td valign="top"><tt>&lt;subid&gt;</tt>&nbsp;&nbsp;</td>
    <td>is the identifier of a substructure as it is specified
        in the substructure output file, that is, a number between
        1 and the number of found substructures.</td></tr>
<tr><td valign="top"><tt>&lt;molid&gt;</tt>&nbsp;&nbsp;</td>
    <td>is the identifier of a molecule that contains the substructure
        as it was specified in the input file.</td></tr>
</table></p>
<p>The order in which the molecules are listed is the same as the
order in the input file, with the only exception that the molecules
in the focus part of the database precede the molecules in the
complement part.</p>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right>
    <a href="#top">back to the top</a>&nbsp;</td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr>
</table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="example1">Example&nbsp;1: Artificial Data</a></h3>
<p>As a first example of how to apply the MoSS program, we consider
the artificial data set used in <a href="#borgelt_and_berthold_2002">
[Borgelt and Berthold 2002]</a>, which consists of six simple molecules.
(Note that these molecules were constructed to demonstrate certain
properties of the search algorithm and have no chemical significance;
they may not even be possible as actual molecules.)</p>
<table border=1 cellpadding=4>
<tr><th>id</th><th>molecule</th><th>SMILES description</th></tr>
<tr><td>a</td><td><img src="ex_a.png"></td><td>CCS(O)(O)N</td></tr>
<tr><td>b</td><td><img
src="ex_b.png"></td><td>CCS(O)(C)N</td></tr>
<tr><td>c</td><td><img src="ex_c.png"></td><td>CS(O)(C)N</td></tr>
<tr><td>d</td><td><img src="ex_d.png"></td><td>CCS(=N)N</td></tr>
<tr><td>e</td><td><img src="ex_e.png"></td><td>CS(=N)N</td></tr>
<tr><td>f</td><td><img src="ex_f.png"></td><td>CS(=N)O</td></tr>
</table>
<p>This data set can be found on the
<a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/moss.html">download page</a>
for the MoSS program and in the source package
(directory <tt>moss/data</tt>). It is available in both SMILES and
SLN format.</p>
<p>For the SMILES format, the file is called <tt>example1.smiles</tt>
and looks like this:</p>