<pre>a,0,CCS(O)(O)N
b,0,CCS(O)(C)N
c,0,CS(O)(C)N
d,0,CCS(=N)N
e,0,CS(=N)N
f,0,CS(=N)O</pre>
<p>For the SLN format, the file is called <tt>example1.sln</tt>,
and it looks the same, since I omitted hydrogen atoms and thus there
is no difference in this case between the two description languages.</p>
<p>The first column states the identifiers of the molecules, which are
simply the letters <tt>a</tt> to <tt>f</tt>. The value associated with
each molecule (second column) is 0, so that by default all molecules
are placed into the focus part of the database while the complement
part is empty.</p>
<p>We now process the file in SMILES format with the graphical user
interface. To do so, select <tt>example1.smiles</tt> as the "Molecule
input file", <tt>example1.sub</tt> as the "Substructure output file",
and <tt>example1.ids</tt> as the "Molecule identifier file" (all on
the first tab). Then go to the second tab and set the minimum support in
the focus to 50%. Finally, go to the "Miscellaneous" tab and check
"Verbose message output". Then press the <tt>Run</tt> button at
the bottom of the dialog window.</p>
<p>To process the file in SMILES format with the command line version,
use the command</p>
<pre>java -classpath moss.jar moss.Miner -s50 -v "S" example1.smiles example1.sub example1.ids</pre>
<p>The option <tt>-s50</tt> means that a substructure has to be
contained in at least 50% of the molecules (that is, 3 molecules)
in order to be reported. The option <tt>-v</tt> enables us to inspect
the search tree that is traversed. The first argument, <tt>"S"</tt>,
specifies that the search should be started from a sulphur atom, which
is contained in all molecules. <tt>example1.smiles</tt> obviously is
the input file with the database to process.
<tt>example1.sub</tt> is
the name of the output file to which the found substructures are to
be written, <tt>example1.ids</tt> the name of the output file to
which the lists of molecule identifiers are to be written.</p>
<p>Running the MoSS program from the graphical user interface or
invoking it with the above command line yields the following terminal
output:</p>
<pre>moss.Miner - molecular substructure miner (MoSS)
version 4.9 (2006.08.12)    (c) 2002-2006 Christian Borgelt / Tripos, Inc.
parsing seed description ... [1 atom(s), 0 bond(s)] done.
reading example1.smiles ... [6 (6+0) molecule(s)] done [0.0020s].
converting Kekule representations ... [0 molecule(s)] done [0.0010s].
marking bridges ... [6 molecule(s)] done [0.0010s].
masking atom and bond types ... [6 molecule(s)] done [0.0s].
preparing molecules ... [6 molecule(s)] done [0.0060s].
embedding the seed ... [6 (6+0) molecule(s)] done [0.0070s].</pre>
<p>After having loaded the data file, but before starting its actual
work, MoSS performs several preprocessing tasks, the progress of which
is reported as shown above. Then the actual search starts. Since we
specified the option <tt>-v</tt>, the search tree is printed to the
terminal:</p>
<pre>searching for substructures ...
S  abcdef (6)
   S-O  a2bcf (4)
      S(-O)-N  a2bc (3)
         S(-O)(-N)-C  a2b2c2 (3)
      S(-O)-C  a2b2c2f (4)
   S-N  abcde (5)
      S(-N)-C  ab2c2de (5)
         S(-N)-C-C  abd (3)
   S-C  ab2c2def (6)
      S(-C)=N  def (3)
      S-C-C  abd (3)
   S=N  def (3)
[6 substructure(s)] done [0.017s].</pre>
<p>Each line refers to one node of the search tree, with the indentation
indicating the level of the search tree the node is located on.
The line starts with a description of the substructure (which is
always in SMILES format), followed by a list of identifiers of the
molecules that contain the substructure.
If a molecule contains
the substructure more than once, the number of occurrences of the
substructure in the molecule is printed after the molecule identifier.
At the end of the line the number of molecules containing the fragment
is printed in parentheses.</p>
<p>After the MoSS program finishes the search, it prints the number
of found substructures as well as some statistics about the search
(in the terminal window):</p>
<pre>search statistics:
number of search tree nodes : 12
number of created fragments : 24
number of created embeddings: 91
insufficient support pruning: 12
perfect extension pruning   : 0
equivalent sibling pruning  : 0
canonical form pruning      : 0
ring order pruning          : 0
duplicate fragment pruning  : 0
non-closed fragments        : 6
fragments with open rings   : 0
auxiliary invalid fragments : 0
comparisons with repository : 0</pre>
<p>These statistics give an impression of the complexity of the
search and of how effective the different pruning methods are.</p>
<p>The first output file written by the above execution of the MoSS
program is the substructure file <tt>example1.sub</tt>.
It looks
like this:</p>
<pre>id,desc,atoms,bonds,s_abs,s_rel,c_abs,c_rel
1,S(-O)(-N)-C,4,3,3,50.0,0,0.0
2,S(-O)-C,3,2,4,66.66666666666666,0,0.0
3,S(-N)-C-C,4,3,3,50.0,0,0.0
4,S(-N)-C,3,2,5,83.33333333333334,0,0.0
5,S(-C)=N,3,2,3,50.0,0,0.0
6,S-C,2,1,6,100.0,0,0.0</pre>
<p>It lists the six closed substructures that were found, which are
depicted in the table below.</p>
<p><table border=1 cellpadding=4>
<tr><th>id</th><th>fragment</th><th>SMILES description</th></tr>
<tr><td>1</td><td><img src="ex_1.png"></td><td>CS(O)N</td></tr>
<tr><td>2</td><td><img src="ex_4.png"></td><td>CSO</td></tr>
<tr><td>3</td><td><img src="ex_2.png"></td><td>CCSN</td></tr>
<tr><td>4</td><td><img src="ex_3.png"></td><td>CSN</td></tr>
<tr><td>5</td><td><img src="ex_5.png"></td><td>CS=N</td></tr>
<tr><td>6</td><td><img src="ex_6.png"></td><td>CS</td></tr>
</table></p>
<p>Note that, for example, <tt>O-S-N</tt> is not reported, since
it is not closed (the superstructure <tt>C-S(-O)-N</tt> has the
same support).</p>
<p>The second output file (that is, <tt>example1.ids</tt>) lists the
molecules each found substructure is contained in. It looks like
this:</p>
<pre>1,a,b,c
2,a,b,c,f
3,a,b,d
4,a,b,c,d,e
5,d,e,f
6,a,b,c,d,e,f</pre>
<p>It tells us, for example, that substructure&nbsp;1 (that is,
<tt>S(-O)(-N)-C</tt>) is contained in the molecules <tt>a</tt>,
<tt>b</tt>, and <tt>c</tt> and that substructure&nbsp;5 (that is,
<tt>S(-C)=N</tt>) is contained in the molecules <tt>d</tt>,
<tt>e</tt>, and <tt>f</tt>.</p>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right>
    <a href="#top">back to the top</a>&nbsp;</td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr>
</table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="example2">Example&nbsp;2: Steroids Data</a></h3>
<p>The steroids data set consists of 17 molecules, each of which has
at least 4 rings.
These molecules, which are shown in the table below,
provide an excellent test data set for ring mining.</p>
<p><table border=1 cellpadding=4>
<tr><th>id</th><th>molecule</th>
    <th>SMILES description</th></tr>
<tr><td>a</td><td><img src="steroid_a.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)(O)C#C)C)O</td></tr>
<tr><td>b</td><td><img src="steroid_b.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)Br</td></tr>
<tr><td>c</td><td><img src="steroid_c.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)F</td></tr>
<tr><td>d</td><td><img src="steroid_d.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)(O)C#C)C)O)OC</td></tr>
<tr><td>e</td><td><img src="steroid_e.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)O)OC</td></tr>
<tr><td>f</td><td><img src="steroid_f.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)OC)OC</td></tr>
<tr><td>g</td><td><img src="steroid_g.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(C(C2)O)O)C)O)OC</td></tr>
<tr><td>h</td><td><img src="steroid_h.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(=O)CC2)C)O)OC</td></tr>
<tr><td>i</td><td><img src="steroid_i.png"></td>
    <td>c1:c(:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)OC)O</td></tr>
<tr><td>j</td><td><img src="steroid_j.png"></td>
    <td>c1:c:c(:c(:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)OC)O</td></tr>
<tr><td>k</td><td><img src="steroid_k.png"></td>
    <td>C1CC(CC2(CCC3C(C12C)CCC1(C23C(C(C1C1=COC(=O)C=C1)OC(=O)C)O2)C)O)O</td></tr>
<tr><td>l</td><td><img src="steroid_l.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)O</td></tr>
<tr><td>m</td><td><img src="steroid_m.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(CC2)O)C)OC</td></tr>
<tr><td>n</td><td><img src="steroid_n.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(C(C2)O)O)C)O</td></tr>
<tr><td>o</td><td><img src="steroid_o.png"></td>
    <td>c1:c:c(:c:c2:c:1C1C(CC2)C2C(CC1)(C(=O)CC2)C)O</td></tr>
<tr><td>p</td><td><img src="steroid_p.png"></td>
    <td>C1CC(=O)C=C2CCC3C(C12)CCC1(C3CCC1(OC(=O)C)C#C)C</td></tr>
<tr><td>q</td><td><img src="steroid_q.png"></td>
    <td>C1C(N2(C)CCCCC2)C(OC(=O)C)CC2C1(C)C1C(C3C(C)(CC1)C(C(N1(CCCCC1)C)C3)OC(=O)C)CC2</td></tr>
</table></p>
<p>This data set can be found on the
<a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/moss.html">download page</a> for the MoSS program and in the source package
(directory <tt>moss/data</tt>). It is available in both SMILES and
SLN format. We process this data set in two ways, both times using the
ring mining capabilities of the MoSS program.</p>
<p>In the graphical user interface we set in both cases the "Molecule
input file" in the first tab to <tt>steroids.smiles</tt> and the
"Substructure output file" to <tt>steroids.sub</tt>. In addition, we
set the minimal support in the focus (second tab) to 100% (that is,
substructures must be contained in all molecules), switch on full
ring extensions on the fourth tab, and specify a ring size range
from 5 to 6 bonds. All other fields keep their default setting.
The search is started by pressing the <tt>Run</tt> button.</p>
<p>Alternatively, we may use the command line</p>
<pre>java -classpath moss.jar moss.Miner -s100 -r5:6 -R "" steroids.smiles steroids.sub</pre>
<p>By this we try to find all substructures that are contained in all
of the molecules (minimum support = 100%, option <tt>-s</tt>). Rings
with 5 or 6 atoms (option <tt>-r</tt>) are marked, and extensions
leading into rings add the whole ring in one step, not just individual
bonds (option <tt>-R</tt>).
After the program terminates, the file
<tt>steroids.sub</tt> looks like this:</p>
<pre>1,O-C,2,1,17,100.0,0,0.0
2,C12(-C(-C-C-C-2)-C-C-C-C-1)-C,10,11,17,100.0,0,0.0</pre>
<p>These substructures are depicted in the table below.</p>
<p><table border=1 cellpadding=4>
<tr><th>id</th><th>fragment</th><th>SMILES description</th></tr>
<tr><td>1</td><td><img src="common_1.png"></td>
              <td>OC</td></tr>
<tr><td>2</td><td><img src="common_2.png"></td>
              <td>C12(C(CCC2)CCCC1)C</td></tr>
</table></p>
<p>It may be surprising that the second fragment consists only of
two rings, since there is a third 6-bond ring, which is attached to
the 6-bond ring in the fragment, and which looks (at first sight) the
same in all fragments. However, this is not the case, as is explained
below.</p>
<p>In a second run, we set (in addition to the settings described above)
"Ignore bond type" on the third tab of the graphical user interface to
"in rings" and run the program again.</p>
<p>Alternatively, we may use the command line</p>
<pre>java -classpath moss.jar moss.Miner -s100 -r5:6 -R -B "" steroids.smiles steroids.sub</pre>
<p>which differs from the first run only in the additional option
<tt>-B</tt>. It instructs the program to ignore the bond type within
(marked) rings. This call yields the output</p>