📄 cluster.c

📁 hmmer源程序
💻 C
📖 第 1 页 / 共 2 页
字号:
12 下一页
/***************************************************************** * HMMER - Biological sequence analysis with profile HMMs * Copyright (C) 1992-1999 Washington University School of Medicine * All Rights Reserved *  *     This source code is distributed under the terms of the *     GNU General Public License. See the files COPYING and LICENSE *     for details. *****************************************************************//* cluster.c * SRE, Sun Jul 18 09:49:47 1993 * moved to squid Thu Mar  3 08:42:57 1994 * RCS $Id: cluster.c,v 1.3 1999/07/15 22:32:16 eddy Exp $ * * almost identical to bord.c, from fd * also now contains routines for constructing difference matrices * from alignments *  * "branch ordering": Input a symmetric or upper-right-diagonal  * NxN difference matrix (usually constructed by pairwise alignment  * and similarity calculations for N sequences). Use the simple  * cluster analysis part of the Fitch/Margoliash tree-building algorithm * (as described by Fitch and Margoliash 1967 as well as Feng * and Doolittle 1987) to calculate the topology of an "evolutionary * tree" consistent with the difference matrix. Returns an array * which represents the tree. *  * The input difference matrix is just an NxN matrix of floats. * A good match is a small difference score (the algorithm is going * to search for minima among the difference scores). The original difference * matrix remains unchanged by the calculations. *  * The output requires some explanation. A phylogenetic * tree is a binary tree, with N "leaves" and N-1 "nodes". The * topology of the tree may be completely described by N-1 structures * containing two pointers; each pointer points to either a leaf * or another node. Here, this is implemented with integer indices * rather than pointers. An array of N-1 pairs of ints is returned. * If the index is in the range (0..N-1), it is a "leaf" -- the * number of one of the sequences. If the index is in the range * (N..2N-2), it is another "node" -- (index-N) is the index * of the node in the returned array. *  * If both indices of a member of the returned array point to * nodes, the tree is "compound": composed of more than one * cluster of related sequences. *  * The higher-numbered elements of the returned array were the * first constructed, and hence represent the distal tips * of the tree -- the most similar sequences. The root * is node 0. ****************************************************************** * * Algorithm *  * INITIALIZATIONS: *  - copy the difference matrix (otherwise the caller's copy would *       get destroyed by the operations of this algorithm). If *       it's asymmetric, make it symmetric. *  - make a (0..N-1) array of ints to keep track of the indices in *       the difference matrix as they get swapped around. Initialize *       this matrix to 0..N-1. *  - make a (0..N-2) array of int[2] to store the results (the tree *       topology). Doesn't need to be initialized. *  - keep track of a "N'", the current size of the difference *       matrix being operated on. * * PROCESSING THE DIFFERENCE MATRIX: *  - for N' = N down to N' = 2  (N-1 steps): *    - in the half-diagonal N'xN' matrix, find the indices i,j at which *      there's the minimum difference score *       *     Store the results: *    - at position N'-2 of the result array, store coords[i] and  *         coords[j]. *     *     Move i,j rows, cols to the outside edges of the matrix: *    - swap row i and row N'-2 *    - swap row j and row N'-1    *    - swap column i and column N'-2 *    - swap column j and column N'-1 *    - swap indices i, N'-2 in the index array *    - swap indices j, N'-1 in the index array *     *     Build a average difference score for differences to i,j: *    - for all columns, find avg difference between rows i and j and store in row i:  *       row[i][col] = (row[i][col] + row[j][col]) / 2.0 *    - copy the contents of row i to column i (it's a symmetric *       matrix, no need to recalculate) *    - store an index N'+N-2 at position N'-2 of the index array: means *       that this row/column is now a node rather than a leaf, and *       contains minimum values *        *     Continue: *    - go to the next N' *     * GARBAGE COLLECTION & RETURN. *  ********************************************************************** * * References: *  * Feng D-F and R.F. Doolittle. "Progressive sequence alignment as a *    prerequisite to correct phylogenetic trees." J. Mol. Evol.  *    25:351-360, 1987. *     * Fitch W.M. and Margoliash E. "Construction of phylogenetic trees." *    Science 155:279-284, 1967. *     ********************************************************************** * * SRE, 18 March 1992 (bord.c) * SRE, Sun Jul 18 09:52:14 1993 (cluster.c) * added to squid Thu Mar  3 09:13:56 1994 ********************************************************************** * Mon May  4 09:47:02 1992: keep track of difference scores at each node */#include <stdio.h>#include <string.h>#include <math.h>#include "squid.h"#include "sqfuncs.h"#ifdef MEMDEBUG#include "dbmalloc.h"#endif/* Function: Cluster() *  * Purpose:  Cluster analysis on a distance matrix. Constructs a *           phylogenetic tree which contains the topology *           and info for each node: branch lengths, how many *           sequences are included under the node, and which *           sequences are included under the node. *            * Args:     dmx     - the NxN distance matrix ( >= 0.0, larger means more diverged) *           N       - size of mx (number of sequences) *           mode    - CLUSTER_MEAN, CLUSTER_MAX, or CLUSTER_MIN *           ret_tree- RETURN: the tree  * * Return:   1 on success, 0 on failure.           *           The caller is responsible for freeing the tree's memory, *           by calling FreePhylo(tree, N). */intCluster(float **dmx, int N, enum clust_strategy mode, struct phylo_s **ret_tree){  struct phylo_s *tree;         /* (0..N-2) phylogenetic tree          */  float    **mx;                /* copy of difference matrix           */  int       *coord;             /* (0..N-1), indices for matrix coords */  int        i, j;		/* coords of minimum difference        */  int        idx;		/* counter over seqs                   */  int        Np;                /* N', a working copy of N             */  int        row, col;          /* loop variables                      */  float      min;		/* best minimum score found            */  float     *trow;              /* tmp pointer for swapping rows       */  float      tcol;              /* tmp storage for swapping cols       */  float     *diff;		/* (0..N-2) difference scores at nodes */  int        swapfoo;		/* for SWAP() macro                    */  /**************************   * Initializations.   **************************/  /* We destroy the matrix we work on, so make a copy of dmx.   */  mx = MallocOrDie (sizeof(float *) * N);  for (i = 0; i < N; i++)    {      mx[i] = MallocOrDie (sizeof(float) * N);      for (j = 0; j < N; j++)	mx[i][j] = dmx[i][j];    }				/* coord array alloc, (0..N-1) */  coord = MallocOrDie (N     * sizeof(int));  diff  = MallocOrDie ((N-1) * sizeof(float));				/* init the coord array to 0..N-1 */  for (col = 0; col < N; col++)  coord[col] = col;  for (i = 0; i < N-1; i++)      diff[i] = 0.0;				/* tree array alloc, (0..N-2) */  if ((tree = AllocPhylo(N)) == NULL)  Die("AllocPhylo() failed");  /*********************************   * Process the difference matrix   *********************************/  				/* N-prime, for an NxN down to a 2x2 diffmx */  j= 0;				/* just to silence gcc uninit warnings */  for (Np = N; Np >= 2; Np--)    {				/* find a minimum on the N'xN' matrix*/      min = 999999.;      for (row = 0; row < Np; row++)	for (col = row+1; col < Np; col++)	  if (mx[row][col] < min)	    {	      min = mx[row][col];	      i   = row;	      j   = col;	    }      /* We're clustering row i with col j. write necessary       * data into a node on the tree       */				/* topology info */      tree[Np-2].left  = coord[i];      tree[Np-2].right = coord[j];      if (coord[i] >= N) tree[coord[i]-N].parent = N + Np - 2;      if (coord[j] >= N) tree[coord[j]-N].parent = N + Np - 2;				/* keep score info */      diff[Np-2] = tree[Np-2].diff = min;				/* way-simple branch length estimation */      tree[Np-2].lblen = tree[Np-2].rblen = min;      if (coord[i] >= N) tree[Np-2].lblen -= diff[coord[i]-N];      if (coord[j] >= N) tree[Np-2].rblen -= diff[coord[j]-N];				/* number seqs included at node */      if (coord[i] < N) 	{	  tree[Np-2].incnum ++;	  tree[Np-2].is_in[coord[i]] = 1;	}      else 	{	  tree[Np-2].incnum += tree[coord[i]-N].incnum;	  for (idx = 0; idx < N; idx++)	    tree[Np-2].is_in[idx] |= tree[coord[i]-N].is_in[idx];	}            if (coord[j] < N) 	{	  tree[Np-2].incnum ++;	  tree[Np-2].is_in[coord[j]] = 1;	}      else 	{	  tree[Np-2].incnum += tree[coord[j]-N].incnum;	  for (idx = 0; idx < N; idx++)	    tree[Np-2].is_in[idx] |= tree[coord[j]-N].is_in[idx];	}      /* Now build a new matrix, by merging row i with row j and       * column i with column j; see Fitch and Margoliash       */				/* Row and column swapping. */				/* watch out for swapping i, j away: */      if (i == Np-1 || j == Np-2)	SWAP(i,j);      if (i != Np-2)	{				/* swap row i, row N'-2 */	  trow = mx[Np-2]; mx[Np-2] = mx[i]; mx[i] = trow;				/* swap col i, col N'-2 */	  for (row = 0; row < Np; row++) 	    {	      tcol = mx[row][Np-2];	      mx[row][Np-2] = mx[row][i];	      mx[row][i] = tcol;	    }				/* swap coord i, coord N'-2 */	  SWAP(coord[i], coord[Np-2]);	}      if (j != Np-1)	{
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -