📄 setgscweights.cpp
字号:
/***
Gerstein/Sonnhammer/Chothia ad hoc sequence weighting.
The algorithm was deduced by reverse-engineering the
HMMer code.
I used an alternative representation that I prefer over
HMMer's. The HMMer code is full of tree manipulations
that do something to the left child and then the equivalent
thing to the right child. It was clear that there must be
a re-formulation that does everything once for each node,
which would reduce the number of operations expressed
in the code by a factor of two. This gives a more elegant
and less error-prone way to code it.
These notes explain the correspondence between my design
and Eddy's.
HMMer stores a data structure phylo_s for each non-leaf
node in the cluster tree. This structure contains the
following fields:
diff Weight of the node
lblen Left branch length
rblen Right branch length
The lblen and rblen branch lengths are calculated as:
this.lblen = this.diff - left.diff
this.rblen = this.diff - right.diff
My code stores one ClusterNode data structure per node
in the cluster tree, including leaves. I store only the
weight. I can recover the HMMer branch length fields
in a trivial O(1) calculation as follows:
lblen = Node.GetWeight() - Node.GetLeft()->GetWeight()
rblen = Node.GetWeight() - Node.GetRight()->GetWeight()
For the GSC weights calculation, HMMer constructs the
following vectors, which have entries for all nodes,
including leaves:
lwt Left weight
rwt Right weight
The "left weight" is calculated as the sum of the weights in
all the nodes reachable through the left branch, including
the node itself. (This is not immediately obvious from the
code, which does the calculation using branch lengths rather
than weights, but this is an equivalent, and to my mind clearer,
statement of what they are). Similarly, the "right weight" is
the sum of all weights reachable via the right branch. I define
the "cluster weight" to be the summed weight of all nodes in the
subtree under the node, including the node itself. I provide
a function Node.GetClusterWeight() which calculates the cluster
weight using a O(ln N) recursion through the tree. The lwt and
rwt values can be recovered as follows:
lwt = Node.GetLeft()->GetClusterWeight()
+ Node.GetWeight()
lwt = Node.GetLeft()->GetClusterWeight()
+ Node.GetWeight()
HMMer calculates a further vector fwt as follows.
this.fwt = parent.fwt * parent.lwt / (parent.lwt + parent.rwt)
This applies to nodes reached via a left branch, for nodes reached
via a right branch:
this.fwt = parent.fwt * parent.rwt / (parent.lwt + parent.rwt)
The values of fwt at the leaf nodes are the final GSC weights.
We derive the various terms using our equivalents.
parent.lwt = Parent.GetLeft()->GetClusterWeight()
+ Parent.GetWeight()
parent.rwt = Parent.GetRight()->GetClusterWeight()
+ Parent.GetWeight()
parent.lwt + parent.rwt =
{ Parent.GetLeft()->GetClusterWeight()
+ Parent.GetRight()->GetClusterWeight()
+ Parent.GetWeight() }
+ Parent.GetWeight()
We recognize the term {...} as the cluster weight of the
parent, so
parent.lwt + parent.rwt
= Parent.GetClusterWeight()
+ Parent.GetWeight()
As you would expect, repeating this exercise for parent.rwt gives
exactly the same expression.
The GSC weights (fwt) are stored in the Weight2 field of the cluster
tree, the Weight field stores the original (BLOSUM) weights used
as input to this algorithm.
***/
#include "muscle.h"
#include "msa.h"
#include "cluster.h"
#include "distfunc.h"
// Set weights of all sequences in the subtree under given node.
void MSA::SetSubtreeWeight2(const ClusterNode *ptrNode) const
{
if (0 == ptrNode)
return;
const ClusterNode *ptrRight = ptrNode->GetRight();
const ClusterNode *ptrLeft = ptrNode->GetLeft();
// If leaf, set weight
if (0 == ptrRight && 0 == ptrLeft)
{
unsigned uIndex = ptrNode->GetIndex();
double dWeight = ptrNode->GetWeight2();
WEIGHT w = DoubleToWeight(dWeight);
m_Weights[uIndex] = w;
return;
}
// Otherwise, recursively set subtrees
SetSubtreeWeight2(ptrLeft);
SetSubtreeWeight2(ptrRight);
}
void MSA::SetSubtreeGSCWeight(ClusterNode *ptrNode) const
{
if (0 == ptrNode)
return;
ClusterNode *ptrParent = ptrNode->GetParent();
double dParentWeight2 = ptrParent->GetWeight2();
double dParentClusterWeight = ptrParent->GetClusterWeight();
if (0.0 == dParentClusterWeight)
{
double dThisClusterSize = ptrNode->GetClusterSize();
double dParentClusterSize = ptrParent->GetClusterSize();
double dWeight2 =
dParentWeight2*dThisClusterSize/dParentClusterSize;
ptrNode->SetWeight2(dWeight2);
}
else
{
// Could cache cluster weights for better performance.
// We calculate cluster weight of each node twice, so this
// would give x2 improvement.
// As weighting is not very expensive, we don't care.
double dThisClusterWeight = ptrNode->GetClusterWeight();
double dParentWeight = ptrParent->GetWeight();
double dNum = dThisClusterWeight + dParentWeight;
double dDenom = dParentClusterWeight + dParentWeight;
double dWeight2 = dParentWeight2*(dNum/dDenom);
ptrNode->SetWeight2(dWeight2);
}
SetSubtreeGSCWeight(ptrNode->GetLeft());
SetSubtreeGSCWeight(ptrNode->GetRight());
}
void MSA::SetGSCWeights() const
{
ClusterTree CT;
CalcBLOSUMWeights(CT);
// Calculate weights and store in tree.
ClusterNode *ptrRoot = CT.GetRoot();
ptrRoot->SetWeight2(1.0);
SetSubtreeGSCWeight(ptrRoot->GetLeft());
SetSubtreeGSCWeight(ptrRoot->GetRight());
// Copy weights from tree to MSA.
SetSubtreeWeight2(ptrRoot);
}
void MSA::ListWeights() const
{
const unsigned uSeqCount = GetSeqCount();
Log("Weights:\n");
WEIGHT wTotal = 0;
for (unsigned n = 0; n < uSeqCount; ++n)
{
wTotal += GetSeqWeight(n);
Log("%6.3f %s\n", GetSeqWeight(n), GetSeqName(n));
}
Log("Total weights = %6.3f, should be 1.0\n", wTotal);
}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -