{smcl}
{* 07apr2005}{...}
{cmd:help measure_option}
{hline}

{title:Title}

{p2colset 5 28 30 2}{...}
{p2col:{hi:[MV] {it:measure_option}} {hline 2}}Option for similarity and
dissimilarity measures{p_end}
{p2colreset}{...}


{title:Syntax}

	{it:command} ...{cmd:,} ... {opt mea:sure(measure)} ...

    or

	{it:command} ...{cmd:,} ... {it:measure} ...


INCLUDE help measure_option_optstab


{title:Description}

{pstd}
Several commands allow the specification of a similarity or dissimilarity
measure; see {helpb cluster}, {helpb mds}, and {helpb matrix dissimilarity}.
The available {it:measure}s are in either similarity or dissimilarity
form, depending on which is more common and natural for each measure.
These options are documented here.  Most analysis
commands (e.g., {cmd:cluster} and {cmd:mds}) transform
similarity measures to dissimilarity measures as needed.
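
{pstd}
For illustration, here is a minimal sketch of the two syntax forms.  The
{cmd:auto} dataset and the variables {cmd:price}, {cmd:mpg}, and {cmd:weight}
are placeholders only, and the {cmd:matrix dissimilarity} call assumes that
the measure name may be given directly as an option, as in the second syntax
form above.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. cluster singlelinkage price mpg weight, measure(L1)}{p_end}
{phang2}{cmd:. matrix dissimilarity D = price mpg weight in 1/5, L1}{p_end}
{phang2}{cmd:. matrix list D}{p_end}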


{title:Options}

{pstd}
Measures are divided into those for continuous data and those for binary data.
{it:measure} is not case sensitive.  Full definitions are presented
in the {hi:Similarity and dissimilarity measures for continuous data} and
{hi:Similarity measures for binary data} sections.

{pstd}
Most often the similarity or dissimilarity measure is used to determine the
similarity or dissimilarity between observations.  However, in some cases it
is the similarity or dissimilarity between variables that is of interest.
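
{pstd}
The following sketch of the between-variables case assumes that
{helpb matrix dissimilarity} accepts a {cmd:variables} option to compute the
measure between variables rather than between observations; the dataset and
variables are arbitrary placeholders.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. matrix dissimilarity C = price mpg weight, variables correlation}{p_end}
{phang2}{cmd:. matrix list C}{p_end}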


    {title:Similarity and dissimilarity measures for continuous data}

{pstd}
Here are the similarity and dissimilarity measures for continuous data
available in Stata.  The formulas in this section are for similarity or
dissimilarity between observations (not variables).  All summations and
maximums use subscript a or b and are over the p variables (not the N
observations).  x_iv denotes the value of observation i for variable v.  See
{hi:[MV] {it:measure_option}} to also see the formulas for the similarity and
dissimilarity measures between variables (not presented here).


{phang}
{opt L2} (aliases {opt Euclidean} and {cmd:L(2)}){break}
requests the Minkowski distance metric with argument 2

{center:sqrt(sum((x_ia - x_ja)^2))}

{pmore}
{opt L2} is best known as Euclidean distance and is the default
dissimilarity measure for {helpb mds}, {helpb matrix dissimilarity}, and
all the {helpb cluster} subcommands except for {cmd:centroidlinkage},
{cmd:medianlinkage}, and {cmd:wardslinkage}, which default to using
{cmd:L2squared}.

{phang}
{opt L2squared}{break}
requests the square of the Minkowski distance metric with argument 2

{center:sum((x_ia - x_ja)^2)}

{pmore}
{opt L2squared} is best known as squared Euclidean distance and is the
default dissimilarity measure for the {cmd:centroidlinkage},
{cmd:medianlinkage}, and {cmd:wardslinkage} subcommands of {helpb cluster}.
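
{pmore}
For example, the first two commands below should be equivalent because
{opt L2} is the default for {cmd:averagelinkage}, whereas
{cmd:wardslinkage} defaults to {opt L2squared}; the dataset and the
{opt name()} labels are placeholders only.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. cluster averagelinkage price mpg weight, name(avg_default)}{p_end}
{phang2}{cmd:. cluster averagelinkage price mpg weight, measure(L2) name(avg_L2)}{p_end}
{phang2}{cmd:. cluster wardslinkage price mpg weight, name(ward_default)}{p_end}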

{phang}
{opt L1} (aliases {opt absolute}, {opt cityblock}, {opt manhattan}, and
{cmd:L(1)}){break}
requests the Minkowski distance metric with argument 1

{center:sum(|x_ia - x_ja|)}

{pmore}
which is best known as absolute-value distance.

{phang}
{opt Linfinity} (alias {opt maximum}){break}
requests the Minkowski distance metric with infinite argument

{center:max(|x_ia - x_ja|)}

{pmore}
and is best known as maximum-value distance.

{phang}
{opt L(#)}{break}
requests the Minkowski distance metric with argument {it:#}:

{center:(sum(|x_ia - x_ja|^{it:#}))^(1/{it:#})     {it:#} >= 1}

{pmore}
We discourage the use of extremely large values for {it:#}.  Since the
absolute value of the difference is being raised to the value of {it:#},
depending on the nature of your data, you could experience numeric overflow or
underflow.  With a large value of {it:#}, the {opt L()} option will produce
similar results to the {opt Linfinity} option.  Use the numerically more
stable {opt Linfinity} option instead of a large value for {it:#} in the
{opt L()} option.

{phang}
{opt Lpower(#)}{break}
requests the Minkowski distance metric with argument {it:#}, raised to the
{it:#} power:

{center:sum(|x_ia - x_ja|^{it:#})     {it:#} >= 1}

{pmore}
As with {opt L(#)}, we discourage the use of extremely large values for
{it:#}; see the discussion above.
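
{pmore}
As a small sketch (the dataset, variables, and {opt name()} labels are
placeholders), the following requests the Minkowski metric with argument 3
in both its plain and its power form:

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. cluster completelinkage price mpg weight, measure(L(3)) name(mink3)}{p_end}
{phang2}{cmd:. cluster completelinkage price mpg weight, measure(Lpower(3)) name(mink3pow)}{p_end}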

{phang}
{opt Canberra}{break}
requests the following distance metric

{center:sum(|x_ia - x_ja|/(|x_ia|+|x_ja|))}

{pmore}
which ranges from 0 to p, the number of variables.  The Canberra distance is
very sensitive to small changes near zero.

{phang}
{opt correlation}{break}
requests the correlation coefficient similarity measure

{center:sum((x_ia-xbar_i.)(x_ja-xbar_j.))}
{center:{hline 50}}
{center:sqrt(sum((x_ia-xbar_i.)^2) * sum((x_jb-xbar_j.)^2))}

{pmore}
where xbar_i. = sum(x_ia)/p.

{pmore}
The correlation similarity measure takes values between -1 and 1.  With this
measure, the relative direction of the two vectors is important.
The correlation similarity measure is related to the angular separation
similarity measure (described next).  The correlation similarity measure gives
the cosine of the angle between the two vectors measured from the mean.

{phang}
{opt angular} (alias {opt angle}){break}
requests the angular separation similarity measure

{center:sum(x_ia * x_ja)/sqrt(sum(x_ia^2) * sum(x_jb^2))}

{pmore}
which is the cosine of the angle between the two vectors measured
from zero and takes values from -1 to 1.
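
{pmore}
The sketch below assumes that {helpb matrix dissimilarity} accepts
{opt correlation} and {opt angular} directly as options and returns the
corresponding similarity matrices; restricting to the first few observations
keeps the matrices small.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. matrix dissimilarity R = price mpg weight in 1/5, correlation}{p_end}
{phang2}{cmd:. matrix dissimilarity A = price mpg weight in 1/5, angular}{p_end}
{phang2}{cmd:. matrix list R}{p_end}
{phang2}{cmd:. matrix list A}{p_end}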


    {title:Similarity measures for binary data}

{pstd}
Similarity measures for binary data are based on the four values from the
cross-tabulation of observation i and j (when comparing observations) or
variables u and v (when comparing variables).  Restricting our discussion
to measures between observations, the cross-tabulation is

{center:       {c |} obs. j}
{center:       {c |}  1  0 }
{center:{hline 7}{c +}{hline 7}}
{center:obs. 1 {c |}  a  b }
{center: i   0 {c |}  c  d }

{pstd}
a is the number of variables where observations i and j both had ones, and d
is the number of variables where observations i and j both had zeros.  The
number of variables where observation i is one and observation j is zero is b,
and the number of variables where observation i is zero and observation j is
one is c.

{pstd}
See {hi:[MV] {it:measure_option}} to see a similar table for comparison between
variables.

{pstd}
Stata treats any nonzero value as a one when a binary value is expected.
Specifying one of the binary similarity measures imposes this behavior unless
some other option overrides it (for instance, the {opt allbinary} option of
{helpb matrix dissimilarity}).  See {hi:[MV] {it:measure_option}} for a
discussion of binary similarity measures applied to averages.

{pstd}
The following binary similarity coefficients are available.  Unless stated
otherwise, the similarity measures range from 0 to 1.
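
{pstd}
For a concrete sketch, the indicator variables below are hypothetical
recodings of the {cmd:auto} data; any nonzero value would be treated as a
one, as noted above.  The {cmd:matrix dissimilarity} call assumes that the
binary measure name may be given directly as an option.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. generate byte heavy = weight > 3000}{p_end}
{phang2}{cmd:. generate byte pricey = price > 6000}{p_end}
{phang2}{cmd:. generate byte thirsty = mpg < 20}{p_end}
{phang2}{cmd:. matrix dissimilarity S = heavy pricey thirsty in 1/5, matching}{p_end}
{phang2}{cmd:. matrix list S}{p_end}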

{phang}
{opt matching}{break}
requests the simple matching binary similarity coefficient

{center:(a+d)/(a+b+c+d)}

{pmore}
which is the proportion of matches between the two vectors (observations or
variables).

{phang}
{opt Jaccard}{break}
requests the Jaccard binary similarity coefficient

{center:a/(a+b+c)}

{pmore}
which is the proportion of matches when at least one of the observations had
a one.  If both vectors are all zeros, this measure is
undefined.  In this case, Stata declares the answer to be one, meaning perfect
agreement.  This is a reasonable choice for most applications and will cause
an all-zero vector to have similarity of one only with another all-zero
vector.  In all other cases, an all-zero vector will have Jaccard
similarity of zero to the other vector.

{phang}
{opt Russell}{break}
requests the Russell & Rao binary similarity coefficient

{center:a/(a+b+c+d)}

{phang}
{opt Hamann}{break}
requests the Hamann binary similarity coefficient

{center:((a+d)-(b+c))/(a+b+c+d)}

{pmore}
which is the number of agreements minus disagreements divided by the total.
The Hamann coefficient ranges from -1, perfect disagreement, to 1, perfect
agreement.  The Hamann coefficient is equal to twice the simple matching
coefficient minus 1.

{phang}
{opt Dice}{break}
requests the Dice binary similarity coefficient

{center:2a/(2a+b+c)}

{pmore}
The Dice coefficient is similar
to the Jaccard similarity coefficient, but gives twice the weight to
agreements.  Like the Jaccard coefficient, the Dice coefficient is declared by
Stata to be one if both vectors are all zero, thus avoiding the case when
the formula is undefined.

{phang}
{opt antiDice}{break}
requests the following binary similarity coefficient

{center:a/(a+2(b+c))}

{pmore}
The name {opt antiDice} is our creation.  This coefficient takes the opposite
view from the Dice coefficient and gives double weight to disagreements.  As
with the Jaccard and Dice coefficients, the antiDice coefficient is declared
to be one if both observations are all zeros.

{phang}
{opt Sneath}{break}
requests the Sneath & Sokal binary similarity coefficient

{center:2(a+d)/(2(a+d)+(b+c))}

{pmore}
which is similar to the simple matching coefficient, but gives double weight
to matches.  Also compare the Sneath & Sokal coefficient to the Dice
coefficient, which differs only in whether it includes d.

{phang}
{opt Rogers}{break}
requests the Rogers & Tanimoto binary similarity coefficient

{center:(a+d)/((a+d)+2(b+c))}

{pmore}
which takes the opposite approach from the Sneath & Sokal coefficient and gives
double weight to disagreements.  Also compare the Rogers & Tanimoto coefficient
to the antiDice coefficient, which differs only in whether it includes d.

{phang}
{opt Ochiai}{break}
requests the Ochiai binary similarity coefficient

{center:a/sqrt((a+b)(a+c))}

{pmore}
The formula for the Ochiai coefficient is undefined when one, or both, of the
vectors being compared is all zeros.  If both are all zeros, Stata
declares the measure to be one, and if only one of the two vectors is
all zeros, the measure is declared to be zero.

{phang}
{opt Yule}{break}
requests the Yule binary similarity coefficient

{center:(ad-bc)/(ad+bc)}

{pmore}
which ranges from -1 to 1.  The formula for the Yule coefficient is undefined
when one or both of the vectors are either all zeros or all ones.  Stata
declares the measure to be 1 when b+c = 0, meaning there is complete agreement.
Stata declares the measure to be -1 when a+d = 0, meaning there is complete
disagreement.  Otherwise, if ad-bc = 0, Stata declares the measure to be 0.
These rules, applied before using the Yule formula, avoid the cases where the
formula would produce an undefined result.

{phang}
{opt Anderberg}{break}
requests the Anderberg binary similarity coefficient

{center:(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4}

{pmore}
The Anderberg coefficient is undefined when one or both vectors are either
all zeros or all ones.  This difficulty is overcome by first applying the rule
that if both vectors are all ones (or both vectors are all zeros),
then the similarity measure is declared to be one.  Otherwise, if any of the
marginal totals (a+b, a+c, c+d, b+d) are zero, then the similarity measure is
declared to be zero.

{phang}
{opt Kulczynski}{break}
requests the Kulczynski binary similarity coefficient

{center:(a/(a+b) + a/(a+c))/2}

{pmore}
The formula for this measure is undefined when one or both of the vectors
are all zeros.  If both vectors are all zeros, Stata declares the
similarity measure to be one.  If only one of the vectors is all zeros,
the similarity measure is declared to be zero.

{phang}
{opt Pearson}{break}
requests Pearson's phi binary similarity coefficient

{center:(ad-bc)/sqrt((a+b)(a+c)(d+b)(d+c))}

{pmore}
which ranges from -1 to 1.  The formula for this coefficient is undefined when
one or both of the vectors are either all zeros or all ones.  Stata
declares the measure to be 1 when b+c = 0, meaning there is complete
agreement.  Stata declares the measure to be -1 when a+d = 0, meaning there
is complete disagreement.  Otherwise, if ad-bc = 0, Stata declares the measure
to be 0.  These rules, applied before using Pearson's phi coefficient formula,
avoid the cases where the formula would produce an undefined result.

{phang}
{opt Gower2}{break}
requests the following binary similarity coefficient

{center:ad/sqrt((a+b)(a+c)(d+b)(d+c))}

{pmore}
Stata uses the name
{opt Gower2} to avoid confusion with the better known Gower coefficient (not
currently in Stata), which is used to combine continuous and categorical
similarity or dissimilarity measures computed on a dataset into one measure.

{pmore}
The formula for this similarity measure is undefined when one or both of the
vectors are all zeros or all ones.  This is overcome by first applying
the rule that if both vectors are all ones (or both vectors are all
zeros), then the similarity measure is declared to be one.  Otherwise, if
ad = 0, then the similarity measure is declared to be zero.


{title:Also see}

{psee}
Manual:  {hi:[MV] {it:measure_option}}

{psee}
Online:  {helpb cluster},
{helpb mds};
{helpb matrix dissimilarity};
{helpb parse_dissim}
{p_end}
