    regression model will reduce our errors in classifying the dependent by 80% 
    compared to classifying the dependent by always guessing a case is to be 
    classed the same as the most frequent category of the dichotomous dependent. 
    Lambda-p is an adjustment to classic lambda to assure that the coefficient 
    will be positive when the model helps and negative when, as is possible, the 
    model actually leads to worse predictions than simple guessing based on the 
    most frequent class. Lambda-p varies from 1 to (1 - N), where N is the 
    number of cases. Lambda-p = (f - e)/f, where f is the smallest row frequency 
    (smallest row marginal in the classification table) and e is the number of 
    errors (the 1,0 and 0,1 cells in the classification table). 
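As a quick illustration, lambda-p can be computed directly from a 2x2 classification table. This is a minimal sketch with made-up cell counts, not SPSS output:

```python
# Hypothetical 2x2 classification table:
# rows = observed class, columns = predicted class.
table = [[50, 10],   # observed 0: predicted 0, predicted 1
         [15, 25]]   # observed 1: predicted 0, predicted 1

f = min(sum(row) for row in table)   # smallest row marginal
e = table[0][1] + table[1][0]        # errors: the 0,1 and 1,0 cells
lambda_p = (f - e) / f               # (40 - 25) / 40 = 0.375
```

Here the model's errors (25) are fewer than the errors from always guessing the smaller class (40), so lambda-p is positive.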
    <P><A name=taup></A></P>
    <LI><B>Tau-p</B> is an alternative measure of association. When the 
    classification table has equal marginal distributions, tau-p varies from -1 
    to +1, but otherwise may be less than 1. Negative values mean the logistic 
    model does worse than expected by chance. Tau-p can be lower than lambda-p 
    because it penalizes proportional reduction in error for non-random 
    distribution of errors (that is, it wants an equal number of errors in each 
    of the error quadrants in the table). 
    <P><A name=phip></A></P>
    <LI><B>Phi-p</B> is a third alternative discussed by Menard (pp. 29-30) but 
    is not part of SPSS output. Phi-p varies from -1 to +1 for tables with equal 
    marginal distributions. 
    <P><A name=binomial></A></P>
    <LI><B>Binomial d</B> is a significance test for any of these measures of 
    association, though in each case the number of "errors" is defined 
    differently (see Menard, pp. 30-31). 
    <P><A name=separation></A></P>
    <LI><B>Separation:</B> Note that when the independents completely predict 
    the dependent, the error quadrants in the classification table will contain 
    0's, which is called <I>complete separation</I>. When this is nearly the 
    case, as when the error quadrants have only one case, this is called 
    <I>quasicomplete separation</I>. When separation occurs, one will get very 
    large logit coefficients with very high standard errors. While separation 
    may indicate powerful and valid prediction, often it is a sign of a problem 
    with the independents, such as definitional overlap between the indicators 
    for the independent and dependent variables. </LI></OL>
  <P><A name=cstat></A></P>
  <LI>The <B>c statistic</B> is a measure of the discriminatory power of the 
  logistic equation. It varies from .5 (the model's predictions are no better 
  than chance) to 1.0 (the model always assigns higher probabilities to correct 
  cases than to incorrect cases). Thus c is the proportion of all possible 
  pairs of cases in which the model assigns a higher probability to a correct 
  case than to an incorrect case. The c statistic is not part of SPSS output 
  but may be calculated using the COMPUTE facility, as described in the SPSS 
  manual's chapter on logistic regression. 
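The pairwise definition of c is easy to sketch directly. The data below are hypothetical, and ties (which conventionally count as half) are ignored for simplicity:

```python
y    = [1, 1, 0, 0, 0]            # observed outcomes
phat = [0.9, 0.4, 0.3, 0.6, 0.2]  # model's predicted probabilities

# Form every (observed 1, observed 0) pair; c is the share of pairs
# in which the actual 1 receives the higher predicted probability.
pairs = [(p1, p0) for p1, y1 in zip(phat, y) if y1 == 1
                  for p0, y0 in zip(phat, y) if y0 == 0]
c = sum(p1 > p0 for p1, p0 in pairs) / len(pairs)   # 5 of 6 pairs concordant
```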
  <P><A name=classplot></A></P>
  <LI>The <B>classplot or histogram of predicted probabilities</B>, also called 
  the "plot of observed groups and predicted probabilities," is part of SPSS 
  output, and is an alternative way of assessing correct and incorrect 
  predictions under logistic regression. The X axis is the predicted 
  probability from 0.0 to 1.0 of the dependent being classified "1". The Y axis 
  is frequency: the number of cases classified. Inside the plot are columns of 
  observed 1's and 0's. Thus a column with one "1" and five "0's" set at p = .25 
  would mean that six cases were predicted to be "1's" with a probability of 
  .25, and thus were classified as "0's." Of these, five actually were "0's" 
  but one (an error) was a "1" on the dependent variable. Examining this plot 
  will tell such things as how well the model classifies difficult cases (ones 
  near p = .5). 
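A rough text version of such a classplot can be produced by binning cases on their predicted probability and stacking their observed labels (the cases below are invented for illustration):

```python
y    = [0, 0, 1, 0, 1, 1, 0, 1]                          # observed 0/1
phat = [0.22, 0.25, 0.27, 0.45, 0.55, 0.81, 0.48, 0.90]  # predicted p(1)

# Group cases into tenth-wide probability bins.
bins = {}
for p, yi in zip(phat, y):
    bins.setdefault(int(p * 10), []).append(yi)

# Print one column of observed labels per bin, low p on the left.
for b in range(10):
    column = "".join(str(v) for v in bins.get(b, []))
    print(f"p = {b / 10:.1f}-{(b + 1) / 10:.1f} | {column}")
```

Columns of mixed 0's and 1's near p = .5 flag the difficult, error-prone cases the text describes.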
  <P><A name=LL></A></P>
  <LI><B>Log likelihood</B>: A "likelihood" is a probability, specifically the 
  probability that the observed values of the dependent may be predicted from 
  the observed values of the independents. Like any probability, the likelihood 
  varies from 0 to 1. The log likelihood (LL) is its log and varies from 0 to 
  minus infinity (it is negative because the log of any number less than 1 is 
  negative). LL is calculated through <I>iteration</I>, using a maximum 
  likelihood method. 
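For a handful of cases, the likelihood and its log can be written out directly. This sketch uses invented predicted probabilities and shows only the evaluation of LL, not the iterative estimation itself:

```python
import math

y    = [1, 0, 1, 1, 0]            # observed outcomes
phat = [0.8, 0.3, 0.6, 0.9, 0.2]  # predicted probabilities of a "1"

# Likelihood = product over cases of p if y = 1, (1 - p) if y = 0.
likelihood = math.prod(p if yi == 1 else 1 - p for p, yi in zip(phat, y))
ll = math.log(likelihood)   # between 0 and minus infinity
neg2ll = -2 * ll            # the -2LL used in the chi-square tests below
```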
  <P>
  <OL><A name=deviance></A>
    <LI><B>Deviance</B>. Because -2LL has approximately a chi-square 
    distribution, -2LL can be used for assessing the significance of logistic 
    regression, analogous to the use of the sum of squared errors in OLS 
    regression. The -2LL statistic is the "scaled deviance" statistic for 
    logistic regression and is also called "deviation chi-square," D<FONT 
    size=-2>M</FONT>, L-square, or "badness of fit." Deviance reflects error 
    associated with the model even after the independents are included in the 
    model. It thus has to do with the significance of the <U>unexplained</U> 
    variance in the dependent. One wants -2LL <U>not</U> to be significant. That 
    is, Significance(-2LL) should be worse than (greater than) .05. SPSS calls 
    this "-2 Log Likelihood" in the chi-square output column. There are a 
    number of variants: 
    <P><A name=initial></A></P>
    <LI><B>Initial chi-square</B>, also called D<FONT size=-2>O</FONT> or 
    deviance for the null model, reflects the error associated with the model 
    when only the intercept is included in the model. D<FONT size=-2>O</FONT> 
    is -2LL for the model which includes only the intercept. That is, initial 
    chi-square is -2LL for the model which accepts the null hypothesis that all 
    the b coefficients are 0. SPSS calls this the "initial log likelihood 
    function -2 log likelihood". 
    <P><A name=badness></A>
    <P><A name=modelchi></A></P>
    <LI><B>Model chi-square</B> is also known as G<FONT size=-2>M</FONT>, 
    Hosmer and Lemeshow's G, -2LL<SUB>difference</SUB>, or just "goodness of 
    fit." Model chi-square functions as a significance test, like the F test of 
    the OLS regression model or the Likelihood Ratio (G<SUP>2</SUP>) test in <A 
    href="http://www2.chass.ncsu.edu/garson/pa765/logit.htm#ratio">loglinear 
    analysis</A>. Model chi-square provides the usual significance test for a 
    logistic model. Model chi-square tests the null hypothesis that <U>none</U> 
    of the independents are linearly related to the log odds of the dependent. 
    That is, model chi-square tests the null hypothesis that all population 
    logistic regression coefficients except the constant are zero. It is thus 
    an overall model test which does not assure that <U>every</U> independent 
    is significant. 
    <P>Model chi-square is computed as -2LL for the null (initial) model minus 
    -2LL for the researcher's model. The null model, also called the initial 
    model, is logit(p) = the constant. Degrees of freedom equal the number of 
    terms in the model minus the constant (this is the same as the difference 
    in the number of terms between the two models, since the null model has 
    only one term). Model chi-square measures the improvement in fit that the 
    explanatory variables make compared to the null model. Note SPSS calls -2LL 
    for the null model "Initial Log Likelihood". 
    <P>Model chi-square is a likelihood ratio test which reflects the 
    difference between error not knowing the independents (initial chi-square) 
    and error when the independents are included in the model (deviance). Thus, 
    model chi-square = initial chi-square - deviance. Model chi-square follows 
    a chi-square distribution (unlike deviance) with degrees of freedom equal 
    to the difference in the number of parameters in the examined model 
    compared to the model with only the intercept. Model chi-square is the 
    denominator in the formula for R<FONT size=-2>L</FONT>-square (see below). 
    When the probability of model chi-square is less than or equal to .05, we 
    reject the null hypothesis that knowing the independents makes no 
    difference in predicting the dependent in logistic regression. Thus we want 
    model chi-square to be significant at the .05 level or better. 
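The arithmetic is just a subtraction of the two -2LL values. The log likelihoods below are hypothetical, and 7.815 is the standard .05 critical chi-square value for 3 degrees of freedom:

```python
ll_null  = -120.5   # LL of the intercept-only (null) model
ll_model = -101.2   # LL of a model with 3 predictors plus the constant

initial_chisq = -2 * ll_null    # deviance of the null model
deviance      = -2 * ll_model   # deviance of the fitted model
model_chisq   = initial_chisq - deviance
df = 3                          # terms in the model minus the constant

significant = model_chisq > 7.815   # .05 critical value for 3 df
```

Since 38.6 far exceeds the critical value, the independents jointly improve prediction over the null model.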
    <P><I>Block chi-square</I> is a likelihood ratio test also printed by SPSS, 
    representing the change in model chi-square due to entering a block of 
    variables. <I>Step chi-square</I> is the change in model chi-square due to 
    each step in <A 
    href="http://www2.chass.ncsu.edu/garson/pa765/logistic.htm#stepwise">stepwise 
    logistic regression</A>. Earlier versions of SPSS referred to these as 
    "improvement chi-square." If variables are added one at a time, then block 
    and step chi-square will be equal, of course. <I>Note on categorical 
    variables: </I>block chi-square is used to test the effect of entering a 
    categorical variable. In such a case, all dummy variables associated with 
    the categorical variable are entered as a block. The resulting block 
    chi-square value is considered more reliable than the Wald test, which can 
    be misleading for large effects in finite samples. 
    <P>These are alternatives to model chi-square for significance testing of 
    logistic regression: 
    <P>
    <UL><A name=goodness></A>
      <LI><B>Goodness of Fit</B>, also known as <I>Hosmer and Lemeshow's 
      Goodness of Fit Index</I> or <I>C-hat</I>, is an alternative to model 
      chi-square for assessing the significance of a logistic regression 
      model. Menard (p. 21) notes it may be better when the number of 
      combinations of values of the independents is approximately equal to the 
      number of cases under analysis. This measure was included in SPSS output 
      as "Goodness of Fit" prior to Release 10. However, it was removed from 
      the reformatted output for SPSS Release 10 because, as noted by David 
      Nichols, senior statistician for SPSS, it "is done on individual cases 
      and does not follow a known distribution under the null hypothesis that 
      the data were generated by the fitted model, so it's not of any real 
      use" (SPSSX-L listserv message, 3 Dec. 1999). 
      <P><A name=Hosmer></A></P>
      <LI><B>Hosmer and Lemeshow's Goodness of Fit Test</B>, not to be 
      confused with ordinary Goodness of Fit above, tests the null hypothesis 
      that the data were generated by the model fitted by the researcher. The 
      test divides subjects into deciles based on predicted probabilities, 
      then computes a chi-square from observed and expected frequencies. Then 
      a probability (p) value is computed from the chi-square distribution 
      with 8 degrees of freedom to test the fit of the logistic model. If the 
      significance of the Hosmer and Lemeshow Goodness-of-Fit statistic is .05 
      or less, we reject the null hypothesis that there is no difference 
      between the observed and model-predicted values of the dependent. (This 
      means the model predicts values significantly different from the 
      observed values.) If the significance of the H-L goodness-of-fit 
      statistic is greater than .05, as we want, we fail to reject the null 
      hypothesis that there is no difference, implying that the model's 
      estimates fit the data at an acceptable level. This does not mean that 
      the model necessarily explains much of the variance in the dependent, 
      only that however much or little it does explain is significant. As with 
      other tests, as the sample size gets larger, the H-L test's power to 
      detect differences from the null hypothesis improves. 
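The decile construction can be sketched as follows. The data are simulated, and real software also handles ties and sparse deciles more carefully than this:

```python
import random

random.seed(1)
# Simulated predicted probabilities and outcomes generated from them,
# so the fitted model is (by construction) the true model.
phat = sorted(random.random() for _ in range(100))
y = [1 if random.random() < p else 0 for p in phat]

hl = 0.0
for i in range(10):                     # ten deciles of 10 cases each
    grp_p = phat[i * 10:(i + 1) * 10]
    grp_y = y[i * 10:(i + 1) * 10]
    obs1, exp1 = sum(grp_y), sum(grp_p)     # observed vs. expected 1's
    obs0, exp0 = 10 - obs1, 10 - exp1       # observed vs. expected 0's
    hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
# hl is then referred to a chi-square distribution with 10 - 2 = 8 df.
```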
      <P><A name=score></A></P>
      <LI><B>The Score statistic</B> is another alternative similar in 
      function to G<FONT size=-2>M</FONT> and is part of SAS's PROC LOGISTIC 
      output. 
      <P><A name=aic></A></P>
      <LI><B>The Akaike Information Criterion, AIC,</B> is another alternative 
      similar in function to G<FONT size=-2>M</FONT> and is part of SAS's PROC 
      LOGISTIC output. 
      <P><A name=Schwartz></A></P>
      <LI><B>The Schwarz criterion</B> is a modified version of AIC and is 
      part of SAS's PROC LOGISTIC output. 
      <P></P></LI></UL></LI></OL>
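Both criteria above are simple penalized versions of -2LL. The values below are hypothetical; k counts the estimated parameters (including the constant) and n the number of cases:

```python
import math

ll, k, n = -101.2, 4, 200        # hypothetical LL, parameters, sample size
aic = -2 * ll + 2 * k            # Akaike Information Criterion
sc  = -2 * ll + k * math.log(n)  # Schwarz criterion (BIC)
```

The Schwarz criterion's log(n) penalty grows with sample size, so it favors more parsimonious models than AIC.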
  <P><A name=rsquared></A></P>
