echo on
% GAREADME Release notes for GENALG in PLS_Toolbox 1.5
% September 2, 1995
%
% The function GENALG uses a genetic algorithm to select variables
% for multivariate calibration models. The criterion that the
% genetic algorithm uses is the predictive ability of the models,
% measured by the cross-validation PRESS (predictive residual sum
% of squares).
%
% The I/O syntax of GENALG is: genalg(xdat,ydat,outfit,outpop);
% where xdat is the matrix of predictor variables, ydat is the
% vector of the predicted variable, and outfit and outpop are text
% strings naming the variables under which the final population
% fitness and selected variables, respectively, are saved. Typically
% GENALG would be started by executing a statement like:
% »genalg(x,y,'fitness','selectedvariables');
%
% Note that GENALG does not scale the variables, so you will
% want to do the scaling before the modeling begins.
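%
% A minimal sketch of one way to do this scaling, using only basic
% MATLAB functions. Autoscaling (mean zero, unit variance) and mean
% centering of y are assumed here as typical choices, not GENALG
% requirements, and the names xs, ys, mx and sx are illustrative.
% Columns with zero standard deviation should be removed first:
%   m = size(x,1);
%   mx = mean(x);                            % column means
%   sx = std(x);                             % column standard deviations
%   xs = (x - ones(m,1)*mx)./(ones(m,1)*sx); % autoscaled predictors
%   ys = y - mean(y);                        % mean-centered response
%   genalg(xs,ys,'fitness','selectedvariables');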
%
% Once GENALG is started you can select the parameters of the
% optimization using sliders and radio buttons. The controls and
% their functions are listed below:
% 
% Population Size: the number of binary strings representing potential
% solutions to the variable selection problem at any one time. As the
% population increases the number of solutions considered will be
% greater, and the likelihood of finding a global optimum will increase.
% However, the time for finding a solution will also increase.
%
% Window Width: the number of original variables that will be grouped 
% together and selected or omitted as a group. This is particularly
% useful when it is known that adjacent variables are correlated, such
% as in spectroscopy. For instance, if a spectroscopic problem has
% 2000 variables, you may want to window the spectra into 50 windows
% of 40 variables each. Windowing helps avoid overfitting and 
% speeds the progression to a solution. 
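%
% The grouping can be pictured with the following sketch, which is
% an illustration and not GENALG's internal code. The names nvars,
% width, winsel, varsel and cols are hypothetical, and nvars is
% assumed to be an exact multiple of width:
%   nvars = 2000; width = 40;
%   nwin = nvars/width;                    % 50 windows
%   winsel = zeros(1,nwin);                % one bit per window
%   winsel([3 17]) = [1 1];                % include windows 3 and 17
%   varsel = kron(winsel,ones(1,width));   % one bit per variable
%   cols = find(varsel);                   % columns 81:120 and 641:680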
%
% Percent Initial Terms: percentage of variables to be included (on
% average) in initial population. Can be set to a low value if 
% it is suspected that not very many variables are actually useful
% for prediction. Also, note that the routine runs fastest when
% not many terms are included. 
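%
% A sketch of how such an initial population could be drawn. This
% is an assumption about the idea, not GENALG's actual code, and
% popsize, nwin, pct and pop are hypothetical names:
%   popsize = 64; nwin = 50; pct = 30;
%   pop = rand(popsize,nwin) < pct/100;    % each bit is 1 with probability 0.3
%   nterms = sum(pop,2);                   % terms included in each member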
%
% Maximum Generations: maximum number of generations of the GA before 
% the function stops and evaluates the final population.
%
% Percent at Convergence: percentage of the population of solutions
% that are identical at convergence. For example, if this parameter
% is set to 50%, when 50% of the population consists of duplicates
% the algorithm will stop and evaluate the final population.
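%
% One plausible way to measure this, shown only as a sketch since
% the exact counting used by GENALG is not described here. The
% population pop is a made-up example, and unique(...,'rows') needs
% a MATLAB version that supports it:
%   pop = [1 0 1; 1 0 1; 0 1 1; 1 0 1];
%   ndup = size(pop,1) - size(unique(pop,'rows'),1);   % 2 duplicates
%   pctdup = 100*ndup/size(pop,1);                     % 50 percent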
%
% Mutation Rate: the likelihood that a bit in a solution string will
% flip spontaneously from 0 to 1 or vice versa. A high mutation rate
% will cause the algorithm to search a wide area, but the algorithm
% may not converge if the rate is kept too high.
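%
% The bit flipping can be sketched as follows. This illustrates
% the idea only; rate, pop and flip are hypothetical names and the
% values are arbitrary:
%   rate = 0.005;
%   pop = rand(64,50) < 0.3;               % example population of 0/1 strings
%   flip = rand(size(pop)) < rate;         % bits chosen to mutate
%   pop = xor(pop,flip);                   % 0 becomes 1 and 1 becomes 0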
%
% Crossover: single or double crossover may be selected. It has been
% shown that double crossover makes the solution more independent of
% the position of the variables in the data record.
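%
% The two schemes can be illustrated with two parent strings. The
% crossover points are fixed here for clarity, whereas a GA would
% normally pick them at random; a, b, c1 and c2 are hypothetical names:
%   a = ones(1,8); b = zeros(1,8);
%   c1 = [a(1:3) b(4:8)];                  % single: 1 1 1 0 0 0 0 0
%   c2 = [a(1:3) b(4:6) a(7:8)];           % double: 1 1 1 0 0 0 1 1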
% 
% Regression Choice: may be set to MLR or PLS. If you have lots of samples
% and not very many variables, you may want to select the variables
% that work best for prediction with MLR. If you have the opposite
% case, you will probably want to use PLS.
%
% Number of Latent Variables: visible only when PLS is selected. Chooses
% the maximum number of latent variables to consider. Should generally
% be set on the low side of the number of factors you expect in the
% final model.
%
% Cross-Validation Parameters: the cross-validation can be set to use
% random re-ordering of the data or to keep the data in contiguous
% blocks. The latter choice is most appropriate when time series
% data is being considered.
%
% Number of Subsets: selects the number of fractions into which the
% data is divided. A good rule of thumb is to use sqrt(n), where
% n is the number of samples, or 10, whichever is smaller. If the
% data is a time series, choose the number of subsets so that the
% time spanned by the samples in a subset is longer than the process
% settling time.
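%
% A sketch of this rule of thumb and of the two ways of assigning
% samples to subsets. Rounding sqrt(n) is an assumption, since the
% rule does not say how to handle non-integer values, and n, nsub,
% blk and rblk are hypothetical names:
%   n = size(x,1);
%   nsub = min(round(sqrt(n)),10);
%   blk = ceil((1:n)*nsub/n);              % contiguous blocks 1..nsub
%   [junk,ord] = sort(rand(1,n));
%   rblk = blk(ord);                       % same subset sizes, random order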
% 
% Number of Iterations: the number of times the cross-validation is
% repeated after re-ordering the data set each generation. Initially
% a low value for this number (1-3) is desirable.
%
% When all the parameters are set, hit the execute button. GENALG
% will display the statistics on the population at the end of each
% generation. A window will appear with four plots. The upper left
% plot shows the fitness versus number of variables or windows included
% for each member of the population. Note that the fitness scale is
% the PRESS, so a smaller fitness value is better. The upper right plot
% shows the average and best fitness for each generation as the GA
% progresses. The average, of course, is never better than the best,
% though they will get closer together as the run progresses. Note
% that the fitness may not improve steadily because the test sets
% are changed at each generation, so there is a random element to
% the plot. The lower left plot tracks the average number of
% variables or windows included in the population as a function of
% generation. The lower right plot shows which variables or windows
% are being included in the solutions: it gives the number of
% members of the population that include each variable or window.
% The number of duplicates and the average and best fitness at each
% generation are also displayed in the command window.
%
% The function GENALG can be stopped in the middle of execution and
% some of the parameters can be reset. This is done by pressing the
% Stop button until the function responds by making the button
% disappear. The function will then complete the current generation
% and stop.
% 
% Once the function is stopped, the Maximum Generations, Percent
% at Convergence, Mutation Rate, Crossover, Number of Latent Variables,
% Number of Subsets and Number of Iterations can all be changed.
% Once the new values are selected, hit Resume. A good way to use
% this feature is to start the function out with a high mutation rate
% and with few subsets and not many iterations. This will cause the
% algorithm to search a large space. Once the population has
% started to converge, the mutation rate can be decreased and the
% number of iterations increased. This will promote convergence.
%
% The function will stop and evaluate the final population when either
% the Percent at Convergence or the Maximum Generations criterion is
% reached. The function will evaluate the final members of the
% population using 3 times the number of iterations specified for
% the bulk of the run. The function outputs the fitness PRESS for
% each model in the final population (starting with the best
% first) and gives the variables included in each model as a matrix
% of ones and zeros, with one meaning the variable is included.
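%
% For example, with the output names used in the statement above and
% assuming that each row of the matrix is one model (best first) and
% each column is one original variable, the best subset could be
% extracted as follows. The names best, xbest and bestpress are
% illustrative:
%   best = find(selectedvariables(1,:));   % indices of selected variables
%   xbest = x(:,best);                     % predictors for the best model
%   bestpress = fitness(1);                % its cross-validation PRESS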
%
% A word of caution is advised when using GAs. Note that the solutions
% are not deterministic: different runs will net different solutions.
% It is a good idea to use the function several times and compare the
% results. In your final model you may wish to include only the terms
% that appear consistently. When using PLS type models, it may be best
% to include terms that appear in a good fraction of the solutions and
% only use the GA to exclude terms, i.e. exclude only terms that 
% don't appear in any of the solutions.
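%
% A sketch of this comparison, assuming the selected-variable
% matrices from three runs were saved as sel1, sel2 and sel3
% (hypothetical names) with one row per model. The 0.8 threshold
% is arbitrary:
%   allsel = [sel1; sel2; sel3];
%   freq = mean(allsel);                   % fraction of models using each variable
%   consistent = find(freq >= 0.8);        % terms that appear consistently
%   keep = find(freq > 0);                 % PLS: drop only never-selected terms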
%
% Eigenvector Technologies would be interested in having your
% feedback on this routine. Please write to us at EVTech@delphi.com