mainreadme.txt
Help file for Matlab programs of Geometrical Correlation Learning
(1) main file "MainGcLearn.m"
===> a rough introduction, for easy understanding of this program:
Geometric correlation information inherent in data is useful and important for data-analysis fields such as knowledge discovery and pattern recognition. Geometrical Correlation Learning (GcLearn) was developed to mine this geometric information.
GcLearn mines value correlations between the data of a key variable (i.e., values of neighboring data are close), and projects the 1-dimensional data of each variable onto a 2-dimensional curve manifold that represents the intrinsic correlations and geometric structure of the data. These curve manifolds are constructed according to the optimal relation found between the variables.
This program is designed to mine curve manifolds, and to show their significance by the example of exploring relational models between variables from data (i.e., regression models and predictions). The designed geometrical learning applies the geometric information to linear regression (GLR) and to support vector machine regression (ε-SVR with radial basis function kernels), and finds relational models.
Note: a phrase like "line 50" refers to line 50 in "MainGcLearn.m", unless another file is named.
(2) Why and when to use geometric information for your applications
The geometric information that represents the intrinsic correlations and geometric structure of data can make the relations between variables more obvious and stable, so it is useful when the relations and data are disturbed by noise or other factors.
Firstly, apply this program to your training data set and check whether GcLearn yields a much better solution. If yes, you can benefit from GcLearn and the geometric information. This is the recommended application case.
Another case can be tried when the solution is almost the same as that of GLR or ε-SVR. After running "mainGcLearnApp.m" (see next step ==>), check the figures plotting the response/target variable (by setting "plots=1;" in line 7 and setting a breakpoint in line 150 or 247):
if the predictions from GcLearn are close to their neighbors while the predictions from GLR or ε-SVR are far away from their neighbor data (i.e., bad predictions), you can still benefit from GcLearn, but maybe not much.
"mainGcLearnApp.m" (most codes are same as "MainGcLearn.m")
--- It is for new prediction tasks and correct answers are unknown ---
Secondly, open "mainGcLearnApp.m" if you learned from the first step that benefits are obtainable. In this file, name the training data "traindata.txt" (line 23) and the test data that needs predictions "testdata.txt" (line 24).
You need to specify the target variable in Z (line 25), e.g. Z=4 for the 4th variable and Z=nx for the last variable, and assign placeholder values like 1 for Z in "testdata.txt" so that the two files match, or let the program do it (see line 36 - 42). The prediction results are stored in 'Resulta' for GcLearn LR and 'Resultb' for GcLearn ε-SVR.
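A minimal sketch of this setup, assuming the test file lacks the target column (the variable names follow lines 23 - 25; the column insertion is only illustrative, since the program can also do it for you per lines 36 - 42):
  Xtrain = load('traindata.txt');   % rows = samples, columns = variables
  Ytest  = load('testdata.txt');
  Z = 4;                            % e.g. the 4th variable is the target
  if size(Ytest,2) == size(Xtrain,2) - 1
      % insert a placeholder column of 1s at position Z so the two files match
      Ytest = [Ytest(:,1:Z-1), ones(size(Ytest,1),1), Ytest(:,Z:end)];
  end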
Notes: good value correlations between data are crucial for GcLearn. If the data are few and weakly correlated (i.e., the data are far apart in space), the benefits from GcLearn will decrease or vanish. Therefore, more test data with good correlations will bring more benefits when you run "mainGcLearnApp.m" for prediction tasks (please run "MainGcLearn.m" if performance evaluation is required).
(3) Inputs: data file
Its rows denote observations/samples and its columns denote attributes/variables, the same as the benchmark data sets in the UCI repository (www.ics.uci.edu/~mlearn/MLRepository.html). The file must contain numeric data only; all values should be numeric.
When you use your own data file "myfile.txt", change "exm=1;" in line 6 to "exm=9;", then open "loaddata.m" and change line 38 to: " data=load('myfile.txt');"
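For example, a numeric matrix can be written to such a file directly (a minimal sketch; 'mydata' is an illustrative name):
  mydata = [1.2 3.4 0.7; 2.1 1.8 1.5];     % each row is one observation
  save('myfile.txt', 'mydata', '-ascii');  % plain numeric text, readable by load()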
If a data set already has correct value correlations, change line 29 from
" if exm>10 nosort=1; else nosort=0; end "
to " if exm>10 nosort=1; else nosort=1; end "
(4) Outputs: geometric information, prediction errors (PE) and result plotting
Curve manifolds---geometric information
The optimal curve manifolds "R" (returned in discrete-point form) are in line 126 for linear regression and in line 224 for ε-SVR (lines 100 and 200 in "mainGcLearnApp.m").
You can see the curve manifold of a response variable by setting "plots=1;" in line 17 and setting a breakpoint in line 150 (line 129 in "mainGcLearnApp.m").
The optimal curve manifolds in quadratic-curve form are "Q" in line 224, used in line 227 for ε-SVR. In "gclearngress.m", they are "bx" in line 11, used in line 26.
The curve manifolds are used as input instead of the original data, and are utilized by the geometric learning. However, you need to determine an optimal parameter "nu" in line 219 (also see (8) Choices ...). An example of the usage in lines 224 - 234:
  [R,Q,S,T] = datageometrize(XYsort(:,1:nx), nu3(ik), 0);  % projection
  % generic form: [R,Q,terminals,terminals] = datageometrize(original-data, parameter-nu, 0);
  Ycurve = R(Ylabel,:);            % points on curve manifolds
  S = [S; S(length(S))+1];
  R = sumcurve(Q, S, T, 'each');   % integral of Q; outputs integral values in unit length
  Xcurve = R(Xlabel,:);            % integral values of curve manifolds
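For a quick look at the returned discrete points, a minimal plotting sketch (assuming Ycurve holds one row per variable, as in the snippet above; the program's own plots are enabled with "plots=1;"):
  plot(1:size(Ycurve,2), Ycurve(1,:), '.-');   % discrete points of one curve manifold
  xlabel('correlation order'); ylabel('projected value');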
Average prediction errors (shown in the command window)
Prediction errors (PE) as mean squared errors are stored in 'Pea' for linear regression (row 2 of Pea) and GcLearn (row 5), and in 'Peb' for ε-SVR (row 2) and GcLearn (row 5). The proportional reductions of PE (x100%) by GcLearn are stored in 'Pra' for linear regression and 'Prb' for ε-SVR.
For simulated data sets, there are also deviations over 20 experimental repetitions, stored in 'deviata' for linear regression (row 2) and GcLearn (row 5), and in 'deviatb' for ε-SVR (row 2) and GcLearn (row 5).
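How the reported numbers relate, as a minimal sketch (illustrative variable names; the program stores them in Pea/Peb and Pra/Prb):
  pe_base = mean((Ztest - Zhat_base).^2);   % MSE of plain GLR or ε-SVR (row 2)
  pe_gc   = mean((Ztest - Zhat_gc).^2);     % MSE of GcLearn (row 5)
  pr      = 100*(pe_base - pe_gc)/pe_base;  % proportional reduction of PE (x100%)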
Result plotting:
For simulated data sets, there is a plot of average prediction errors for different noise levels (variances of the added white Gaussian noise from 0.2 to 2.2) and a plot of error bars for the 20 repeated experiments.
Prediction results
After running "mainGcLearnApp.m", the prediction results are stored in 'Resulta' for GcLearn LR and 'Resultb' for GcLearn ε-SVR.
(5) Mining value correlations between data or an approximate correlation order
This correlation information should be mined before constructing the curve manifolds. It is best to collect correlation information when the data are collected. If it is not available, the program mines it approximately, by sorting the values of a response/key variable in ascending order. The values used for sorting are the predictions from GLR or ε-SVR, which is controlled by "order=nx+1;" in line 68. The sorting is carried out in lines 107 and 206.
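A minimal sketch of this approximate ordering (illustrative names; the actual sorting is in lines 107 and 206):
  Zhat = [ones(size(Xtrain,1),1), Xtrain] * coef;  % GLR predictions (or use ε-SVR outputs)
  [~, order] = sort(Zhat);                         % ascending order of predicted values
  XYsort = [Xtrain, Ztrain];
  XYsort = XYsort(order, :);                       % neighboring rows now have close Z values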
(6) Main steps for GcLearn programs
--- 4) general linear regression (GLR) analysis --- from line 72
This part is the traditional statistical linear regression analysis, which finds the regression coefficients with "coef=Lxx\Lxz;" (Matlab's built-in "\" solver; see line 60 in "modeloutdata.m").
A linear relational model with its coefficients stored in 'coef' is found by "modeloutdata.m" (see line 77). Then the model is used to predict for a test set (see line 84).
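A minimal sketch of this step, assuming an intercept column is added (illustrative names, not the exact code of "modeloutdata.m"):
  A    = [ones(size(Xtrain,1),1), Xtrain];        % design matrix with intercept
  coef = A \ Ztrain;                              % "\" solves the least-squares problem
  Zhat = [ones(size(Xtest,1),1), Xtest] * coef;   % predictions for the test set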
--- 5-7) Mining geometric information and applying it to linear regression --- from line 91
At first, traditional linear regression gives the estimated values of the response variable Z (lines 84 and 93), and value correlations are mined by sorting these estimated values in line 107.
Then the optimal curve manifolds R are constructed by "gclearngress.m" in line 126; the optimal geometrization parameter is returned in 'ph', and the optimal relational model 'coef' is found accordingly.
Finally, the GcLearn model 'coef' is used for predictions in line 134.
--- 8-9) Mining geometric information and applying it to SVM regression --- from line 156
At first, the ε-SVR program from the well-known Libsvm software (www.csie.ntu.edu.tw/~cjlin/libsvm) gives the estimated values of variable Z (lines 185 and 192), which are sorted for value correlations in line 206.
Then the optimal 'nu' is found by 1-fold cross validation for constructing the optimal curve manifolds in line 219, and this 'nu' is used to project the data onto the optimal curve manifolds R in line 224.
Finally, the GcLearn model in line 231 is used for predictions in line 234.
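A minimal sketch of the nu search (the datageometrize call follows the signature shown in section (4); validate() is a hypothetical helper standing in for the program's cross-validation error):
  nus = 5:14;  besterr = inf;
  for k = 1:length(nus)
      [R,Q,S,T] = datageometrize(XYsort(:,1:nx), nus(k), 0);  % candidate projection
      err = validate(R);      % hypothetical: fit a model on R, return validation PE
      if err < besterr, besterr = err; nu = nus(k); end
  end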
(7) Hyperparameters for ε-SVR
These parameters are tuned (following the suggestion at www.csie.ntu.edu.tw/~cjlin/libsvm) by searching a 2-dimensional parameter space, from a loose grid (C=[2^-3, 2^-1, …, 2^11], γ=[2^-9, 2^-7, …, 2^5]) to a finer grid (C=[0.8a, 0.9a, …, 1.2a], γ=[0.8b, 0.9b, …, 1.2b]), where a and b are the best parameters found in the loose grid. The best parameters with minimal PE are selected by 3-fold cross validation for every run (see lines 167 - 172).
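A minimal sketch of this coarse-to-fine search, assuming the Libsvm Matlab interface is on the path (with '-v', svmtrain returns the cross-validation MSE for ε-SVR; variable names are illustrative):
  best = inf;  Cbest = 2^-3;  gbest = 2^-9;
  for logc = -3:2:11                   % loose grid
      for logg = -9:2:5
          opt = sprintf('-s 3 -t 2 -c %g -g %g -v 3 -q', 2^logc, 2^logg);
          mse = svmtrain(Ztrain, Xtrain, opt);
          if mse < best, best = mse; Cbest = 2^logc; gbest = 2^logg; end
      end
  end
  C = Cbest;  g = gbest;
  for fc = 0.8:0.1:1.2                 % finer grid around the loose-grid optimum
      for fg = 0.8:0.1:1.2
          opt = sprintf('-s 3 -t 2 -c %g -g %g -v 3 -q', fc*Cbest, fg*gbest);
          mse = svmtrain(Ztrain, Xtrain, opt);
          if mse < best, best = mse; C = fc*Cbest; g = fg*gbest; end
      end
  end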
(8) Choices for different designs or observations
You may make your own choices for different designs or observations in your applications in lines 16 to 26. For example,
# gm=1: run linear regression only; gm=3: run SVR only; gm=2: run both.
# repeats=20 sets the number of runs for each noise level; note that 20 repetitions are time-consuming, so you may set fewer.
# useNewParam=0 saves running time when you run the program again: it reuses the parameters saved from completed experiments (also set "%clear;" in line 5). However, it is recommended only for large data sets.
# randoms=1: the 1/kfold test set is randomly selected from the data instead of using fixed divisions (randoms=0).
# scale=1: scale data to zero mean & unit variance; scale=2: scale data to [-1,1] (see the sketch after this list).
# vmax='allcoef'; no other choice.
# if exm>10 nosort=1; means that there is no need to sort the data, since correct value correlations are available for the simulated data sets.
# The default parameter nu=[5:14] is an optimal search scope for most cases. We do not recommend adjusting it, unless you plan to study it by observing the plots of the response variable (by setting "plots=1;" and a breakpoint in line 150 or 247): you may enlarge it, e.g. to nus=[5:21], if the curve manifolds do not fit the training data well while the training data congregate well along the correlation order, and you may reduce it, e.g. to nus=[2:7], if the training data do not congregate well along the correlation order.
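A minimal sketch of the two scaling choices (a hypothetical helper, not the program's "datanormalize.m"):
  function Xs = scaledemo(X, scale)
  % scale==1: zero mean, unit variance per column; scale==2: map each column to [-1,1]
  if scale == 1
      Xs = (X - repmat(mean(X),size(X,1),1)) ./ repmat(std(X),size(X,1),1);
  else
      mn = repmat(min(X),size(X,1),1);  mx = repmat(max(X),size(X,1),1);
      Xs = 2*(X - mn)./(mx - mn) - 1;
  end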
(9) Useful tools
* data normalization: "datanormalize.m", e.g. line 41
* parameter tuning for ε-SVR: lines 167 - 172, "parameterlivsvm.m". The Matlab interface at (www.csie.ntu.edu.tw/~cjlin/libsvm) does not supply this tool.
* projection of 1-dimensional data onto a 2-dimensional curve manifold: "datageometrize.m", e.g. line 224.
* plotting error bars: "ploterrorbar.m", e.g. line 306.
(10) Demo data sets & analysis
<1> One of the two simulated data sets has the linear relation Z(t)= 2X(t)+Y(t)+2.0 between variables (see lines 8-11), generated in "datagenerate.m" (see line 50); the other has the non-linear relation Z(t)=4Y(t)+X(t)^2-3X(t)Y(t)+10, where t=1,2,3,...,300.
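A minimal sketch of the two relations (the input distributions and ranges here are illustrative assumptions; see "datagenerate.m" for the actual generation):
  n = 300;
  X = 4*rand(n,1);  Y = 4*rand(n,1);    % illustrative inputs
  e = sqrt(0.2)*randn(n,1);             % white Gaussian noise, variance 0.2 (up to 2.2)
  Z1 = 2*X + Y + 2.0 + e;               % linear relation
  Z2 = 4*Y + X.^2 - 3*X.*Y + 10 + e;    % non-linear relation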
For the demo data sets, the original benchmark data sets (pyrim, servo, cpu, autompg, ...) are from the UCI machine learning repository (www.ics.uci.edu/~mlearn/MLRepository.html).
<2> If we run the program for the data set "servo" with exm=2 (and gm=2, choosing both linear regression and ε-SVR), we obtain the following output in the command window (the contents are for validation or evaluation of GcLearn's performance):
===> running at: 1 total repeats: 1
---> runs/k-fold= 1 optimal nu= 13
---> runs/k-fold= 2 optimal nu= 8 ( 2nd fold validation )
...
Predicting the rise time of a servomechanism
==> Results for GcLearn linear regression
* Prediction errors (Pe) in mean squared errors:
(1st row for pure linear regression, 2nd row for GcLearn)
* Proportional reduction of Pe (Pr %) by GcLearn
Pe =
1.3171
0.9962
Pr =
24.3666
(this means a much better solution, a 24.4% improvement, from GcLearn)
==> Results for GcLearn SVM regression
* Prediction errors (Pe) in mean squared errors:
(1st row for SVR, 2nd row for GcLearn SVR)
* Proportional reduction of Pe (Pr %) by GcLearn
Pe =
0.4865
0.3761
Pr =
22.6958
(this means a much better solution, a 22.7% improvement, from GcLearn)
<3> If you need to run a new prediction task rather than validation, please refer to section (2) "Why and when to use ..." above and follow its steps. For example, suppose you have a training data set "traindata.txt" (correct answers known) and a data set "testdata.txt" (correct answers unknown) for predictions. Then:
Firstly, run "MainGcLearn.m" on "traindata.txt" (see line 44 of "dataload.m"): set "exm=8" in line 6 and run. The following appears in the command window:
== == ==> Example 8
===> running at: 1 total repeats: 1
---> runs/k-fold= 1 optimal nu= 11
---> runs/k-fold= 2 optimal nu= 14
...
assessing GcLearn for traindata.txt
==> Results for GcLearn SVM regression
* Prediction errors (Pe) in mean squared errors
(1st row for SVR, 2nd row for GcLearn SVR)
* Proportional reduction of Pe (Pr %) by GcLearn
Pe =
16.5056
15.2810
Pr =
7.4195
(this means better solutions from GcLearn)
Secondly, open "mainGcLearnApp.m" and set lines 23 & 24:
Xtrain=load('traindata.txt');
Ytest=load('testdata.txt');
and "Z=1;" in line 29 (Z indicates a target variable).
Then, run it, you may find the prediction results in 'Resulta' for GcLearn linear gresssion and 'Resultb' for GcLearn ε-SVR.