📄 mass_spec_demo.html
字号:
<html xmlns:mwsh="http://www.mathworks.com/namespace/mcode/v1/syntaxhighlight.dtd"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <!--This HTML is auto-generated from an M-file.To make changes, update the M-file and republish this document. --> <title>Mass Spectrometry Data Analysis Demonstration</title> <meta name="generator" content="MATLAB 7.0"> <meta name="date" content="2004-06-25"> <meta name="m-file" content="mass_spec_demo"><style>body { background-color: white; margin:10px;}h1 { color: #990000; font-size: x-large;}h2 { color: #990000; font-size: medium;}p.footer { text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;}pre.codeinput { margin-left: 30px;}span.keyword {color: #0000FF}span.comment {color: #228B22}span.string {color: #A020F0}span.untermstring {color: #B20000}span.syscmd {color: #B28C00}pre.showbuttons { margin-left: 30px; border: solid black 2px; padding: 4px; background: #EBEFF3;}pre.codeoutput { color: gray; font-style: italic;}pre.error { color: red;}/* Make the text shrink to fit narrow windows, but not stretch too far in wide windows. On Gecko-based browsers, the shrink-to-fit doesn't work. */ p,h1,h2,div { /* for MATLAB's browser */ width: 600px; /* for Mozilla, but the "width" tag overrides it anyway */ max-width: 600px; /* for IE */ width:expression(document.body.clientWidth > 620 ? "600px": "auto" );} </style></head> <body> <h1>Mass Spectrometry Data Analysis Demonstration</h1> <h2>Robert Henson and Lucio Cetto, The MathWorks, Inc.</h2> <introduction> <p>This example demonstrates a number of ways to classify mass spectrometry data and shows some statistical tools that can be used to look for potential disease markers. The data in this example are from the FDA-NCI Clinical Proteomics Program Databank (<a href="http://ncifdaproteomics.com)">http://ncifdaproteomics.com)</a>. </p> <p>There are a number of datasets in the Clinical Proteomics Program Databank. This example will use the Ovarian Dataset 8-7-02 that was generated using the WCX2 protein array. The sample set includes 91 controls and 162 ovarian cancers. The data can be downloaded from <a href="http://ncifdaproteomics.com/download-ovar.php">http://ncifdaproteomics.com/download-ovar.php</a> . </p> </introduction> <h2>Contents</h2> <div> <ul> <li><a href="#1">Importing and visualizing some of the samples</a></li> <li><a href="#5">Importing all the samples</a></li> <li><a href="#7">Plotting multiple spectra</a></li> <li><a href="#10">Baseline correction</a></li> <li><a href="#11">Resampling the data</a></li> <li><a href="#15">Finding significant features</a></li> <li><a href="#19">Classification methods</a></li> <li><a href="#20">K-Nearest Neighbor classifier</a></li> <li><a href="#21">Linear Discriminant Analysis</a></li> <li><a href="#22">Variation on PCA/LDA classifier</a></li> <li><a href="#23">Neural Network Classifier</a></li> <li><a href="#25">Other ideas</a></li> </ul> </div> <h2>Importing and visualizing some of the samples<a name="1"></a></h2> <p>After downloading and uncompressing the file, the data are stored in comma separated value files in two directories, 'Control', and 'Ovarian Cancer'. Each file contains two columns, the M/Z values and the intensity values corresponding to these mass/charge ratios. The <b>importdata</b> function can be used to read in one of the samples: </p><pre class="codeinput">close <span class="string">all</span> <span class="string">force</span>;clear <span class="string">all</span>;cd <span class="string">Control</span>daf_0181 = importdata(<span class="string">'Control daf-0181.csv'</span>)</pre><pre class="codeoutput">daf_0181 = data: [15154x2 double] textdata: {'M/Z' 'Intensity'} colheaders: {'M/Z' 'Intensity'}</pre><p>The data values are stored in the data field of the daf_0181 structure. The <b>plot</b> command creates a graph of the data. </p><pre class="codeinput">plot(daf_0181.data(:,1),daf_0181.data(:,2))<span class="comment">% The column headers are in the colheaders field. These can be used for the</span><span class="comment">% X and Y axis labels.</span>xAxisLabel = daf_0181.colheaders{1};yAxisLabel = daf_0181.colheaders{2};xlabel(xAxisLabel);ylabel(yAxisLabel);<span class="comment">% The default X axis limits are a little loose, these can be made tighter</span><span class="comment">% using the axis XLim property.</span>xAxisLimits = [daf_0181.data(1,1),daf_0181.data(end,1)];set(gca,<span class="string">'xlim'</span>,xAxisLimits )</pre><img vspace="5" hspace="5" src="mass_spec_demo_01.png"> <p>The spectra for the cancer samples are in the 'Ovarian Cancer' directory</p><pre class="codeinput">cd <span class="string">../'Ovarian Cancer'</span>daf_0601 = importdata(<span class="string">'Ovarian Cancer daf-0601.csv'</span>)hold <span class="string">on</span>plot(daf_0601.data(:,1),daf_0601.data(:,2),<span class="string">'r'</span>)legend({<span class="string">'Control'</span>,<span class="string">'Ovarian Cancer'</span>});hold <span class="string">off</span></pre><pre class="codeoutput">daf_0601 = data: [15154x2 double] textdata: {'M/Z' 'Intensity'} colheaders: {'M/Z' 'Intensity'}</pre><img vspace="5" hspace="5" src="mass_spec_demo_02.png"> <p>The area around 7000 to 9500 M/Z has some interesting features.</p><pre class="codeinput">set(gca,<span class="string">'xlim'</span>,[6500,10000],<span class="string">'ylim'</span>,[0,50]);<span class="comment">% There are a couple of peaks that are more pronounced in the cancer sample</span><span class="comment">% and other peaks that are more pronounced in the cancer sample than the</span><span class="comment">% control. If you pan along the plot, you will see several such peaks.</span></pre><img vspace="5" hspace="5" src="mass_spec_demo_03.png"> <h2>Importing all the samples<a name="5"></a></h2> <p>There are 254 sample files. The <b>csvread</b> function is slightly more efficient than importdata for reading a large number of similar files. </p><pre class="codeinput">OC_files = dir(<span class="string">'*.csv'</span>);<span class="comment">% Preallocate some space for the data.</span>numOC = numel(OC_files);numValues = size(daf_0601.data,1);OC_IN = zeros(numValues,numOC);<span class="comment">% The M/Z values are constant across all the samples.</span>OC_MZ = daf_0601.data(:,1);<span class="comment">% Loop over the files and read in the data.</span><span class="keyword">for</span> i = 1:numOC OC_IN(:,i) = csvread(OC_files(i).name,1,1);<span class="keyword">end</span></pre><p>Repeat this for the control data.</p><pre class="codeinput">cd <span class="string">../Control</span>NH_files = dir(<span class="string">'*.csv'</span>);<span class="comment">% Preallocate some space for the data.</span>numNH = numel(NH_files);numValues = size(daf_0181.data,1);NH_IN = zeros(numValues,numNH);NH_MZ = daf_0181.data(:,1);<span class="comment">% Loop over the files and read in the data.</span><span class="keyword">for</span> i = 1:numNH NH_IN(:,i) = csvread(NH_files(i).name,1,1);<span class="keyword">end</span>cd <span class="string">..</span></pre><h2>Plotting multiple spectra<a name="7"></a></h2> <p>We can plot multipl spectra on the same figure</p><pre class="codeinput">figurehNH = plot(NH_MZ,NH_IN(:,1:5) ,<span class="string">'b'</span>);hold <span class="string">on</span>;hOC = plot(OC_MZ,OC_IN(:,1:5),<span class="string">'r'</span>);set(gca,<span class="string">'xlim'</span>,[daf_0181.data(1,1),daf_0181.data(end,1)])xlabel(xAxisLabel);ylabel(yAxisLabel);set(gca,<span class="string">'xlim'</span>,xAxisLimits )legend([hNH(1),hOC(1)],{<span class="string">'Control'</span>,<span class="string">'Ovarian Cancer'</span>})</pre><img vspace="5" hspace="5" src="mass_spec_demo_04.png"> <p>Zooming in on the region from 7000 to 9500 M/Z shows some peaks that may be useful for classifying the data.</p><pre class="codeinput">set(gca,<span class="string">'xlim'</span>,[6500,10000],<span class="string">'ylim'</span>,[0,50]);</pre><img vspace="5" hspace="5" src="mass_spec_demo_05.png"> <p>Another way to visualize the whole data set is to look at the average signal for the control and cancer samples. We can plot the mean +- one standard deviation. </p><pre class="codeinput">mean_NH = mean(NH_IN,2);std_NH = std(NH_IN,0,2);mean_OC = mean(OC_IN,2);std_OC = std(OC_IN,0,2);hFig = figure;hNHm = plot(NH_MZ,mean_NH,<span class="string">'b'</span>);hold <span class="string">on</span>hOCm = plot(OC_MZ,mean_OC,<span class="string">'r'</span>);plot(NH_MZ,mean_NH+std_NH,<span class="string">'b:'</span>)plot(NH_MZ,mean_NH-std_NH,<span class="string">'b:'</span>)plot(OC_MZ,mean_OC+std_OC,<span class="string">'r:'</span>)plot(OC_MZ,mean_OC-std_OC,<span class="string">'r:'</span>)set(gca,<span class="string">'xlim'</span>,[daf_0181.data(1,1),daf_0181.data(end,1)])xlabel(xAxisLabel);ylabel(yAxisLabel);set(gca,<span class="string">'xlim'</span>,xAxisLimits )legend([hNHm,hOCm],{<span class="string">'Control'</span>,<span class="string">'Ovarian Cancer'</span>})</pre><img vspace="5" hspace="5" src="mass_spec_demo_06.png"> <h2>Baseline correction<a name="10"></a></h2> <p>Notice that the intensity values are never close to zero. We can correct for this in a number of ways. Here we give an example of how to correct the baseline for the samples. We use a windowed piecewise cubic interpolation method. This does not use the Signal Processing Toolbox but there are several functions in the toolbox that give greater flexibility in baseline correction and resampling. </p><pre class="codeinput">D = [NH_IN OC_IN];ns = size(D,1); <span class="comment">% number of points</span>nC = size(OC_IN,2); <span class="comment">% number of samples with cancer</span>nH = size(NH_IN,2); <span class="comment">% number of healty samples</span>tn = size(D,2); <span class="comment">% total number of samples</span>w = 75; <span class="comment">% window size</span>temp = zeros(w,ceil(ns/w))+NaN;<span class="keyword">for</span> i=1:tn temp(1:ns)=D(:,i); [m,h]=min(temp); g = h>1 & h<w; h=w*[0:numel(h)-1]+h; m = m(g); h = h(g); D0(:,i) = [temp(1:ns)-interp1(h,m,1:ns,<span class="string">'pchip'</span>)]';<span class="keyword">end</span>figureplot(NH_MZ,D0(:,1:50:end))set(gca,<span class="string">'xlim'</span>,[daf_0181.data(1,1),daf_0181.data(end,1)])xlabel(xAxisLabel);ylabel(yAxisLabel);set(gca,<span class="string">'xlim'</span>,xAxisLimits )</pre><img vspace="5" hspace="5" src="mass_spec_demo_07.png"> <h2>Resampling the data<a name="11"></a></h2> <p>In this case, the data is already normalized so that all datasets have the same m/z values. If this was not the case, MATLAB could easily be used to resample the data. Here we show how to resample the data so that only integer m/z values are considered. </p><pre class="codeinput"> nts = floor(max(NH_MZ)); DR = zeros(nts,tn); <span class="comment">% resample data using linear interpolation</span> <span class="keyword">for</span> i=1:tn
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -