readtext.m

来自「读入一个文本数据」· M 代码 · 共 449 行 · 第 1/2 页
449 行
function      [data, result]= readtext(text, delimiter, comment, quotes, options)

%  Usage:     [data, result]= readtext(source, delimiter, comment, quotes, options)
% 
% Whatever text (file) you give it, readtext returns an array of the contents (or send me a 
%   bug report). Matlab can't read variable length lines or variable type values with the standard 
%   library. readtext can read any text file. Any string (or even regexp) can be delimiting, 
%   default is a comma. Everything after (and including) a comment character, until the line end, 
%   is ignored. Quote characters may also be given, everything between them is treated as one item. 
%   There are options to control what will be converted to numbers and how empty items are saved. 
% 
% If you find any errors, please let me know: peder at axensten dot se.
% 
% source:     the file to be read. May be a file path or just the file name.
%         OR: The text itself, see 'textsource', below. 
% 
% delimiter:  (default: ',') any non-empty string. May be a regexp, but this is slow on large files. 
% 
% comment:    (default: '') zero or one character. Anything after (and including) this character, 
%   until the end of the line, will be ignored. 
% 
% quotes:     (default: '') zero, one (opening quote equals closing), or two characters (opening 
%   and closing quote) to be treated as paired braces. Everything between the quotes will be 
%   treated as one item. The quotes will remain. Quotes may be nested.
% 
% options:    (default: '') may contain (concatenate combined options): 
% - 'textsource': source contains the actual text to be processed, not the file name.
% - 'textual': no numeric conversion ('data' is a cell array of strings only), 
% - 'numeric': everything is converted to a number or NaN ('data' is a numeric array, empty items 
%   are converted to NaNs unless 'empty2zero' is given), 
% - 'empty2zero': an empty field is saved as zero, and 
% - 'empty2NaN': an empty field is saved as NaN. 
% - 'usewaitbar': call waitbar to report progress. If you find the wait bar annoying, get 'waitbar 
%   alternative' at http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=11398
% 
% data:       A cell array containing the read text, divided into cells by delimiter and line 
%   endings. 'data' will be empty if the file is not found, could not be opened, or is empty. 
%   With the option 'numeric', 'data' will be a numeric array, with 'textual', 'data' will be a 
%   cell array of strings only, and otherwise it will be a mixed cell array. For Matlab < version 7, 
%   returned strings may contain leading white-space.
% 
% result:     a structure:
% .min: minimum number of columns found in a line.
% .max: number of columns in 'data', before removing empty columns.
% .rows: number of rows in 'data', before removing empty rows. 
% .numberMask: true, if numeric conversion ('NaN' converted to NaN counts).
% .number: number of numeric conversions ('NaN' converted to NaN counts).
% .emptyMask: true, if empty item in file.
% .empty: number of empty items in file.
% .stringMask: true, if non-number and non-empty.
% .string: number of non-number, non-empty items.
% .quote: number of quotes. 
% 
% EXAMPLE 1:  [a,b]= readtext('txtfile', '[,\t]', '#', '"', 'numeric-empty2zero')
% This will load the file 'txtfile' into variable a, treating any of tab or comma as
%   delimiters. Everything from and including # to the next newline will be ignored. 
%   Everything between two double quotes will be treated as a string. Everything will 
%   be converted to numbers and a numeric array returned. Non-numeric items will become 
%   NaNs and empty items are converted to zero. 
% 
% EXAMPLE 2:  a= readtext('The, actual, text, to, process', ',', '', '', 'textsource')
% This will process the actual text string, returning a cell string of the five words. 
% 
% COPYRIGHT (C) Peder Axensten (peder at axensten dot se), 2006-2007.

% INSPIRATION: loadcell.m (id 1965). The goal of readtext is to be at least as flexible (you be 
%   the judge) and quicker. On my test file (see below), readtext is about 3--4 times 
%   as quick, maybe even more on large files. In readtext you may use a regexp as 
%   delimiter and it can ignore comments in the text file. 
% 
% SPEED:      Reading a 1MB file (150000 items!) with 'numeric' takes about 100 seconds on a 
%   fairly slow system. Time scales approximately linearly with input file size. 
% - Conversion from string to numeric is slow (I can't do anything about this), but using the 
%   option 'textual' is a lot quicker (above case takes 12 seconds).
% - Using a regexp delimiter is slower (during initializing), it adds 250 seconds! 
% 
% HISTORY:
% Version 1.0, 2006-05-03.
% Version 1.1, 2006-05-07:
% - Made 'options' case independent. 
% - Removed result.errmess -- now use error(errmess) instead. 
% - Removed result.nan -- it was equivalent to result.string, so check this instead.
% - Added some rows', 'result' fields: 'numberMask', 'emptyMask', and 'stringMask' 
%   (see 'result:', above).
% - A few small bug fixes.
% Version 1.2, 2006-06-06:
% - Now works in Matlab 6.5.1 (R13SP1) (maybe 6.5 too), versions <6.5 will NOT work.
% Version 1.3, 2006-06-20:
% - Better error report when file open fails. 
% - Somewhat quicker. 
% - Recommends 'waitbar alternative'. Ok with Matlab orig. waitbar too, of course. 
% Version 1.4, 2006-07-14:
% - Close waitbar instead of deleting it, and some other minor waitbar compatibility fixes. 
% Version 1.5, 2006-08-13:
% - No more (La)TeX formatting of file names. 
% - Prefixed waitbar messages with '(readtext)'. 
% Version 1.6, 2006-10-02:
% - Better removal of comments. Could leave an empty first row before. 
% - Added a 'usewaitbar' option. 
% - Now removes empty last columns and rows. 
% Version 1.7, 2006-01-04:
% - Quicker in 'mixed' and 'numeric' modes. It's still the numeric conversion that's slow. 
% - Made newline handling more robust. 
% - Made numeric conversion more robust (now ignores leading space chars). 
% - Now emits an error if delimiter is empty.
% - Added some stuff to the help text.
% - Simplified code somewhat. 
% Version 1.8, 2007-03-08:
% - Fixed a problem when the comment character was a regexp character, such as '*'. 
% - Fixed a problem when reading files with no data. 
% Version 1.9, 2007-03-26:
% - Can now handle a string directly, not just files. See 'textsource', in help readtext.
% - Improved compatibility with Matlab 6.x regarding waitbar calls. 
% - Fixed a bug where readtext failed to recognize a quote if there was only one in the text.
% - Fixed a bug when processing source of delimiters only.
% - Help text: added an example, edited some in general.
% 
% TO DO:
% - Add result.quoteMask.
% - Add 'removeemptycolumns' and 'removeemptyrows' options. 
% - Use optional id/value pairs for options:
%   * 'Delimiter', <string>                  (default is ',')
%   * 'RegExpDelimiter', <regexp>            (default is '')
%   * 'CommentChar', <string>                (default is '')
%   * 'Quote', ''/<one char>/<two chars>     (default is '')
%   * 'Output', 'Mixed'/'Textual'/'Numeric'  (default is 'Mixed')
%   * 'EmptyNum', NaN/Inf/-Inf/<number>      (default is NaN for 'mixed' and '' for 'textual')
%   * 'WaitBar', 'off'/'on'                  (default is 'off')
%   'Delimiter' and 'RegExpDelimiter' are exclusive. 

% KEYWORDS:     import, read, load, text, delimited, cell, numeric, array, flexible
% 
% REQUIREMENTS: Matlab 7.0.1 (R14SP1).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	
	% Read (or set to default) the input arguments.
	if((nargin < 1) || ~ischar(text))		% Is there a file name?
		error('First argument must be the source!'); 
	end
	opts.delimiter=		',';									% Default delimiter value.
	opts.comment=		'';										% Default comment value.
	opts.quotes=		'';										% Default quotes value.
	opts.file=			true;									% Default is file name in source.
	opts.format=		'mixed';								% Default format value.
	opts.op_empty=		[];										% Ignore empties. 
	opts.fname=			'text';
	if(nargin >= 2), opts.delimiter=	delimiter;		end
	if(nargin >= 3), opts.comment=		comment;		end
	if(nargin >= 4), opts.quotes=		quotes;			end
	if(nargin >= 5), options=			options;
	else			 options=			'';
	end
	
	if    (~ischar(opts.delimiter) || isempty(opts.delimiter))
		error('Argument ''delimiter'' must be a non-empty string.');
	elseif(~ischar(opts.comment)   || (length(opts.comment) > 1))
		error('Argument ''comment'' must be a string of maximum one character.');
	elseif(~ischar(opts.quotes)    || (length(opts.quotes) > 2))
		error('Argument ''quotes'' must be a string of maximum two characters.');
	elseif(~ischar(options)   )
		error('Argument ''options'' must be a string.');
	end
	mywaitbar=			@emptywaitbar;							% Default is using no waitbar ...
	th=					[];										% ... so empty waitbar handle.
	options=			lower(options);
	if(~isempty(strfind(options, 'textsource'))),	opts.file= false;			end
	if(~isempty(strfind(options, 'numeric'))),		opts.format= 'numeric';		end
	if(~isempty(strfind(options, 'textual'))),		opts.format= 'textual';		end
	if(~isempty(strfind(options, 'empty2zero'))),	opts.op_empty= 0;	% Replace by zero
	elseif(strcmp(opts.format, 'numeric') || ~isempty(strfind(options, 'empty2nan')))
													opts.op_empty= NaN;	% Replace by NaN.
	end
	if(strcmp(opts.format, 'textual')), opts.op_empty= num2str(opts.op_empty);	end	% Textual 'empty'.
	
	% Set the default return values.
	result.min=		Inf;
	result.max=		0;
	result.quote=	0;
	
	% Read the file.
	if(opts.file)
		opts.fname=		text;
		[fid, errmess]=	fopen(text, 'r');							% Open the file.
		if(fid < 0), error(['Trying to open ''' opts.fname ''': ' errmess]); end
		text=			transpose(fread(fid, 'uchar=>char'));		% Read the file.
		fclose(fid);												% Close the file.
	end
	
	if(~isempty(strfind(options, 'usewaitbar')))
		mywaitbar=		@waitbar;
		th= 			mywaitbar(0, '(readtext) Initialising...');% Show waitbar.
		set(findall(th, '-property', 'Interpreter'), 'Interpreter', 'none');% No (La)TeX formatting. 
	end
	
	% Clean up the text.
	eol=			char(10);									% Using unix-style eol. 
	text=			strrep(text, [char(13) char(10)], eol);		% Replace Windows-style eol.
	text=			strrep(text, char(13), eol);				% Replace MacClassic-style eol.
	if(~isempty(opts.comment))									% Remove comments.
		text=	regexprep(text, ['^\' opts.comment '[^' eol ']*' eol], '');	% Remove entire commented lines. 
		text=	regexprep(text, [ '\' opts.comment '[^' eol ']*'], '');		% Remove commented line endings. 
	end
	if(isempty(text) || text(end) ~= eol),	text= [text eol];	end	% End string with eol, if none.
	
	% Find column and row dividers.
	opts.delimiter=		strrep(opts.delimiter, '\t', char( 9));	% Convert to one char, quicker?
	opts.delimiter=		strrep(opts.delimiter, '\n', char(10));
	opts.delimiter=		strrep(opts.delimiter, '\r', char(13));
	opts.delimiter=		strrep(opts.delimiter, '\f', char(12));
	opts.delimiter=		strrep(opts.delimiter, [char(13) char(10)], eol);% Replace Windows-style eol.
	opts.delimiter=		strrep(opts.delimiter, char(13), eol);	% Replace MacClassic-style eol.
	if(1 == length(opts.delimiter))								% Find column dividers quickly.
		delimS=		find((text == opts.delimiter) | (text == eol));
		delimE=		delimS;
	elseif(isempty(regexp(opts.delimiter, '[\+\*\?\|\[\^\$<>\.\\]', 'once'))) % Find them rather quickly.
		delimS=		strfind(text, opts.delimiter);
		eols=		find(text == eol);
		delimE=		union(eols, delimS + length(opts.delimiter) - 1);
		delimS=		union(eols, delimS);
	else														% Find them with regexp.
		[delimS, delimE]=	regexp(text, [opts.delimiter '|' eol]);
	end
	divRow=			[0, find(text == eol)];						% Find row dividers+last.
readtext.m - 源码说明

本页面展示了「读入一个文本数据」中的 readtext.m 源码文件，采用 M 编程语言编写，共 449 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫开发者社区收录了大量与文本处理相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?