📄 libxmlparser.pas
字号:
(**
===============================================================================================
Name : LibXmlParser
===============================================================================================
Project : All Projects
===============================================================================================
Subject : Progressive XML Parser for all types of XML Files
===============================================================================================
Author : Stefan Heymann
Eschenweg 3
72076 T黚ingen
GERMANY
E-Mail: stefan@destructor.de
URL: www.destructor.de
===============================================================================================
Source, Legals ("Licence")
--------------------------
The official site to get this parser is http://www.destructor.de/
Usage and Distribution of this Source Code is ruled by the
"Destructor.de Source code Licence" (DSL) which comes with this file or
can be downloaded at http://www.destructor.de/
IN SHORT: Usage and distribution of this source code is free.
You use it completely on your own risk.
Postcardware
------------
If you like this code, please send a postcard of your city to my above address.
===============================================================================================
!!! All parts of this code which are not finished or not conforming exactly to
the XmlSpec are marked with three exclamation marks
-!- Parts where the parser may be able to detect errors in the document's syntax are
marked with the dash-exlamation mark-dash sequence.
===============================================================================================
Terminology:
------------
- Start: Start of a buffer part
- Final: End (last character) of a buffer part
- DTD: Document Type Definition
- DTDc: Document Type Declaration
- XMLSpec: The current W3C XML Recommendation (version 1.0 as of 1998-02-10), Chapter No.
- Cur*: Fields concerning the "Current" part passed back by the "Scan" method
===============================================================================================
Scanning the XML document
-------------------------
- Create TXmlParser Instance MyXml := TXmlParser.Create;
- Load XML Document MyXml.LoadFromFile (Filename);
- Start Scanning MyXml.StartScan;
- Scan Loop WHILE MyXml.Scan DO
- Test for Part Type CASE MyXml.CurPartType OF
- Handle Parts ... : ;;;
- Handle Parts ... : ;;;
- Handle Parts ... : ;;;
END;
- Destroy MyXml.Free;
===============================================================================================
Loading the XML document
------------------------
You can load the XML document from a file with the "LoadFromFile" method.
It is beyond the scope of this parser to perform HTTP or FTP accesses. If you want your
application to handle such requests (URLs), you can load the XML via HTTP or FTP or whatever
protocol and hand over the data buffer using the "LoadFromBuffer" or "SetBuffer" method.
"LoadFromBuffer" loads the internal buffer of TXmlParser with the given null-terminated
string, thereby creating a copy of that buffer.
"SetBuffer" just takes the pointer to another buffer, which means that the given
buffer pointer must be valid while the document is accessed via TXmlParser.
===============================================================================================
Encodings:
----------
This XML parser kind of "understands" the following encodings:
- UTF-8
- ISO-8859-1
- Windows-1252
Any flavor of multi-byte characters (and this includes UTF-16) is not supported. Sorry.
Every string which has to be passed to the application passes the virtual method
"TranslateEncoding" which translates the string from the current encoding (stored in
"CurEncoding") into the encoding the application wishes to receive.
The "TranslateEncoding" method that is built into TXmlParser assumes that the application
wants to receive Windows ANSI (Windows-1252, about the same as ISO-8859-1) and is able
to convert UTF-8 and ISO-8859-1 encodings.
For other source and target encodings, you will have to override "TranslateEncoding".
===============================================================================================
Buffer Handling
---------------
- The document must be loaded completely into a piece of RAM
- All character positions are referenced by PChar pointers
- The TXmlParser instance can either "own" the buffer itself (then, FBufferSize is > 0)
or reference the buffer of another instance or object (then, FBuffersize is 0 and
FBuffer is not NIL)
- The Property DocBuffer passes back a pointer to the first byte of the document. If there
is no document stored (FBuffer is NIL), the DocBuffer returns a pointer to a NULL character.
===============================================================================================
Whitespace Handling
-------------------
The TXmlParser property "PackSpaces" determines how Whitespace is returned in Text Content:
While PackSpaces is true, all leading and trailing whitespace characters are trimmed of, all
Whitespace is converted to Space #x20 characters and contiguous Whitespace characters are
compressed to one.
If the "Scan" method reports a ptContent part, the application can get the original text
with all whitespace characters by extracting the characters from "CurStart" to "CurFinal".
If the application detects an xml:space attribute, it can set "PackSpaces" accordingly or
use CurStart/CurFinal.
Please note that TXmlParser does _not_ normalize Line Breaks to single LineFeed characters
as the XmlSpec requires (XmlSpec 2.11).
The xml:space attribute is not handled by TXmlParser. This is on behalf of the application.
===============================================================================================
Non-XML-Conforming
------------------
TXmlParser does not conform 100 % exactly to the XmlSpec:
- UTF-16 is not supported (XmlSpec 2.2)
(Workaround: Convert UTF-16 to UTF-8 and hand the buffer over to TXmlParser)
- As the parser only works with single byte strings, all Unicode characters > 255
can currently not be handled correctly.
- Line breaks are not normalized to single Linefeed #x0A characters (XmlSpec 2.11)
(Workaround: The Application can access the text contents on its own [CurStart, CurFinal],
thereby applying every normalization it wishes to)
- The attribute value normalization does not work exactly as defined in the
Second Edition of the XML 1.0 specification.
- See also the code parts marked with three consecutive exclamation marks. These are
parts which are not finished in the current code release.
This list may be incomplete, so it may grow if I get to know any other points.
As work on the parser proceeds, this list may also shrink.
===============================================================================================
Things Todo
-----------
- Introduce a new event/callback which is called when there is an unresolvable
entity or character reference
- Support Unicode
- Use Streams instead of reading the whole XML into memory
===============================================================================================
Change History, Version numbers
-------------------------------
The Date is given in ISO Year-Month-Day (YYYY-MM-DD) order.
Versions are counted from 1.0.0 beginning with the version from 2000-03-16.
Unreleased versions don't get a version number.
Date Author Version Changes
-----------------------------------------------------------------------------------------------
2000-03-16 HeySt 1.0.0 Start
2000-03-28 HeySt 1.0.1 Initial Publishing of TXmlParser on the destructor.de Web Site
2000-03-30 HeySt 1.0.2 TXmlParser.AnalyzeCData: Call "TranslateEncoding" for CurContent
2000-03-31 HeySt 1.0.3 Deleted the StrPosE function (was not needed anyway)
2000-04-04 HeySt 1.0.4 TDtdElementRec modified: Start/Final for all Elements;
Should be backwards compatible.
AnalyzeDtdc: Set CurPartType to ptDtdc
2000-04-23 HeySt 1.0.5 New class TObjectList. Eliminated reference to the Delphi 5
"Contnrs" unit so LibXmlParser is Delphi 4 compatible.
2000-07-03 HeySt 1.0.6 TNvpNode: Added Constructor
2000-07-11 HeySt 1.0.7 Removed "Windows" from USES clause
Added three-exclamation-mark comments for Utf8ToAnsi/AnsiToUtf8
Added three-exclamation-mark comments for CHR function calls
2000-07-23 HeySt 1.0.8 TXmlParser.Clear: CurAttr.Clear; EntityStack.Clear;
(This was not a bug; just defensive programming)
2000-07-29 HeySt 1.0.9 TNvpList: Added methods: Node(Index), Value(Index), Name(Index);
2000-10-07 HeySt Introduced Conditional Defines
Uses Contnrs unit and its TObjectList class again for
Delphi 5 and newer versions
2001-01-30 HeySt Introduced Version Numbering
Made LoadFromFile and LoadFromBuffer BOOLEAN functions
Introduced FileMode parameter for LoadFromFile
BugFix: TAttrList.Analyze: Must add CWhitespace to ExtractName call
Comments worked over
2001-02-28 HeySt 1.0.10 Completely worked over and tested the UTF-8 functions
Fixed a bug in TXmlParser.Scan which caused it to start over when it
was called after the end of scanning, resulting in an endless loop
TEntityStack is now a TObjectList instead of TList
2001-07-03 HeySt 1.0.11 Updated Compiler Version IFDEFs for Kylix
2001-07-11 HeySt 1.0.12 New TCustomXmlScanner component (taken over from LibXmlComps.pas)
2001-07-14 HeySt 1.0.13 Bugfix TCustomXmlScanner.FOnTranslateEncoding
2001-10-22 HeySt Don't clear CurName anymore when the parser finds a CDATA section.
2001-12-03 HeySt 1.0.14 TObjectList.Clear: Make call to INHERITED method (fixes a memory leak)
2001-12-05 HeySt 1.0.15 TObjectList.Clear: removed call to INHERITED method
TObjectList.Destroy: Inserted SetCapacity call.
Reduces need for frequent re-allocation of pointer buffer
Dedicated to my father, Theodor Heymann
2002-06-26 HeySt 1.0.16 TXmlParser.Scan: Fixed a bug with PIs whose name is beginning
with 'xml'. Thanks to Uwe Kamm for submitting this bug.
The CurEncoding property is now always in uppercase letters (the XML
spec wants it to be treated case independently so when it's uppercase
comparisons are faster)
2002-03-04 HeySt 1.0.17 Included an IFDEF for Delphi 7 (VER150) and Kylix
There is a new symbol HAS_CONTNRS_UNIT which is used now to
distinguish between IDEs which come with the Contnrs unit and
those that don't.
*)
// --- Delphi/Kylix Version Numbers
// As this is no code, this does not blow up your object or executable code at all
(*$IFDEF LINUX *)
(*$DEFINE K1_OR_NEWER *)
(*$ENDIF *)
(*$IFDEF MSWINDOWS *)
(*$DEFINE D1_OR_NEWER *)
(*$IFNDEF VER80 *)
(*$DEFINE D2_OR_NEWER *)
(*$IFNDEF VER90 *)
(*$DEFINE D3_OR_NEWER *)
(*$IFNDEF VER100 *)
(*$DEFINE D4_OR_NEWER *)
(*$IFNDEF VER120 *)
(*$DEFINE D5_OR_NEWER *)
(*$IFNDEF VER130 *)
(*$IFNDEF VER140 *)
(*$IFNDEF VER150 *)
If the compiler gets stuck here,
you are using a compiler version unknown to this code.
You will probably have to change this code accordingly.
At first, try to comment out these lines and see what will happen.
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$ENDIF *)
(*$IFDEF D5_OR_NEWER *)
(*$DEFINE HAS_CONTNRS_UNIT *)
(*$ENDIF *)
(*$IFDEF K1_OR_NEWER *)
(*$DEFINE HAS_CONTNRS_UNIT *)
(*$ENDIF *)
UNIT LibXmlParser;
INTERFACE
USES
SysUtils, Classes,
(*$IFDEF HAS_CONTNRS_UNIT *) // The Contnrs Unit was introduced in Delphi 5
Contnrs,
(*$ENDIF*)
Math;
CONST
CVersion = '1.0.17'; // This variable will be updated for every release
// (I hope, I won't forget to do it everytime ...)
TYPE
TPartType = // --- Document Part Types
(ptNone, // Nothing
ptXmlProlog, // XML Prolog XmlSpec 2.8 / 4.3.1
ptComment, // Comment XmlSpec 2.5
ptPI, // Processing Instruction XmlSpec 2.6
ptDtdc, // Document Type Declaration XmlSpec 2.8
ptStartTag, // Start Tag XmlSpec 3.1
ptEmptyTag, // Empty-Element Tag XmlSpec 3.1
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -