<?php
/**
* @author Mikolaj Jedrzejak <mikolajj@op.pl>
* @copyright Copyright Mikolaj Jedrzejak (c) 2003-2004
* @version 1.0 2004-07-27 00:37
* @link http://www.unicode.org Unicode Homepage
* @link http://www.mikkom.pl My Homepage
*
**/
// ereg_replace() was removed in PHP 7; str_replace() does the same backslash-to-slash swap.
$PATH_TO_CLASS = dirname(str_replace("\\", "/", __FILE__)) . "/" . "ConvertTables" . "/";
define ("CONVERT_TABLES_DIR", $PATH_TO_CLASS);
define ("DEBUG_MODE", 1);
/**
* -- 1.0 2004-07-28 --
*
* -- The most important thing --
* I want to thank all the people who helped me fix all the bugs, small and big ones.
* I hope that you don't mind that your names are in this file.
*
* -- Some Apache issues --
* Lukas Lisa told me that with some special Apache configurations you have to call
* the header() function with the proper encoding to get your result displayed correctly.
* If you want to see what I mean, look at demo.php and demo1.php
*
* -- BETA 1.0 2003-10-21 --
*
* -- You should know about... --
* To understand this class well you should read all of this first :) but if you are
* in a hurry, just run demo.php and see what's inside.
* 1. I'm not good at English at 03:45 :) - so forgive me all mistakes
* 2. This class is a BETA version because I haven't tested it enough
* 3. Feel free to contact me with questions, bug reports and mistakes in PHP and this documentation (email below)
*
* -- In a few words... --
* Why ConvertCharset class?
*
* I made this class because I had a lot of problems with different charsets. First, because people
* from Microsoft wanted to have their own encoding; second, because people from Macromedia didn't
* think about other languages; third, because sometimes I need to use text written on a Mac, and of course
* it has its own encoding :)
*
* Notice & remember:
* - When I say "1 byte string" I mean 1 byte per char.
* - When I say "multibyte string" I mean more than one byte per char.
*
* So, these are the main FEATURES of this class:
* - conversion between 1 byte charsets
* - conversion from a 1 byte charset to a multibyte charset (utf-8)
* - conversion from a multibyte charset (utf-8) to a 1 byte charset
* - every conversion output can be saved with numeric entities (browser charset independent - not the full truth)
*
* This is a list of charsets you can operate with. The basic rule is that a char has to exist in both charsets,
* otherwise you'll get an error.
*
* - WINDOWS
* - windows-1250 - Central Europe
* - windows-1251 - Cyrillic
* - windows-1252 - Latin I
* - windows-1253 - Greek
* - windows-1254 - Turkish
* - windows-1255 - Hebrew
* - windows-1256 - Arabic
* - windows-1257 - Baltic
* - windows-1258 - Viet Nam
* - cp874 - Thai - this file is also for DOS
*
* - DOS
* - cp437 - Latin US
* - cp737 - Greek
* - cp775 - BaltRim
* - cp850 - Latin1
* - cp852 - Latin2
* - cp855 - Cyrillic
* - cp857 - Turkish
* - cp860 - Portuguese
* - cp861 - Iceland
* - cp862 - Hebrew
* - cp863 - Canada
* - cp864 - Arabic
* - cp865 - Nordic
* - cp866 - Cyrillic Russian (this is the one used in IE as "Cyrillic (DOS)")
* - cp869 - Greek2
*
* - MAC (Apple)
* - x-mac-cyrillic
* - x-mac-greek
* - x-mac-icelandic
* - x-mac-ce
* - x-mac-roman
*
* - ISO (Unix/Linux)
* - iso-8859-1
* - iso-8859-2
* - iso-8859-3
* - iso-8859-4
* - iso-8859-5
* - iso-8859-6
* - iso-8859-7
* - iso-8859-8
* - iso-8859-9
* - iso-8859-10
* - iso-8859-11
* - iso-8859-12
* - iso-8859-13
* - iso-8859-14
* - iso-8859-15
* - iso-8859-16
*
* - MISCELLANEOUS
* - gsm0338 (ETSI GSM 03.38)
* - cp037
* - cp424
* - cp500
* - cp856
* - cp875
* - cp1006
* - cp1026
* - koi8-r (Cyrillic)
* - koi8-u (Cyrillic Ukrainian)
* - nextstep
* - us-ascii
* - us-ascii-quotes
*
* - DSP implementation for NeXT
* - stdenc
* - symbol
* - zdingbat
*
* - And specially for old Polish programs
* - mazovia
*
* -- Now, to the point... --
* Here are the main constants.
*
* DEBUG_MODE
*
* You can set this value to:
* - -1 - No errors or comments
* - 0 - Only error messages, no comments
* - 1 - Error messages and comments
*
* The default value is 1, and during your first steps with the class it should be left as is.
*
* CONVERT_TABLES_DIR
*
* This is the place where you store all files with charset encodings. Filenames should have
* the same names as encodings. My advice is to keep the existing names, because they
* were taken from unicode.org (www.unicode.org), and after an update to Unicode 3.0 or 4.0
* the names of the files will be the same, so if you want to save your time... leave the
* names as they are for future updates.
*
* The directory with the encoding files should be in the class location directory by default,
* but of course you can change it if you like.
*
* @package All about charset...
* @author Mikolaj Jedrzejak <mikolajj@op.pl>
* @copyright Copyright Mikolaj Jedrzejak (c) 2003-2004
* @version 1.0 2004-07-27 23:11
* @access public
*
* @link http://www.unicode.org Unicode Homepage
**/
class ConvertCharset {
var $RecognizedEncoding; // Tells whether the string contains multibyte chars.
var $Entities; // Tells whether the output should use numeric entities.
/**
* ConvertCharset::UnicodeEntity()
*
* Unicode encoding bytes, bits representation.
* Each b represents a bit that can be used to store character data.
* - bytes, bits, binary representation
* - 1, 7, 0bbbbbbb
* - 2, 11, 110bbbbb 10bbbbbb
* - 3, 16, 1110bbbb 10bbbbbb 10bbbbbb
* - 4, 21, 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
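*
* As a worked illustration of the table above (the byte values are chosen only as an
* example), the two-byte sequence 0xC5 0x82 decodes to a numeric entity like this:
* <code>
* 0xC5 = 11000101 -> matches 110bbbbb, payload = 0xC5 & 31 = 5
* 0x82 = 10000010 -> matches 10bbbbbb, payload = 0x82 & 63 = 2
* code point = (5 * 64) + 2 = 322 -> "&#322;" (U+0142)
* </code>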
*
* This function is written the "long" way, for everyone who would like to analyze
* the process of Unicode encoding and understand it. All other functions, like HexToUtf,
* are written in the shortest way I can write them :) which doesn't mean they are short,
* of course. You can check it in HexToUtf() (link below) - a very similar function.
*
* IMPORTANT: Remember that the $UnicodeString input CANNOT contain single byte upper half
* extended ASCII codes. Why? Because there is a possibility that this function will eat
* the following char, thinking it's a multibyte Unicode char.
*
* @param string $UnicodeString Input Unicode string (1 char can take more than 1 byte)
* @return string The input string with its multibyte Unicode chars saved as numeric entities
* @see HexToUtf()
**/
function UnicodeEntity ($UnicodeString)
{
$OutString = "";
$StringLength = strlen ($UnicodeString);
for ($CharPosition = 0; $CharPosition < $StringLength; $CharPosition++)
{
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
if ($AsciiChar < 128) //1 7 0bbbbbbb (127)
{
$OutString .= $Char;
}
else if ($AsciiChar >> 5 == 6) //2 11 110bbbbb 10bbbbbb (2047)
{
$FirstByte = ($AsciiChar & 31);
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$SecondByte = ($AsciiChar & 63);
$AsciiChar = ($FirstByte * 64) + $SecondByte;
$Entity = sprintf ("&#%d;", $AsciiChar);
$OutString .= $Entity;
}
else if ($AsciiChar >> 4 == 14) //3 16 1110bbbb 10bbbbbb 10bbbbbb
{
$FirstByte = ($AsciiChar & 15); // the lead byte 1110bbbb carries only 4 data bits
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$SecondByte = ($AsciiChar & 63);
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$ThirdByte = ($AsciiChar & 63);
$AsciiChar = ((($FirstByte * 64) + $SecondByte) * 64) + $ThirdByte;
$Entity = sprintf ("&#%d;", $AsciiChar);
$OutString .= $Entity;
}
else if ($AsciiChar >> 3 == 30) //4 21 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
{
$FirstByte = ($AsciiChar & 7); // the lead byte 11110bbb carries only 3 data bits; "& 31" would keep the marker bit
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$SecondByte = ($AsciiChar & 63);
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$ThirdByte = ($AsciiChar & 63);
$CharPosition++;
$Char = $UnicodeString [$CharPosition];
$AsciiChar = ord ($Char);
$FourthByte = ($AsciiChar & 63);
$AsciiChar = ((((($FirstByte * 64) + $SecondByte) * 64) + $ThirdByte) * 64) + $FourthByte;
$Entity = sprintf ("&#%d;", $AsciiChar);
$OutString .= $Entity;
}
}
return $OutString;
}
/**
* ConvertCharset::HexToUtf()
*
* This simple function takes the hexadecimal value of a Unicode char (anything that fits in
* up to 4 UTF-8 bytes) and returns it as a regular char. It is very similar to the
* UnicodeEntity function (link below). There is one difference: the returned format.
* This time it's the raw UTF-8 byte sequence; in most cases it will be one or two bytes.
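*
* A worked illustration of the two-byte and three-byte branches (the input values are
* chosen only as examples):
* <code>
* "142"  -> 322  -> chr((322 >> 6) + 192) . chr((322 & 63) + 128) -> bytes 0xC5 0x82
* "20AC" -> 8364 -> chr((8364 >> 12) + 224) . chr(((8364 >> 6) & 63) + 128) . chr((8364 & 63) + 128) -> bytes 0xE2 0x82 0xAC
* </code>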
*
* @param string $UtfCharInHex Hexadecimal value of a unicode char.
* @return string Encoded hexadecimal value as a regular char.
* @see UnicodeEntity()
**/
function HexToUtf ($UtfCharInHex)
{
$OutputChar = "";
$UtfCharInDec = hexdec($UtfCharInHex);
// Every shift and mask is parenthesized: in PHP "+" binds tighter than ">>" and "&".
if ($UtfCharInDec < 128) $OutputChar .= chr($UtfCharInDec);
else if ($UtfCharInDec < 2048) $OutputChar .= chr(($UtfCharInDec >> 6) + 192) . chr(($UtfCharInDec & 63) + 128);
else if ($UtfCharInDec < 65536) $OutputChar .= chr(($UtfCharInDec >> 12) + 224) . chr((($UtfCharInDec >> 6) & 63) + 128) . chr(($UtfCharInDec & 63) + 128);
else if ($UtfCharInDec < 2097152) $OutputChar .= chr(($UtfCharInDec >> 18) + 240) . chr((($UtfCharInDec >> 12) & 63) + 128) . chr((($UtfCharInDec >> 6) & 63) + 128) . chr(($UtfCharInDec & 63) + 128);
return $OutputChar;
}
/**
* ConvertCharset::MakeConvertTable()
*
* This function creates a table from two SBCS (Single Byte Character Set) encodings. Every conversion
* goes through this table.
*
* - The files with encoding tables have to be saved in "Format A" of the unicode.org charset table format! This is usually stated in the header of every charset file.
* - BOTH charsets MUST be SBCS
* - The files with encoding tables have to be complete (none of the chars can be missing, unless you are sure you are not going to use them)
*
* A "Format A" encoding file, if you have to build it yourself, should apply these rules:
* - you can comment everything with #
* - the first column contains 1 byte chars in hex starting from 0x..
* - the second column contains the Unicode equivalent in hex starting from 0x....
* - every following column is optional, but in "Format A" it should contain the Unicode char name and/or your own comment
* - the columns can be split by "spaces", "tabs", "," or any combination of these
* - below is an example
*
* <code>
* #
* # The entries are in ANSI X3.4 order.
* #
* 0x00 0x0000 # NULL and extra comment, if needed