unicode.php

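From "A CMS system well suited to learning, research, and modification; simpler than some popular CMSs but easier to understand" · PHP code · 1,242 lines total · page 1 of 4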
<?php

/******************************************************************************
*
* Filename:     Unicode.php
*
* Description:  Provides functions for handling Unicode strings in PHP without
*               needing to configure the non-default mbstring extension
*
* Author:       Evan Hunter
*
* Date:         27/7/2004
*
* Project:      JPEG Metadata
*
* Revision:     1.10
*
* Changes:      1.00 -> 1.10 : Added the following functions:
*                              smart_HTML_Entities
*                              smart_htmlspecialchars
*                              HTML_UTF16_UnEscape
*                              HTML_UTF8_UnEscape
*                              changed HTML_UTF8_Escape and HTML_UTF16_Escape to
*                              use smart_htmlspecialchars, so that characters which
*                              were already escaped would remain intact
*
*
* URL:          http://electronics.ozhiker.com
*
* License:      This file is part of the PHP JPEG Metadata Toolkit.
*
*               The PHP JPEG Metadata Toolkit is free software; you can
*               redistribute it and/or modify it under the terms of the
*               GNU General Public License as published by the Free Software
*               Foundation; either version 2 of the License, or (at your
*               option) any later version.
*
*               The PHP JPEG Metadata Toolkit is distributed in the hope
*               that it will be useful, but WITHOUT ANY WARRANTY; without
*               even the implied warranty of MERCHANTABILITY or FITNESS
*               FOR A PARTICULAR PURPOSE.
*               See the GNU General Public License
*               for more details.
*
*               You should have received a copy of the GNU General Public
*               License along with the PHP JPEG Metadata Toolkit; if not,
*               write to the Free Software Foundation, Inc., 59 Temple
*               Place, Suite 330, Boston, MA  02111-1307  USA
*
*               If you require a different license for commercial or other
*               purposes, please contact the author: evan@ozhiker.com
*
******************************************************************************/

// TODO: UTF-16 functions have not been tested fully


/******************************************************************************
*
* Unicode UTF-8 Encoding Functions
*
* Description:  UTF-8 is a Unicode encoding system in which extended characters
*               use only the upper half (128 values) of the byte range, thus it
*               allows the use of normal 7-bit ASCII text.
*               7-Bit ASCII will pass straight through UTF-8 encoding/decoding without change
*
*
* The encoding is as follows:
* Unicode Value          :  Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-00000000 - U-0000007F:  0xxxxxxx                      <- This is 7-bit ASCII
* U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
* U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
* U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
*--------------------------------------------------------------------------------
*
******************************************************************************/

/******************************************************************************
*
* Unicode UTF-16 Encoding Functions
*
* Description:  UTF-16 is a Unicode encoding system that uses 16 bit values for representing
*               characters.
*               It also has an extended set of characters available by the use
*               of surrogate pairs, which are a pair of 16 bit values, giving a
*               total data length of 20 useful bits.
*
*
* The encoding is as follows:
* Unicode Value          :  Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-000000 - U-00D7FF:  xxxxxxxx xxxxxxxx
* U-00D800 - U-00DBFF:  Not available - used for high surrogate pairs
* U-00DC00 - U-00DFFF:  Not available - used for low surrogate pairs
* U-00E000 - U-00FFFF:  xxxxxxxx xxxxxxxx
* U-010000 - U-10FFFF:  110110ww wwxxxxxx  110111xx xxxxxxxx      ( wwww = (uni-0x10000)/0x10000 )
*--------------------------------------------------------------------------------
*
*  Surrogate pair Calculations
*
*  $hi = ($uni - 0x10000) / 0x400 + 0xD800;
*  $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
*
*
*  $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
*
*
******************************************************************************/

/******************************************************************************
*
* Function:     UTF8_fix
*
* Description:  Checks a string for badly formed Unicode UTF-8 coding and
*               returns the same string containing only the parts which
*               were properly formed UTF-8 data.
*
* Parameters:   utf8_text - a string with possibly badly formed UTF-8 data
*
* Returns:      output - the well formed UTF-8 version of the string
*
******************************************************************************/

function UTF8_fix( $utf8_text )
{
        // Initialise the current position in the string
        $pos = 0;

        // Create a string to accept the well formed output
        $output = "" ;

        // Cycle through each group of bytes, ensuring the coding is correct
        while ( $pos < strlen( $utf8_text ) )
        {
                // Retrieve the current numerical character value
                $chval =
                        ord($utf8_text[$pos]);

                // Check what the first character is - it will tell us how many bytes the
                // Unicode value covers

                if ( ( $chval >= 0x00 ) && ( $chval <= 0x7F ) )
                {
                        // 1 Byte UTF-8 Unicode (7-Bit ASCII) Character
                        $bytes = 1;
                }
                else if ( ( $chval >= 0xC0 ) && ( $chval <= 0xDF ) )
                {
                        // 2 Byte UTF-8 Unicode Character
                        $bytes = 2;
                }
                else if ( ( $chval >= 0xE0 ) && ( $chval <= 0xEF ) )
                {
                        // 3 Byte UTF-8 Unicode Character
                        $bytes = 3;
                }
                else if ( ( $chval >= 0xF0 ) && ( $chval <= 0xF7 ) )
                {
                        // 4 Byte UTF-8 Unicode Character
                        $bytes = 4;
                }
                else if ( ( $chval >= 0xF8 ) && ( $chval <= 0xFB ) )
                {
                        // 5 Byte UTF-8 Unicode Character
                        $bytes = 5;
                }
                else if ( ( $chval >= 0xFC ) && ( $chval <= 0xFD ) )
                {
                        // 6 Byte UTF-8 Unicode Character
                        $bytes = 6;
                }
                else
                {
                        // Invalid Code - skip character and do nothing
                        $bytes = 0;
                        $pos++;
                }

                // check that there is enough data remaining to read
                if (($pos + $bytes - 1) < strlen( $utf8_text ) )
                {
                        // Cycle through the number of bytes specified,
                        // copying them to the output string
                        while ( $bytes > 0 )
                        {
                                $output .= $utf8_text[$pos];
                                $pos++;
                                $bytes--;
                        }
                }
                else
                {
                        break;
                }
        }

        // Return the result
        return $output;
}

/******************************************************************************
* End of Function:     UTF8_fix
******************************************************************************/


/******************************************************************************
*
* Function:     UTF16_fix
*
* Description:  Checks a string for badly formed Unicode UTF-16 coding and
*               returns the same string containing only the parts which
*               were properly formed UTF-16 data.
*
* Parameters:   utf16_text - a string with possibly badly formed UTF-16 data
*               MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
*                           False will cause processing as Little Endian UTF-16 (Intel, LSB first)
*
* Returns:      output - the well formed UTF-16 version of the string
*
******************************************************************************/

function UTF16_fix( $utf16_text, $MSB_first )
{
        // Initialise the current position in the string
        $pos = 0;

        // Create a string to accept the well formed output
        $output = "" ;

        // Cycle through each group of bytes, ensuring the coding is correct
        while ( $pos < strlen( $utf16_text ) )
        {
                // Retrieve the current numerical character value
                $chval1 = ord($utf16_text[$pos]);

                // Skip over character just read
                $pos++;

                // Check if there is another character available
                if ( $pos  < strlen( $utf16_text ) )
                {
                        // Another character is available - get it for the second half of the UTF-16 value
                        $chval2 = ord( $utf16_text[$pos] );
                }
                else
                {
                        // Error - no second byte to this UTF-16 value - end processing
                        continue 1;
                }

                // Skip over character just read
                $pos++;

                // Calculate the 16 bit unicode value
                if ( $MSB_first )
                {
                        // Big Endian
                        $UTF16_val = $chval1 * 0x100 + $chval2;
                }
                else
                {
                        // Little Endian
                        $UTF16_val = $chval2 * 0x100 + $chval1;
                }

                if ( ( ( $UTF16_val >= 0x0000 ) && ( $UTF16_val <= 0xD7FF ) ) ||
                     ( ( $UTF16_val >= 0xE000 ) && ( $UTF16_val <= 0xFFFF ) ) )
                {
                        // Normal Character (Non Surrogate pair)
                        // Add it to the output
                        $output .= chr( $chval1 ) . chr ( $chval2 );
                }
                else if ( ( $UTF16_val >= 0xD800 ) && ( $UTF16_val <= 0xDBFF ) )
                {
                        // High surrogate of a surrogate pair
                        // Now we need to read the low surrogate
                        // Check if there is another 2 characters available
                        if ( ( $pos + 1 ) < strlen( $utf16_text ) )
                        {
                                // Another 2 characters are available - get them
                                $chval3 = ord( $utf16_text[$pos] );
                                $chval4 = ord( $utf16_text[$pos+1] );

                                // Calculate the second 16 bit unicode value
                                if ( $MSB_first )
                                {
                                        // Big Endian
                                        $UTF16_val2 = $chval3 * 0x100 + $chval4;
                                }
                                else
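
The listing above stops at the end of page 1 of 4, partway through UTF16_fix. As a side note, the surrogate pair calculations given in the UTF-16 header comment can be tried out on their own. The following is a minimal sketch, not part of Unicode.php: the function names surrogate_pair_from_codepoint and codepoint_from_surrogate_pair are hypothetical, and it assumes PHP 7.1 or later (for intdiv, scalar type declarations, and short list syntax). It uses integer division where the header comment writes a plain "/", since "/" in PHP can return a float.

```php
<?php
// Illustrative helpers only; these names do not exist in Unicode.php.

// Split a code point in U+10000..U+10FFFF into a high/low surrogate pair.
function surrogate_pair_from_codepoint( int $uni ): array
{
        if ( $uni < 0x10000 || $uni > 0x10FFFF )
        {
                throw new InvalidArgumentException( "Code point is outside the surrogate pair range" );
        }

        // The 20 useful bits described in the header comment
        $offset = $uni - 0x10000;

        // Top 10 bits go into the high surrogate, bottom 10 bits into the low surrogate
        $hi = intdiv( $offset, 0x400 ) + 0xD800;
        $lo = ( $offset % 0x400 ) + 0xDC00;

        return [ $hi, $lo ];
}

// Recombine a high/low surrogate pair into the original code point.
function codepoint_from_surrogate_pair( int $hi, int $lo ): int
{
        if ( $hi < 0xD800 || $hi > 0xDBFF || $lo < 0xDC00 || $lo > 0xDFFF )
        {
                throw new InvalidArgumentException( "Not a valid high/low surrogate pair" );
        }

        return 0x10000 + ( $hi - 0xD800 ) * 0x400 + ( $lo - 0xDC00 );
}

// Usage: U+1F600 round-trips through its surrogate pair
[ $hi, $lo ] = surrogate_pair_from_codepoint( 0x1F600 );
printf( "%04X %04X\n", $hi, $lo );                               // prints "D83D DE00"
printf( "%X\n", codepoint_from_surrogate_pair( $hi, $lo ) );    // prints "1F600"
```

The range checks are an addition for the sketch; the header comment gives only the arithmetic.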