<table border="0" cellspacing="0" cellpadding="3" width="100%"><tr><td> <div align="center" id="bldcontent"> <a href="../default.htm"><img src="../images/opendocs.png" width="63" height="76" border="0"></a> <br> <div class="symbol">Your OpenSource Publisher™</div> </div> </td></tr></table> <table border="0" cellspacing="3" cellpadding="0" width="100%"><tr><td width="100%"> <div class="content"> <table border="0" cellspacing="2" cellpadding="0" width="100%"><tr><td width="100%"> <div align="center"><H4 CLASS="AUTHOR"><A NAME="AEN5">Boudewijn Rempt</A><br><a href="../../https@secure.linuxports.com/opendocs/default.htm"><img src=odpyqt125.png></a><br>ISBN: 0-9700330-4-4<br><a href="../../https@secure.linuxports.com/opendocs/default.htm">Available from bookstores everywhere or you can order it here.</a><p>You can download the source files for the book <a href="pyqtsrc.tgz">(code / eps) here.</a><hr></div> <HTML><HEAD><TITLE>Unicode strings</TITLE><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.72"><LINK REL="HOME" TITLE="GUI Programming with Python: QT Edition" HREF="book1.htm"><LINK REL="UP" TITLE="String Objects in Python and Qt" HREF="c2029.htm"><LINK REL="PREVIOUS" TITLE="QCString: simple strings in PyQt" HREF="x2104.htm"><LINK REL="NEXT" TITLE="Python Objects and Qt Objects" HREF="c2341.htm"></HEAD><BODY CLASS="SECT1" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#840084" ALINK="#0000FF"><DIV CLASS="NAVHEADER"><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0"><TR><TH COLSPAN="3" ALIGN="center">GUI Programming with Python: QT Edition</TH></TR><TR><TD WIDTH="10%" ALIGN="left" VALIGN="bottom"><A accesskey="P" 
href="index.lxp@lxpwrap=x2104_252ehtm.htm">Prev</A></TD><TD WIDTH="80%" ALIGN="center" VALIGN="bottom">Chapter 8. String Objects in Python and Qt</TD><TD WIDTH="10%" ALIGN="right" VALIGN="bottom"><A accesskey="N" href="index.lxp@lxpwrap=c2341_252ehtm.htm">Next</A></TD></TR></TABLE><HR ALIGN="LEFT" WIDTH="100%"></DIV><DIV CLASS="SECT1"><H1 CLASS="SECT1">Unicode strings</A></H1><DIV CLASS="SECT2"><H2 CLASS="SECT2">Introduction to Unicode</A></H2><P>All text that is handled by computers must be encoded: every letter in a text has to be represented by a numeric value. For a long time, it was assumed that 7 bits would provide enough values to encode all necessary letters; this was the basis for the ASCII character set. However, as computers spread all over the world, it became clear that this was not enough. A whole host of different encodings was designed, ranging from the obscure (TISCII) to the pervasive (Latin-1). Of course, this leads to problems when you try to exchange texts: a Western European Latin-1 user cannot easily read a Russian KOI-8 text on his system. Another problem is that those small, one-byte, eight-bit character sets have no room for useful extras, such as an extensive set of mathematical symbols. The solution has been to create a monster character set of at least 65,000 code points, intended to include every character someone might want to use. This is ISO/IEC 10646. The Unicode standard (http://www.unicode.org) is the official implementation of ISO/IEC 10646.</P><P>Unicode is an essential feature of any modern application. It is mandatory for every e-mail client, for instance, but also for all XML processing, web browsers, many modern programming languages, all Windows applications (such as Word), and KDE 2.0 translation files.</P><P>Unicode is not perfect, though. Some programmers, such as Jamie Zawinski of XEmacs and Netscape fame, lament the extra bytes that Unicode needs: two bytes for every character instead of one. 
Some Japanese experts oppose the unification of Chinese characters and Japanese characters. Japanese characters are historically derived from Chinese characters, and even their modern meaning is often identical, but there are some slight visual differences. These critics are often very vociferous, but Unicode is the best solution we have for representing the wide variety of scripts humanity has invented.</P><P>There are a few other practical problems concerning Unicode. Since the character set is so very large, there are no fonts that include all characters. At the time of writing, the best font available is Microsoft's Arial Unicode, which can be downloaded for free. The Unicode character set also includes interesting scripts such as Devanagari, a script where single letters combine to form complicated ligatures. The total number of Devanagari letters is fairly small, but the set of ligatures runs into the hundreds. Those ligatures are not defined in the character set, but have to be present in fonts. Scripts like Arabic or Burmese are even more complicated. For those scripts, special rendering engines have to be written in order to display a text correctly.</P><P>From version 3, Qt includes capable rendering engines for a number of scripts, such as Arabic, and promises to include more. With Qt 3, you can also combine several fonts to form a more complete set of characters, which means that you no longer have to use one monster font with tens of thousands of glyphs.</P><P>The next problem is inputting those texts. Even with remappable keyboards, it's still a monster job to support all scripts. 
Japanese, for instance, needs a special-purpose input mechanism with dictionary lookups that decide which combination of sounds must be represented using Kanji (Chinese-derived characters) or one of the two syllabic scripts, hiragana and katakana.</P><P>There are still more complications, which have to do with sort order and bidirectional text (Hebrew running from right to left, Latin from left to right). Then there are the vexed problems of determining which language the user prefers and which country he is in (I prefer to write in English, but have the dates show up in the Dutch format, for instance). All these problems have a bearing on programming with Unicode, but they are so complicated that a separate book should be written to deal with them.</P><P>However, both Python strings and Qt strings support Unicode, and both support conversion from Unicode to legacy character sets such as the widespread Latin-1, and vice versa. As said above, Unicode is a multi-byte encoding: a single Unicode character is encoded using <SPAN><I CLASS="EMPHASIS">two</I></SPAN> bytes. Of course, this doubles memory requirements compared to single-byte character sets such as Latin-1. This can be circumvented by encoding Unicode using a variable number of bytes, a scheme known as UTF-8. In this scheme, Unicode characters that are equivalent to ASCII characters use just one byte, while the other characters in the 65,000-code-point range described above take two or three bytes. UTF-8 is a widespread standard, and both Qt and Python support it.</P><P>I'll first describe the pitfalls of working with Unicode from Python, and then bring in the Qt complications.</P></DIV><DIV CLASS="SECT2"><H2 CLASS="SECT2">Python and Unicode</A></H2><P>Python distinguishes between Unicode strings and 'normal' strings, that is, strings where every byte represents one character. Plain Python strings are often used as character arrays representing immutable binary data. 
In fact, plain strings are semantically very similar to Java's byte array, or Qt's <TT CLASS="CLASSNAME">QByteArray</TT> class: they represent a simple sequence of bytes, where every byte <SPAN><I CLASS="EMPHASIS">may</I></SPAN> represent a character, but could also represent something quite different that is not human-readable text at all.</P><P>Creating a Unicode string is a bootstrapping problem. Whether you use BlackAdder's Scintilla editor or another editor, it will probably not support Unicode input, so you cannot type Chinese characters directly. However, there are clever ways around this problem: you can either type hex codes, or construct your strings from other sources. In the third part of this book we will create a small but fully functional Unicode editor.</P><DIV CLASS="SECT3"><H3 CLASS="SECT3">String literals</A></H3><P>You can create a Unicode string literal by prefixing the string with the letter <SPAN><I CLASS="EMPHASIS">u</I></SPAN>, or convert a plain string to Unicode with the built-in <TT CLASS="FUNCTION">unicode</TT> function. You cannot, however, write Python code using anything but ASCII. If you look at the following script, you will notice that it defines a function whose name is written in Chinese characters (yin4shua1 means print) and that tries to print the opening words of the Nala, a Sanskrit epic. Python cannot handle this, so all actual code must be in ASCII.</P><DIV CLASS="MEDIAOBJECT"><P><DIV CLASS="CAPTION"><P>A Python script written in Unicode.</P></DIV></P></DIV><P>Of course, it would be nice if we could at least type the strings directly in UTF-8, as shown in the next screenshot:</P><DIV CLASS="MEDIAOBJECT"><P><DIV CLASS="CAPTION"><P>A Python script with the strings written in Unicode.</P></DIV></P></DIV><P>Unfortunately, this won't work either. Hidden deep in the bowels of the Python startup process, a default encoding is set for all strings. 
This encoding is used to convert from Unicode whenever a Unicode string has to be handed to components of the outside world that don't speak Unicode, such as <TT CLASS="FUNCTION">print</TT>. By default this is 7-bit ASCII. Running the script gives the following error:</P><PRE CLASS="SCREEN">boudewijn@maldar:~/doc/opendoc/ch4 > python unicode2.py
Traceback (most recent call last):
  File "unicode2.py", line 4, in ?
    nala()
  File "unicode2.py", line 2, in nala
    print u"啶啶膏ムう 啶班ぞ啶啶