促使我写此正则表达式解析库的由来.htm (How I came to write this regular expression parsing library)
<p class=MsoNormal><span lang=EN-US>Brother Lan, I know your experience well. I too once searched hard for a good C++ regular expression library, but I was not really satisfied with any of the ones I tried.</span></p>
<p class=MsoNormal><span lang=EN-US>Perl is widely acknowledged to have the best regular expressions (many libraries now claim to support Perl regular expressions); lazy matching, for example, is very useful.</span></p>
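<p class=MsoNormal><span lang=EN-US>To illustrate lazy matching, here is a minimal sketch. It assumes a C++11 compiler with std::regex, which postdates this thread and is not one of the libraries discussed here; the helper name first_match is made up for the example.</span></p>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the first match of `pattern` in `text`, or an empty string if
// there is no match. Uses the default ECMAScript grammar of std::regex,
// which supports Perl-style lazy quantifiers such as `*?`.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Greedy `.*` runs to the last 'b'; lazy `.*?` stops at the first one.
void lazy_demo() {
    assert(first_match("a1b2b", "a.*b") == "a1b2b");
    assert(first_match("a1b2b", "a.*?b") == "a1b");
}
```

<p class=MsoNormal><span lang=EN-US>The only difference between the two patterns is the question mark after the quantifier, which makes it consume as little input as possible.</span></p>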
<p class=MsoNormal><span lang=EN-US>If you are not required to do this in C++, you could embed a Python engine and then use Python's regular expression module, re.</span></p>
<p class=MsoNormal><span lang=EN-US>For your requirements you would need Python 2.4 or later, because Unicode support for Chinese only arrives in 2.4 (2.4 has not been released yet).</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>On searching for Chinese characters in C++: 大话西游 (Westward Journey Online) ran into this recently as well, because messages in the trade channel are required to contain “卖” (sell). To test for it exactly, first convert the char* or string to WCHAR or wstring with MultiByteToWideChar, and then search.</span></p>
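<p class=MsoNormal><span lang=EN-US>A minimal sketch of the convert-then-search approach. The helper names to_wide and find_char are invented for this example, and the use of CP_ACP (the system ANSI code page) is an assumption; the conversion part compiles only on Windows, while the search part is portable.</span></p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Converts bytes in the system ANSI code page (CP_ACP, an assumption) to a
// wide string. Returns an empty string on failure or for empty input.
std::wstring to_wide(const std::string& bytes) {
    if (bytes.empty()) return std::wstring();
    const int len = static_cast<int>(bytes.size());
    // First call: ask how many wide characters are needed.
    int n = MultiByteToWideChar(CP_ACP, 0, bytes.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    // Second call: perform the conversion into the buffer.
    MultiByteToWideChar(CP_ACP, 0, bytes.data(), len, &wide[0], n);
    return wide;
}
#endif

// U+5356 is the character "卖".
const wchar_t kSell = static_cast<wchar_t>(0x5356);

// Index (in wide characters) of the first occurrence of `ch` in `text`,
// or std::wstring::npos if it does not occur.
std::size_t find_char(const std::wstring& text, wchar_t ch) {
    assert(ch != L'\0');
    return text.find(ch);
}
```

<p class=MsoNormal><span lang=EN-US>Because each of these characters occupies exactly one element of the wide string, the search cannot match the second byte of one character together with the first byte of the next, and the returned index is the character position.</span></p>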
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Hope this helps.</span></p>
<p class=MsoNormal><span lang=EN-US>_______________________________________________</span></p>
<p class=MsoNormal><span lang=EN-US>Cpp mailing list</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Hello Meng Yan,</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>======= 2004-06-02 13:22:29 you wrote: =======</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>>See www.pcre.org</span></p>
<p class=MsoNormal><span lang=EN-US>Many thanks!</span></p>
<p class=MsoNormal><span lang=EN-US>>My personal feeling is that you might as well sit down and write an iterator; it should be quite easy. Then again, I have not done this kind of thing for a long time, so take this as no more than a general remark.</span></p>
<p class=MsoNormal><span lang=EN-US>I made a start on it, and it seems that writing just an iterator is not enough; base_string would have to be written as well. There is also a big problem: the ++ and -- operations are not symmetric. For ++ I can easily tell whether the current byte begins a multibyte character, and so whether to advance the address by 1 or by 2. For --, however, it is not so easy (bearing in mind encodings such as Big5, where the second byte of a Chinese character overlaps with the codes used for English text). If I simply subtract 1 from the address, will that go wrong? And is this iterator still a random_iterator?</span></p>
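<p class=MsoNormal><span lang=EN-US>One possible way to handle the -- step, as a sketch only. It assumes a GBK/Big5-style double-byte encoding in which every byte in 0x81..0xFE starts a two-byte character and every other byte is a character on its own; the function names are invented for this example, and real code pages differ in their exact lead-byte ranges.</span></p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed lead-byte range (see the note above; not exact for any one code page).
inline bool is_lead(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// Forward step. `pos` must be a character boundary inside the string.
std::size_t next_char(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    std::size_t step = is_lead(static_cast<unsigned char>(s[pos])) ? 2 : 1;
    // Clamp in case the input ends in the middle of a character.
    return (pos + step <= s.size()) ? pos + step : s.size();
}

// Backward step. `pos` must be a character boundary and greater than 0.
// The byte at pos-1 is either a one-byte character or a trail byte, and its
// value alone cannot tell which. Count the unbroken run of lead-range bytes
// just before it: the run starts on a character boundary and pairs up from
// there, so an odd run means the last of them is the lead byte of s[pos-1].
std::size_t prev_char(const std::string& s, std::size_t pos) {
    assert(pos > 0 && pos <= s.size());
    std::size_t i = pos - 1;
    std::size_t run = 0;
    while (i > 0 && is_lead(static_cast<unsigned char>(s[i - 1]))) {
        --i;
        ++run;
    }
    return (run % 2 == 1) ? pos - 2 : pos - 1;
}
```

<p class=MsoNormal><span lang=EN-US>Under these assumptions a plain address-1 is indeed wrong whenever the previous character is two bytes long, and stepping back costs a scan whose length depends on the data. That suggests such an iterator can be bidirectional at best, not random access.</span></p>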
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>= = = = = = = = = = = = = = = = = = = =</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Regards,</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span style='font-family:宋体'> </span><span
lang=EN-US>lanzhengpeng</span></p>
<p class=MsoNormal><span style='font-family:宋体'> </span><span
lang=EN-US>2004-06-02</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>_______________________________________________</span></p>
<p class=MsoNormal><span lang=EN-US>Cpp mailing list</span></p>
<p class=MsoNormal><span lang=EN-US>Sent: June 2, 2004 15:53</span></p>
<p class=MsoNormal><span lang=EN-US>To: C++ Discuss Group</span></p>
<p class=MsoNormal><span lang=EN-US>Subject: Re[2]: Reply: [cpp] Regular expressions and multibyte encodings</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Hello lanzhengpeng,</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Wednesday, June 2, 2004</span><span
lang=EN-US>, </span><span lang=EN-US>3:38:11 PM</span><span lang=EN-US>, you
wrote:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>You had better convert it after all: convert to wstring.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>This reminds me of another problem, one I worked on a while ago.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>English has stricmp; shouldn't Chinese have a fuzzy search too? For example, one that ignores the difference between homophones.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Sometimes there is no need to ignore every homophone: high-frequency characters are generally not confused with each other even when they sound the same. It is the less common characters that tend to be replaced by a wrong character with the same sound.</span></p>
<p class=MsoNormal><span lang=EN-US>In addition, Chinese has characters with more than one pronunciation, which makes this kind of fuzzy matching algorithm more complicated.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>I once spent an afternoon compiling material that lists the Hanyu Pinyin (including tones) for most of the Chinese characters in the GBK character set, including characters with more than one reading.</span></p>
<p class=MsoNormal><span lang=EN-US>I also have a table of the 1000-odd most commonly used characters, ordered by frequency of use.</span></p>
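<p class=MsoNormal><span lang=EN-US>A rough sketch of how homophone-insensitive comparison could work once a pinyin table like the one described above is available. Everything here is an assumption made for illustration: the ReadingTable type, the function names, the tone-numbered spelling of readings, and the five-entry sample table. It does not implement the idea of exempting high-frequency characters.</span></p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical table: code point -> set of readings (pinyin plus tone number).
// A character with several pronunciations simply has several entries.
using ReadingTable = std::map<char32_t, std::set<std::string>>;

// Tiny illustrative sample; a real table would be loaded from data.
ReadingTable sample_table() {
    return ReadingTable{
        {U'\u5356', {"mai4"}},            // 卖
        {U'\u9EA6', {"mai4"}},            // 麦
        {U'\u4E70', {"mai3"}},            // 买
        {U'\u884C', {"xing2", "hang2"}},  // 行 (two readings)
        {U'\u822A', {"hang2"}},           // 航
    };
}

// Two characters match if they are identical or share at least one reading.
bool same_sound(const ReadingTable& t, char32_t a, char32_t b) {
    if (a == b) return true;
    auto ia = t.find(a);
    auto ib = t.find(b);
    if (ia == t.end() || ib == t.end()) return false;
    for (const std::string& r : ia->second) {
        if (ib->second.count(r) != 0) return true;
    }
    return false;
}

// Position-by-position comparison of two strings of code points.
bool fuzzy_equal(const ReadingTable& t, const std::u32string& x,
                 const std::u32string& y) {
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!same_sound(t, x[i], y[i])) return false;
    }
    return true;
}
```

<p class=MsoNormal><span lang=EN-US>Treating a character with several readings as matching on any shared reading is the simplest choice; it will over-match in some contexts, which is part of the complication mentioned above.</span></p>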
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Is anyone interested? :)</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Best regards,</span></p>
<p class=MsoNormal><span lang=EN-US> Cloudwu</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>[One person's judgment sees only so much, like a single glance of the eye; how could it take in the true nature of all things?]</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>_______________________________________________</span></p>
<p class=MsoNormal><span lang=EN-US>Cpp mailing list</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>> ======= 2004-06-02 11:18:58 you wrote: =======</span></p>
<p class=MsoNormal><span lang=EN-US>> </span></p>
<p class=MsoNormal><span lang=EN-US>> > On searching for Chinese characters in C++: 大话西游 (Westward Journey Online) ran into this recently as well, because messages in the trade channel are required to contain “卖” (sell). To test for it exactly, first convert the char* or string to WCHAR or wstring with MultiByteToWideChar, and then search.</span></p>
<p class=MsoNormal><span lang=EN-US>> That only tells me whether the character is present or not; what I actually need is its exact position.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>But it can find the exact position.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>> As for embedding something else: I do not think it is necessary. Those scripting languages are themselves implemented in C/C++ in the end, and may well be using the very things we already know about. Besides, regular expressions are so useful that I use them everywhere, in programs large and small. Embedding a scripting engine in all of those programs just for this is not something I want to do either.</span></p>
<p class=MsoNormal><span lang=EN-US>> </span></p>