📄 langtags.pm
字号:
# Time-stamp: "2004-10-06 23:26:33 ADT"# Sean M. Burke <sburke@cpan.org>require 5.000;package I18N::LangTags;use strict;use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS $VERSION %Panic);require Exporter;@ISA = qw(Exporter);@EXPORT = qw();@EXPORT_OK = qw(is_language_tag same_language_tag extract_language_tags super_languages similarity_language_tag is_dialect_of locale2language_tag alternate_language_tags encode_language_tag panic_languages implicate_supers implicate_supers_strictly );%EXPORT_TAGS = ('ALL' => \@EXPORT_OK);$VERSION = "0.35";sub uniq { my %seen; return grep(!($seen{$_}++), @_); } # a util function=head1 NAMEI18N::LangTags - functions for dealing with RFC3066-style language tags=head1 SYNOPSIS use I18N::LangTags();...or specify whichever of those functions you want to import, like so: use I18N::LangTags qw(implicate_supers similarity_language_tag);All the exportable functions are listed below -- you're free to importonly some, or none at all. By default, none are imported. If yousay: use I18N::LangTags qw(:ALL)...then all are exported. (This saves you from having to usesomething less obvious like C<use I18N::LangTags qw(/./)>.)If you don't import any of these functions, assume a C<&I18N::LangTags::>in front of all the function names in the following examples.=head1 DESCRIPTIONLanguage tags are a formalism, described in RFC 3066 (obsoleting1766), for declaring what language form (language and possiblydialect) a given chunk of information is in.This library provides functions for common tasks involving languagetags as they are needed in a variety of protocols and applications.Please see the "See Also" references for a thorough explanationof how to correctly use language tags.=over=cut###########################################################################=item * the function is_language_tag($lang1)Returns true iff $lang1 is a formally valid language tag. is_language_tag("fr") is TRUE is_language_tag("x-jicarilla") is FALSE (Subtags can be 8 chars long at most -- 'jicarilla' is 9) is_language_tag("sgn-US") is TRUE (That's American Sign Language) is_language_tag("i-Klikitat") is TRUE (True without regard to the fact noone has actually registered Klikitat -- it's a formally valid tag) is_language_tag("fr-patois") is TRUE (Formally valid -- altho descriptively weak!) is_language_tag("Spanish") is FALSE is_language_tag("french-patois") is FALSE (No good -- first subtag has to match /^([xXiI]|[a-zA-Z]{2,3})$/ -- see RFC3066) is_language_tag("x-borg-prot2532") is TRUE (Yes, subtags can contain digits, as of RFC3066)=cutsub is_language_tag { ## Changes in the language tagging standards may have to be reflected here. my($tag) = lc($_[0]); return 0 if $tag eq "i" or $tag eq "x"; # Bad degenerate cases that the following # regexp would erroneously let pass return $tag =~ /^(?: # First subtag [xi] | [a-z]{2,3} ) (?: # Subtags thereafter - # separator [a-z0-9]{1,8} # subtag )* $/xs ? 1 : 0;}###########################################################################=item * the function extract_language_tags($whatever)Returns a list of whatever looks like formally valid language tagsin $whatever. Not very smart, so don't get too creative withwhat you want to feed it. extract_language_tags("fr, fr-ca, i-mingo") returns: ('fr', 'fr-ca', 'i-mingo') extract_language_tags("It's like this: I'm in fr -- French!") returns: ('It', 'in', 'fr') (So don't just feed it any old thing.)The output is untainted. If you don't know what tainting is,don't worry about it.=cutsub extract_language_tags { ## Changes in the language tagging standards may have to be reflected here. my($text) = $_[0] =~ m/(.+)/ # to make for an untainted result ? $1 : '' ; return grep(!m/^[ixIX]$/s, # 'i' and 'x' aren't good tags $text =~ m/ \b (?: # First subtag [iIxX] | [a-zA-Z]{2,3} ) (?: # Subtags thereafter - # separator [a-zA-Z0-9]{1,8} # subtag )* \b /xsg );}###########################################################################=item * the function same_language_tag($lang1, $lang2)Returns true iff $lang1 and $lang2 are acceptable variant tagsrepresenting the same language-form. same_language_tag('x-kadara', 'i-kadara') is TRUE (The x/i- alternation doesn't matter) same_language_tag('X-KADARA', 'i-kadara') is TRUE (...and neither does case) same_language_tag('en', 'en-US') is FALSE (all-English is not the SAME as US English) same_language_tag('x-kadara', 'x-kadar') is FALSE (these are totally unrelated tags) same_language_tag('no-bok', 'nb') is TRUE (no-bok is a legacy tag for nb (Norwegian Bokmal))C<same_language_tag> works by just seeing whetherC<encode_language_tag($lang1)> is the same asC<encode_language_tag($lang2)>.(Yes, I know this function is named a bit oddly. Call it historicreasons.)=cutsub same_language_tag { my $el1 = &encode_language_tag($_[0]); return 0 unless defined $el1; # this avoids the problem of # encode_language_tag($lang1) eq and encode_language_tag($lang2) # being true if $lang1 and $lang2 are both undef return $el1 eq &encode_language_tag($_[1]) ? 1 : 0;}###########################################################################=item * the function similarity_language_tag($lang1, $lang2)Returns an integer representing the degree of similarity betweentags $lang1 and $lang2 (the order of which does not matter), wheresimilarity is the number of common elements on the left,without regard to case and to x/i- alternation. similarity_language_tag('fr', 'fr-ca') is 1 (one element in common) similarity_language_tag('fr-ca', 'fr-FR') is 1 (one element in common) similarity_language_tag('fr-CA-joual', 'fr-CA-PEI') is 2 similarity_language_tag('fr-CA-joual', 'fr-CA') is 2 (two elements in common) similarity_language_tag('x-kadara', 'i-kadara') is 1 (x/i- doesn't matter) similarity_language_tag('en', 'x-kadar') is 0 similarity_language_tag('x-kadara', 'x-kadar') is 0 (unrelated tags -- no similarity) similarity_language_tag('i-cree-syllabic', 'i-cherokee-syllabic') is 0 (no B<leftmost> elements in common!)=cutsub similarity_language_tag { my $lang1 = &encode_language_tag($_[0]); my $lang2 = &encode_language_tag($_[1]); # And encode_language_tag takes care of the whole # no-nyn==nn, i-hakka==zh-hakka, etc, things # NB: (i-sil-...)? (i-sgn-...)? return undef if !defined($lang1) and !defined($lang2); return 0 if !defined($lang1) or !defined($lang2); my @l1_subtags = split('-', $lang1); my @l2_subtags = split('-', $lang2); my $similarity = 0; while(@l1_subtags and @l2_subtags) { if(shift(@l1_subtags) eq shift(@l2_subtags)) { ++$similarity; } else { last; } } return $similarity;}###########################################################################=item * the function is_dialect_of($lang1, $lang2)Returns true iff language tag $lang1 represents a subform oflanguage tag $lang2.B<Get the order right! It doesn't work the other way around!> is_dialect_of('en-US', 'en') is TRUE (American English IS a dialect of all-English) is_dialect_of('fr-CA-joual', 'fr-CA') is TRUE is_dialect_of('fr-CA-joual', 'fr') is TRUE (Joual is a dialect of (a dialect of) French) is_dialect_of('en', 'en-US') is FALSE (all-English is a NOT dialect of American English) is_dialect_of('fr', 'en-CA') is FALSE is_dialect_of('en', 'en' ) is TRUE is_dialect_of('en-US', 'en-US') is TRUE (B<Note:> these are degenerate cases) is_dialect_of('i-mingo-tom', 'x-Mingo') is TRUE (the x/i thing doesn't matter, nor does case) is_dialect_of('nn', 'no') is TRUE (because 'nn' (New Norse) is aliased to 'no-nyn', as a special legacy case, and 'no-nyn' is a subform of 'no' (Norwegian))=cutsub is_dialect_of { my $lang1 = &encode_language_tag($_[0]); my $lang2 = &encode_language_tag($_[1]); return undef if !defined($lang1) and !defined($lang2); return 0 if !defined($lang1) or !defined($lang2); return 1 if $lang1 eq $lang2; return 0 if length($lang1) < length($lang2); $lang1 .= '-'; $lang2 .= '-'; return (substr($lang1, 0, length($lang2)) eq $lang2) ? 1 : 0;}###########################################################################=item * the function super_languages($lang1)Returns a list of language tags that are superordinate tags to $lang1-- it gets this by removing subtags from the end of $lang1 untilnothing (or just "i" or "x") is left. super_languages("fr-CA-joual") is ("fr-CA", "fr") super_languages("en-AU") is ("en") super_languages("en") is empty-list, () super_languages("i-cherokee") is empty-list, () ...not ("i"), which would be illegal as well as pointless.If $lang1 is not a valid language tag, returns empty-list ina list context, undef in a scalar context.A notable and rather unavoidable problem with this method:"x-mingo-tom" has an "x" because the whole tag isn't anIANA-registered tag -- but super_languages('x-mingo-tom') is('x-mingo') -- which isn't really right, since 'i-mingo' isregistered. But this module has no way of knowing that. (But notethat same_language_tag('x-mingo', 'i-mingo') is TRUE.)More importantly, you assume I<at your peril> that superordinates of$lang1 are mutually intelligible with $lang1. Consider thiscarefully.=cut sub super_languages { my $lang1 = $_[0]; return() unless defined($lang1) && &is_language_tag($lang1); # a hack for those annoying new (2001) tags: $lang1 =~ s/^nb\b/no-bok/i; # yes, backwards $lang1 =~ s/^nn\b/no-nyn/i; # yes, backwards $lang1 =~ s/^[ix](-hakka\b)/zh$1/i; # goes the right way # i-hakka-bork-bjork-bjark => zh-hakka-bork-bjork-bjark my @l1_subtags = split('-', $lang1); ## Changes in the language tagging standards may have to be reflected here. # NB: (i-sil-...)? my @supers = (); foreach my $bit (@l1_subtags) { push @supers, scalar(@supers) ? ($supers[-1] . '-' . $bit) : $bit; } pop @supers if @supers; shift @supers if @supers && $supers[0] =~ m<^[iIxX]$>s; return reverse @supers;}###########################################################################=item * the function locale2language_tag($locale_identifier)This takes a locale name (like "en", "en_US", or "en_US.ISO8859-1")and maps it to a language tag. If it's not mappable (as with,notably, "C" and "POSIX"), this returns empty-list in a list context,or undef in a scalar context. locale2language_tag("en") is "en" locale2language_tag("en_US") is "en-US" locale2language_tag("en_US.ISO8859-1") is "en-US" locale2language_tag("C") is undef or () locale2language_tag("POSIX") is undef or () locale2language_tag("POSIX") is undef or ()I'm not totally sure that locale names map satisfactorily to languagetags. Think REAL hard about how you use this. YOU HAVE BEEN WARNED.The output is untainted. If you don't know what tainting is,don't worry about it.=cut sub locale2language_tag { my $lang = $_[0] =~ m/(.+)/ # to make for an untainted result ? $1 : '' ; return $lang if &is_language_tag($lang); # like "en" $lang =~ tr<_><->; # "en_US" -> en-US $lang =~ s<(?:[\.\@][-_a-zA-Z0-9]+)+$><>s; # "en_US.ISO8859-1" -> en-US # it_IT.utf8@euro => it-IT return $lang if &is_language_tag($lang); return;}###########################################################################=item * the function encode_language_tag($lang1)This function, if given a language tag, returns an encoding of it suchthat:* tags representing different languages never get the same encoding.* tags representing the same language always get the same encoding.* an encoding of a formally valid language tag always is a stringvalue that is defined, has length, and is true if considered as aboolean.Note that the encoding itself is B<not> a formally valid language tag.Note also that you cannot, currently, go from an encoding back to alanguage tag that it's an encoding of.Note also that you B<must> consider the encoded value as atomic; i.e.,you should not consider it as anything but an opaque, unanalysablestring value. (The internals of the encoding method may change infuture versions, as the language tagging standard changes over time.)C<encode_language_tag> returns undef if given anything other than aformally valid language tag.The reason C<encode_language_tag> exists is because different languagetags may represent the same language; this is normally treatable withC<same_language_tag>, but consider this situation:
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -