Jcode: Japanese charset handler

SYNOPSIS

 use Jcode;
 #
 # traditional
 Jcode::convert(\$str, $ocode, $icode, "z");
 # or OOP!
 print Jcode->new($str)->h2z->tr($from, $to)->utf8;

DESCRIPTION

Japanese documentation is available as Jcode::Nihongo.

Jcode.pm supports both an object-oriented and a traditional (procedural) approach. With the object-oriented approach, you can write:

$iso_2022_jp = Jcode->new($str)->h2z->jis;

Which is more elegant than:

$iso_2022_jp = $str;
&jcode::convert(\$iso_2022_jp, 'jis', &jcode::getcode(\$str), "z");

For those unfamiliar with objects, Jcode.pm still supports getcode() and convert().

If the perl version is 5.8.1 or later, Jcode acts as a wrapper to Encode, the standard charset handler module for Perl 5.8 or later.
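As a rough illustration of the two approaches described above, the following sketch converts the same UTF-8 bytes to EUC-JP through both interfaces and compares the results. The sample string (three kanji written as byte escapes) and the variable names are assumptions made for this example, and it assumes Jcode is installed from CPAN.

```perl
use strict;
use warnings;
use Jcode;

# Three kanji ("Nihongo") as UTF-8 bytes, written with escapes so the
# example does not depend on the encoding of this source file.
my $utf8 = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

# Object-oriented approach: build an object, then ask for EUC-JP.
# The input encoding is given explicitly instead of being auto-detected.
my $euc_oo = Jcode->new($utf8, 'utf8')->euc;

# Traditional approach: convert() rewrites the referenced string in place,
# so work on a copy to keep the original bytes.
my $euc_trad = $utf8;
Jcode::convert(\$euc_trad, 'euc', 'utf8');

print $euc_oo eq $euc_trad ? "same result\n" : "different result\n";
```

Both calls go through the same conversion, so which one to use is mostly a matter of taste; the object form is easier to chain.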

Methods

All methods described here return the Jcode object unless otherwise noted.

$j = Jcode->new($str [, $icode])

Creates a Jcode object $j from $str. The input code is detected automatically unless you explicitly set $icode. For the available charsets, see getcode below.

For perl 5.8.1 or better, $icode can be any encoding name that Encode understands.

 $j = Jcode->new($european, 'iso-latin1');

When the object is stringified, it returns the EUC-converted string, so you can write "print $j" instead of "print $j->euc".

Passing Reference: Instead of a scalar value, you can pass a reference:

 Jcode->new(\$str);

This saves a little time, at the cost of the value of $str itself being converted. (In a way, $str is now "tied" to the Jcode object.)

$j->set($str [, $icode]): Sets $j's internal string to $str. Handy when you use a Jcode object repeatedly (it saves the time and memory needed to create a new object).

 # converts mailbox to SJIS format
 my $jconv = Jcode->new;
 $/ = "";  # paragraph mode, as with perl -00
 while (<>) {
     print $jconv->set(\$_)->mime_decode->sjis;
 }

$j->append($str): Appends $str to $j's internal string.

$j = jcode($str [, $icode]): Shortcut for Jcode->new(), so you can write:

 $sjis = jcode($str)->sjis

In general, you can retrieve the encoded string as $j->encoded.

$j->euc, $j->jis, $j->sjis, $j->ucs2, $j->utf8: What you code is what you get :)

$j->iso_2022_jp: Same as $j->h2z->jis. Hankaku kana are forcibly converted to Zenkaku.

For perl 5.8.1 and better, you can also use any encoding names and aliases that Encode supports. For example:

 $european = $j->iso_latin1; # replace '-' with '_' for names.

FYI: Encode::Encoder uses a similar trick.
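To make the encoder methods concrete, here is a minimal sketch that creates one object and retrieves the same text in several encodings. The sample string and variable names are assumptions for illustration; the input encoding is passed explicitly rather than relying on auto-detection.

```perl
use strict;
use warnings;
use Jcode;

# Three kanji ("Nihongo") as UTF-8 bytes, written as escapes.
my $utf8 = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

# jcode() is the shortcut for Jcode->new().
my $j = jcode($utf8, 'utf8');

my $euc  = $j->euc;    # EUC-JP bytes
my $sjis = $j->sjis;   # Shift_JIS bytes
my $jis  = $j->jis;    # JIS (ISO-2022-JP) bytes, with escape sequences

# Stringifying the object yields the EUC-converted string.
my $as_string = "$j";

# Show each result as hex so the bytes are readable on any terminal.
printf "%-4s %s\n", $_->[0], unpack('H*', $_->[1])
    for [euc => $euc], [sjis => $sjis], [jis => $jis];
```

The encoder methods return plain strings rather than the object, so they normally come last in a chain.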

$j->fallback($fallback): For perl 5.8.1 or better, Jcode stores the internal string in UTF-8. Any character that does not map to ->encoding is replaced with a '?', which is the Encode standard.

 my $unistr = "\x{262f}"; # YIN YANG
 my $j = jcode($unistr);  # $j->euc is '?'

You can change this behavior by specifying a fallback, as with Encode. The values are the same as those of Encode. Jcode::FB_PERLQQ, Jcode::FB_XMLCREF and Jcode::FB_HTMLCREF are aliased to those of Encode for convenience.

 print $j->fallback(Jcode::FB_PERLQQ)->euc;   # '\x{262f}'
 print $j->fallback(Jcode::FB_XMLCREF)->euc;  # '&#x262f;'
 print $j->fallback(Jcode::FB_HTMLCREF)->euc; # '&#9775;'

The global variable $Jcode::FALLBACK stores the default fallback, so you can override it by assigning a value.

 $Jcode::FALLBACK = Jcode::FB_PERLQQ; # set default fallback scheme
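The fallback behavior can be tried out with a short script along these lines. It is only a sketch: the variable names are illustrative, it assumes perl 5.8.1 or better with Jcode installed, and it calls the fallback constants with explicit parentheses so they parse as function calls under strict.

```perl
use strict;
use warnings;
use Jcode;

# U+262F YIN YANG has no mapping in EUC-JP.
my $j = jcode("\x{262f}");

# Default behavior: the unmappable character becomes a substitute.
my $default = $j->euc;

# Per-object fallbacks, using the constants aliased from Encode.
my $perlqq   = $j->fallback(Jcode::FB_PERLQQ())->euc;    # Perl-style escape
my $xmlcref  = $j->fallback(Jcode::FB_XMLCREF())->euc;   # hexadecimal reference
my $htmlcref = $j->fallback(Jcode::FB_HTMLCREF())->euc;  # decimal reference

print join("\n", $default, $perlqq, $xmlcref, $htmlcref), "\n";
```

Note that fallback() changes the object's setting, so each later encoder call on the same object uses the most recently chosen fallback.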

$j->jfold([$width, $newline_str]): Folds lines in the Jcode string every $width columns (default: 72), where $width is counted in "halfwidth" characters; fullwidth characters count as two. Lines are folded with the newline string specified by $newline_str (default: "\n"). Rudimentary kinsoku (Japanese line-breaking rule) support is available for Perl 5.8.1 and better.

$j->jlength: Returns the length in characters, rather than the length in bytes.

To use the methods below, you need MIME::Base64. To install it, simply run:

perl -MCPAN -e 'CPAN::Shell->install("MIME::Base64")'

If your perl is 5.6 or better, there is no need, since MIME::Base64 is bundled.

$mime_header = $j->mime_encode([$lf, $bpl]): Converts $str to a MIME-Header as documented in RFC 1522. When $lf is specified, it is used to fold lines (default: \n). When $bpl is specified, it is used as the number of bytes per line (default: 76; this number must be smaller than 76).

For Perl 5.8.1 or better, you can also encode a MIME header as:

 $mime_header = $j->MIME_Header;

In this case the resulting $mime_header is MIME-B-encoded UTF-8, whereas $j->mime_encode() returns MIME-B-encoded ISO-2022-JP. Most modern MUAs support both.

$j->mime_decode: Decodes the MIME-Header in the Jcode object. For perl 5.8.1 or better, you can also do the same with Jcode->new($str, 'MIME-Header').
$j->h2z([$keep_dakuten]): Converts X201 kana (Hankaku) to X208 kana (Zenkaku). When $keep_dakuten is set, dakuten are left as is (that is, "ka + dakuten" is kept as is instead of being converted to "ga"). You can retrieve the number of matches via $j->nmatch.
$j->z2h: Converts X208 kana (Zenkaku) to X201 kana (Hankaku). You can retrieve the number of matches via $j->nmatch.

To use ->m() and ->s(), you need perl 5.8.1 or better.

$j->tr($from, $to [, $opt]): Applies tr/$from/$to/ on the Jcode object, where $from and $to are EUC-JP strings. On perl 5.8.1 or better, $from and $to can also be flagged UTF-8 strings. If $opt is set, tr/$from/$to/$opt is applied. $opt must be 'c', 'd' or a combination thereof. You can retrieve the number of matches via $j->nmatch.

The following methods are available only for perl 5.8.1 or better.

$j->s($pattern, $replace [, $opt]): Applies s/$pattern/$replace/$opt. $pattern and $replace must be in EUC-JP or flagged UTF-8. $opt takes the same values as regexp options; see perlre for details. Like $j->tr(), $j->s() returns the object itself, so you can chain the operations as follows:

 $j->tr("a-z", "A-Z")->s("foo", "bar");

$j->m($pattern [, $opt]): Applies m/$pattern/$opt. Note that this method DOES NOT RETURN AN OBJECT, so you can't chain it like $j->s().

If you need to access instance variables of a Jcode object, use the access methods below instead of accessing them directly (that's what OOP is all about).

FYI, Jcode uses a ref to an array instead of a ref to a hash (the common way) to optimize speed. (Actually, you don't have to know this as long as you use the access methods; once again, that's OOP.)

$j->r_str: Reference to the EUC-coded String.
$j->icode: Input character code of the most recent operation.
$j->nmatch: Number of matches (Used in $j->tr, etc.)
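The kana conversion and regexp-style methods described above can be combined roughly as follows. This is a sketch under stated assumptions: the sample strings and variable names are made up for illustration, and the s() and m() calls need perl 5.8.1 or better.

```perl
use strict;
use warnings;
use Jcode;

# Halfwidth "ka" (U+FF76) followed by a halfwidth dakuten (U+FF9E),
# given as UTF-8 bytes.
my $hankaku = "\xef\xbd\xb6\xef\xbe\x9e";

# Without $keep_dakuten, h2z turns the pair into the single
# fullwidth character "ga" (U+30AC).
my $zenkaku = Jcode->new($hankaku, 'utf8')->h2z->utf8;

# tr() and s() return the object, so they can be chained.
my $j = Jcode->new('foo 2024');
my $upper = $j->tr('a-z', 'A-Z')->euc;

# nmatch is read through its access method right after tr().
my $count = $j->nmatch;

my $replaced = $j->s('FOO', 'bar')->euc;

# m() returns the captured matches, not the object.
my @year = $j->m('(\d+)');

print "$upper / $replaced / @year / $count match(es)\n";
```

Because m() ends a chain, it is usually called last, after any tr() or s() steps.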

Subroutines

($code, [$nmatch]) = getcode($str): Returns the character code of $str. Return codes are as follows:

 ascii   Ascii (Contains no Japanese Code)
 binary  Binary (Not Text File)
 euc     EUC-JP
 sjis    SHIFT_JIS
 jis     JIS (ISO-2022-JP)
 ucs2    UCS2 (Raw Unicode)
 utf8    UTF8

When called in list context instead of scalar context, it also returns how many character codes are found. As mentioned above, $str can be \$str instead.

jcode.pl Users: This function is 100% upward-compatible with jcode::getcode(), well, almost:

* When its return value is an array, the order is the opposite; jcode::getcode() returns $nmatch first.
* jcode::getcode() returns 'undef' when the number of EUC characters is equal to that of SJIS. Jcode::getcode() returns EUC. For Jcode.pm there is no in-between.

Jcode::convert($str, $ocode [, $icode, $opt]): Converts $str to the character code specified by $ocode. When $icode is also specified, it is assumed as the code of the input string instead of the one detected by getcode(). As mentioned above, $str can be \$str instead.

jcode.pl Users: This function is 100% upward-compatible with jcode::convert()!
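A small sketch of the two subroutines working together is shown below. The sample strings and variable names are assumptions for illustration; an ISO-2022-JP string is used because its escape sequences make detection unambiguous.

```perl
use strict;
use warnings;
use Jcode;

# Plain ASCII contains no Japanese code.
my $ascii_code = Jcode::getcode('hello');

# Three kanji ("Nihongo") in ISO-2022-JP, including the escape sequences.
my $jis = "\e\$BF|K\\8l\e(B";

# In list context getcode() also returns the number of codes found.
my ($code, $nmatch) = Jcode::getcode($jis);

# convert() rewrites the referenced string in place, so work on a copy.
# The detected code is passed on as $icode.
my $utf8 = $jis;
Jcode::convert(\$utf8, 'utf8', $code);

print "ascii sample: $ascii_code, jis sample: $code\n";
print "utf8 bytes: ", unpack('H*', $utf8), "\n";
```

Passing the detected code to convert() avoids a second detection pass; omitting $icode lets convert() call getcode() itself.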

BUGS

For perl 5.8.1 or later, Jcode acts as a wrapper to Encode, meaning Jcode is subject to any bugs therein.

ACKNOWLEDGEMENTS

This package owes a lot in motivation, design, and code, to the jcode.pl for Perl4 by Kazumasa Utashiro <[email protected]>.

Hiroki Ohzaki <[email protected]> has helped me polish regexp from the very first stage of development.

JEncode by [email protected] has inspired me to integrate Encode to Jcode. He has also contributed Japanese \s-1POD\s0.

And folks at Jcode Mailing list <[email protected]>. Without them, I couldn't have coded this far.

RELATED TO Jcode…

Encode

Jcode::Nihongo

<http://www.iana.org/assignments/character-sets>

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Jcode (3pm)