Docs For Class ConvertCharset

Description

-- 1.0 2004-07-28 --

-- The most important thing -- I want to thank all the people who helped me fix all the bugs, small and big ones. I hope you don't mind that your names are in this file.

-- Some Apache issues -- I got information from Lukas Lisa that, with some special Apache configurations, you have to call the header() function with the proper encoding to get your result displayed correctly. If you want to see what I mean, look at demo.php and demo1.php.
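A minimal sketch of what such a header() call could look like. The helper name contentTypeHeader and the chosen charset are only illustrations and are not taken from demo.php; the charset should match the $ToCharset you convert to.

```php
<?php
// Hypothetical helper (not part of the class): builds the Content-Type
// header line for a given charset.
function contentTypeHeader(string $charset, string $mime = 'text/html'): string
{
    return 'Content-Type: ' . $mime . '; charset=' . $charset;
}

$line = contentTypeHeader('iso-8859-2');

// header() has to be called before any output is sent to the browser.
if (!headers_sent()) {
    header($line);
}
```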

-- BETA 1.0 2003-10-21 --

-- You should know about... -- To understand this class well you should read all of this first :) but if you are in a hurry, just run demo.php and see what's inside.

That I'm not good at English at 03:45 :) - so forgive me any mistakes
This class is a BETA version because I haven't tested it enough
Feel free to contact me with questions, bug reports and mistakes in PHP and this documentation (email below)

-- In a few words... -- Why ConvertCharset class?

I made this class because I had a lot of problems with different charsets. First, because people from Microsoft wanted to have their own encoding; second, because people from Macromedia didn't think about other languages; third, because sometimes I need to use text written on a Mac, and of course it has its own encoding :)

Notice & remember:

When I say 1 byte string, I mean 1 byte per char.
When I say multibyte string, I mean more than one byte per char.

So, these are the main FEATURES of this class:

conversion between 1 byte charsets
conversion from 1 byte to multibyte charset (utf-8)
conversion from multibyte charset (utf-8) to 1 byte charset
every conversion output can be saved with numeric entities (browser charset independent - not entirely true)

This is the list of charsets you can work with. The basic rule is that a char has to exist in both charsets, otherwise you'll get an error.

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE as "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO (Unix/Linux)
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs
mazovia

-- Now, to the point... -- Here are the main variables.

DEBUG_MODE

You can set this value to:

-1 - No errors or comments
0 - Only error messages, no comments
1 - Error messages and comments

The default value is 1, and during your first steps with the class it should be left as is.

CONVERT_TABLES_DIR

This is the place where you store all the files with charset encodings. Filenames should be the same as the encoding names. My advice is to keep the existing names, because they were taken from unicode.org (www.unicode.org), and after an update to Unicode 3.0 or 4.0 the file names will stay the same. So if you want to save time, leave the names as they are for future updates.

By default, the directory with the encoding files should be in the directory where the class is located, but of course you can change it if you like.

author: Mikolaj Jedrzejak <mikolajj@op.pl>
version: 1.0 2004-07-27 23:11
copyright: Copyright Mikolaj Jedrzejak (c) 2003-2004
link: http://www.unicode.org Unicode Homepage
access: public

Located in /includes/core.classes.php (line 3139)

Variable Summary

mixed $Entities

mixed $RecognizedEncoding

Method Summary

string Convert (string $StringToChange, string $FromCharset, string $ToCharset, [boolean $TurnOnEntities = false])

string DebugOutput (int $Group, int $Number, [mix $Value = false])

string HexToUtf (string $UtfCharInHex)

array MakeConvertTable (string $FirstEncoding, [string $SecondEncoding = ""])

string UnicodeEntity (string $UnicodeString)

Variables

mixed $Entities (line 3141)

mixed $RecognizedEncoding (line 3140)

Methods

Convert (line 3366)

ConvertCharset::Convert()

This is the basic function you will use. I hope you can figure out its syntax :-)

return: Converted string in brand new encoding :)
version: 1.0 2004-07-27 01:09

string Convert (string $StringToChange, string $FromCharset, string $ToCharset, [boolean $TurnOnEntities = false])

string $StringToChange: The string you want to change :)
string $FromCharset: Name of the encoding of $StringToChange; you have to know it.
string $ToCharset: Name of the charset you want to get for $StringToChange.
boolean $TurnOnEntities: Set to true or 1 if you want to use numeric entities instead of regular chars.
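A hedged usage sketch for the signature above. It assumes the class file has already been loaded and that the class can be created without constructor arguments; neither is stated in this documentation. The wrapper convertIfAvailable is hypothetical and only keeps the example runnable when the class is not present.

```php
<?php
// Hypothetical wrapper: calls ConvertCharset::Convert() if the class is
// loaded, and returns null otherwise.
function convertIfAvailable(string $text, string $from, string $to, bool $entities = false): ?string
{
    if (!class_exists('ConvertCharset')) {
        return null; // class file not loaded, nothing to convert with
    }
    // Assumption: the constructor takes no arguments.
    $converter = new ConvertCharset();
    return $converter->Convert($text, $from, $to, $entities);
}

// "\xB9" is the letter U+0105 (a with ogonek) in windows-1250.
$result = convertIfAvailable("\xB9", 'windows-1250', 'utf-8');
```

Passing true as the fourth argument corresponds to $TurnOnEntities and asks for numeric entities in the output.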

DebugOutput (line 3557)

ConvertCharset::DebugOutput()

This function is not really necessary; the debug output could stay inside the source code, but this way it's easier to manage and translate. Besides, I couldn't find a good comment/debug class :-) Maybe I'll write one someday...

All messages depend on the DEBUG_MODE level. As I wrote before, you can set this value to:

-1 - No errors or notices are shown
0 - Only error messages are shown, no notices
1 - Error messages and notices are shown

return: String with a proper message.

string DebugOutput (int $Group, int $Number, [mix $Value = false])

int $Group: Message group: error - 0, notice - 1
int $Number: Following message number
mix $Value: This value can be whatever you want; usually it's some parameter value that makes the message easier to understand.
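The relation between the message group and the DEBUG_MODE level can be sketched as a single comparison. The function isMessageVisible is hypothetical and is not part of the class; it only restates the three levels listed above.

```php
<?php
// Hypothetical helper: tells whether a message from the given group
// is shown at the given DEBUG_MODE level.
//   group: 0 = error, 1 = notice
//   level: -1 = nothing, 0 = errors only, 1 = errors and notices
function isMessageVisible(int $group, int $debugMode): bool
{
    return $group <= $debugMode;
}
```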

HexToUtf (line 3242)

ConvertCharset::HexToUtf()

This simple function takes a Unicode char of up to 4 bytes and returns it as a regular char. It is very similar to the UnicodeEntity function (link below). The one difference is the returned format: this time it's regular char(s), in most cases one or two chars.

return: Encoded hexadecimal value as a regular char.
see: ConvertCharset::UnicodeEntity()

string HexToUtf (string $UtfCharInHex)

string $UtfCharInHex: Hexadecimal value of a unicode char.
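For readers who want to see the idea, here is a minimal sketch of turning a hexadecimal code point into UTF-8 bytes. It is not the class's own code; the name hexToUtf8Sketch is hypothetical, and accepting an optional "0x" prefix is an assumption.

```php
<?php
// Hypothetical sketch: converts a hex code point such as "0105" into the
// UTF-8 byte sequence for that char.
function hexToUtf8Sketch(string $hex): string
{
    // Drop an optional "0x" prefix before parsing the hex digits.
    $cp = (int) hexdec(preg_replace('/^0x/i', '', $hex));

    if ($cp < 0x80) {            // 1 byte:  0bbbbbbb
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110bbbbb 10bbbbbb
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}
```

For example, hexToUtf8Sketch('0105') gives the two bytes 0xC4 0x85, which matches the "one or two chars" remark above for most European letters.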

MakeConvertTable (line 3290)

ConvertCharset::MakeConvertTable()

This function creates a table from two SBCS (Single Byte Character Set) encodings. Every conversion goes through this table.

The files with encoding tables have to be saved in "Format A" of the unicode.org charset table format! This is usually stated in the header of every charset file.
BOTH charsets MUST be SBCS
The files with encoding tables have to be complete (none of the chars can be missing, unless you are sure you are not going to use them)

If you have to build a "Format A" encoding file yourself, it should follow these rules:

you can comment anything out with #
the first column contains 1 byte chars in hex, starting with 0x..
the second column contains the Unicode equivalent in hex, starting with 0x....
every following column is optional, but in "Format A" it should contain the Unicode char name and/or your own comment
the columns can be separated by spaces, tabs, "," or any combination of these
below is an example

#
# The entries are in ANSI X3.4 order.
#
0x00 0x0000 # NULL and extra comment, if needed
0x01 0x0001 # START OF HEADING
# Oh, one more thing, you can put comments between rows if you like.
0x02 0x0002 # START OF TEXT
0x03 0x0003 # END OF TEXT
next line, and so on...

You can get full tables with encodings from http://www.unicode.org
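The rules above can be sketched as a small parser. This is only an illustration of reading a "Format A" file under those rules, not the code used inside MakeConvertTable; the function name parseFormatA is hypothetical, and silently skipping lines that do not start with two 0x columns is an assumption.

```php
<?php
// Hypothetical sketch: parses "Format A" text into an array that maps
// a byte value to its Unicode code point.
function parseFormatA(string $contents): array
{
    $table = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // Everything after '#' is a comment.
        $hashPos = strpos($line, '#');
        if ($hashPos !== false) {
            $line = substr($line, 0, $hashPos);
        }
        // Columns may be separated by spaces, tabs, commas, or a mix.
        $columns = preg_split('/[\s,]+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        if (count($columns) < 2) {
            continue; // blank line or comment-only line
        }
        // Both columns must look like 0x.. hex values; other rows are skipped.
        if (!preg_match('/^0x([0-9A-Fa-f]+)$/', $columns[0], $byte)
            || !preg_match('/^0x([0-9A-Fa-f]+)$/', $columns[1], $code)) {
            continue;
        }
        $table[(int) hexdec($byte[1])] = (int) hexdec($code[1]);
    }
    return $table;
}

$sample = "#\n# The entries are in ANSI X3.4 order.\n#\n"
        . "0x00 0x0000 # NULL\n"
        . "0x01\t0x0001 # START OF HEADING\n"
        . "0xB9, 0x0105\n";
$table = parseFormatA($sample);
```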

return: Table necessary to change one encoding to another.

array MakeConvertTable (string $FirstEncoding, [string $SecondEncoding = ""])

string $FirstEncoding: Name of the first encoding and of its file (they have to be the same)
string $SecondEncoding: Name of the second encoding and of its file (they have to be the same). Optional for building a joined table.
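To illustrate the idea of a joined table, the sketch below links two byte-to-Unicode maps through their shared code points and then converts a 1 byte string with the result. The names joinTables and convertSingleByte are hypothetical, the array layout returned by the real MakeConvertTable may differ, and throwing an exception for an unmapped char is an assumption about error handling.

```php
<?php
// Hypothetical sketch: both inputs map byte => Unicode code point.
// The result maps a byte in the first charset to a byte in the second.
function joinTables(array $first, array $second): array
{
    $unicodeToSecond = array_flip($second); // code point => byte in second
    $joined = [];
    foreach ($first as $byte => $codePoint) {
        if (isset($unicodeToSecond[$codePoint])) {
            $joined[$byte] = $unicodeToSecond[$codePoint];
        }
    }
    return $joined;
}

// Converts a 1 byte string; a char has to exist in both charsets.
function convertSingleByte(string $text, array $joined): string
{
    $out = '';
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $byte = ord($text[$i]);
        if (!isset($joined[$byte])) {
            throw new InvalidArgumentException(sprintf('No mapping for byte 0x%02X', $byte));
        }
        $out .= chr($joined[$byte]);
    }
    return $out;
}

// Tiny excerpts: U+0105 is 0xB9 in windows-1250 and 0xB1 in iso-8859-2.
$cp1250 = [0x41 => 0x0041, 0xB9 => 0x0105];
$latin2 = [0x41 => 0x0041, 0xB1 => 0x0105];
$joined = joinTables($cp1250, $latin2);
$converted = convertSingleByte("A\xB9", $joined);
```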

UnicodeEntity (line 3167)

ConvertCharset::UnicodeEntity()

UTF-8 encoding: bytes and bits representation. Each b represents a bit that can be used to store character data.

bytes, bits, binary representation
1, 7, 0bbbbbbb
2, 11, 110bbbbb 10bbbbbb
3, 16, 1110bbbb 10bbbbbb 10bbbbbb
4, 21, 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb

This function is written the "long" way, for everyone who would like to analyze the process of Unicode encoding and understand it. All other functions, like HexToUtf, are written in the "shortest" way I can write them :) which doesn't mean they are short, of course. You can check it in HexToUtf() (link below) - a very similar function.

IMPORTANT: Remember that the $UnicodeString input CANNOT contain single byte upper half extended ASCII codes. Why? Because there is a possibility that this function will eat the following char, thinking it's a multibyte Unicode char.

return: The input string, also with Unicode chars, but saved as entities
see: ConvertCharset::HexToUtf()

string UnicodeEntity (string $UnicodeString)

string $UnicodeString: Input Unicode string (1 char can take more than 1 byte)
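The bit layout in the table above can be turned into a short decoder. This is a sketch of the technique, not the class's own implementation: the name utf8ToEntitiesSketch is hypothetical, leaving plain ASCII chars unchanged is an assumption, and the sketch throws on a stray upper half byte where the note above says the real function may eat the following char.

```php
<?php
// Hypothetical sketch: replaces every multibyte UTF-8 char with a
// decimal numeric entity and leaves 1 byte (ASCII) chars as they are.
function utf8ToEntitiesSketch(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    $i = 0;
    while ($i < $len) {
        $byte = ord($utf8[$i]);
        if ($byte < 0x80) {                    // 0bbbbbbb
            $out .= $utf8[$i];
            $i++;
            continue;
        }
        if (($byte & 0xE0) === 0xC0) {         // 110bbbbb
            $extra = 1;
            $cp = $byte & 0x1F;
        } elseif (($byte & 0xF0) === 0xE0) {   // 1110bbbb
            $extra = 2;
            $cp = $byte & 0x0F;
        } elseif (($byte & 0xF8) === 0xF0) {   // 11110bbb
            $extra = 3;
            $cp = $byte & 0x07;
        } else {
            throw new InvalidArgumentException('Invalid UTF-8 lead byte');
        }
        if ($i + $extra >= $len) {
            throw new InvalidArgumentException('Truncated UTF-8 sequence');
        }
        // Each continuation byte (10bbbbbb) adds 6 more bits.
        for ($k = 1; $k <= $extra; $k++) {
            $next = ord($utf8[$i + $k]);
            if (($next & 0xC0) !== 0x80) {
                throw new InvalidArgumentException('Invalid UTF-8 continuation byte');
            }
            $cp = ($cp << 6) | ($next & 0x3F);
        }
        $out .= '&#' . $cp . ';';
        $i += $extra + 1;
    }
    return $out;
}
```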