PHP4/5 - Parsing LFS Strings.
(29 posts, started )
I want some ideas on how to parse LFS strings in PHP. Right now I'm using a concoction of Victor's LFS Hostname Codepage Converter & D34N0's Format Host Colour.

That now looks like this.


<?php
# Function by Mark 'Dygear' Tomlin;
function hostToHTML($hostName) {
    return format_host_colours(codepage_convert($hostName));
}
# Function By Victor van Vlaardingen: http://www.lfsforum.net/showthread.php?t=36628
function codepage_convert($str, $conv_to = 'UTF-8') {
    # LFS codepage letter => encoding name understood by mbstring.
    $sets = array (
        'L' => 'CP1252',
        'G' => 'ISO-8859-7',
        'C' => 'CP1251',
        'E' => 'ISO-8859-2',
        'T' => 'ISO-8859-9',
        'B' => 'ISO-8859-13',
        'J' => 'SJIS-win',
        'S' => 'CP936',
        'K' => 'CP949',
        'H' => 'CP950'
    );
    # LFS escape sequences for special characters.
    $tr_ptrn = array ("/\^d/", "/\^s/", "/\^c/", "/\^a/", "/\^q/", "/\^t/", "/\^l/", "/\^r/", "/\^v/");
    $tr_ptrn_r = array ("\\", "/", ":", "*", "?", "\"", "<", ">", "|");
    $str = preg_replace ($tr_ptrn, $tr_ptrn_r, $str);
    $newstr = $tmp = '';
    $current_cp = 'L';
    $len = strlen ($str);
    for ($i=0; $i<$len; $i++) {
        if ($str[$i] == '^' && isset ($sets[$str[$i+1]]) && $str[$i-1] != "^") {
            # Codepage switch: convert what has been collected so far.
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        } else if (ord($str[$i]) > 31)
            $tmp .= $str[$i];
    }
    if ($tmp != '')
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
    return str_replace ('^^', '^', $newstr);
}
# Function by D34N0: http://www.lfsforum.net/showthread.php?p=35947#post35947
# Function Edited by Mark 'Dygear' Tomlin;
function get_colour($ColourNum) {
    switch ($ColourNum) {
        case 0: return '#000000'; # Black
        case 1: return '#FF0000'; # Red
        case 2: return '#00FF00'; # Green
        case 3: return '#FFFF00'; # Yellow
        case 4: return '#0000FF'; # Light Blue
        case 5: return '#FF0080'; # Light Purple
        case 6: return '#00FFFF'; # Turquoise
        case 7: return '#FFFFFF'; # White
        case 8: return '#00FF00'; # Pastel Green
    }
}
# Function by D34N0: http://www.lfsforum.net/showthread.php?p=35947#post35947
# Function Edited by Mark 'Dygear' Tomlin;
function format_host_colours($HostName) {
    for ($i = 0; $i < strlen($HostName); $i++) {
        if (substr($HostName, $i, 1) == "^") {
            $CharPos = strpos($HostName, "^", $i);
            $ColNum = substr($HostName, strpos($HostName, "^", $i) + 1, 1);
            $ColourString = get_colour(substr($HostName, strpos($HostName, "^", $i) + 1, 1));
            if ($i == "0") {
                $TmpString = substr($HostName, $i+2);
                $HostName = "<span style=\"color: {$ColourString}\">{$TmpString}";
            } else {
                $LTmpString = substr($HostName, 0, $i);
                $RTmpString = substr($HostName, $i+2);
                $HostName = "{$LTmpString}</span><span style=\"color: {$ColourString}\">{$RTmpString}";
            }
        }
    }
    $HostName .= '</span>';
    return $HostName;
}
?>

We can do better than this. So, let's see what ya got!
I've changed some things around and this is what I got. I also find it interesting that ^8 should return the color back to default, but in this case it's handled differently. I'm going to change that, but I'm going to have to revamp the whole function to do that.


<?php
# Function by Mark 'Dygear' Tomlin;
function hostToHTML($hostName) {
    return format_host_colours(codepage_convert($hostName));
}
# Function By Victor van Vlaardingen: http://www.lfsforum.net/showthread.php?t=36628
function codepage_convert($str, $conv_to = 'UTF-8') {
    $sets = array (
        'L' => 'CP1252',
        'G' => 'ISO-8859-7',
        'C' => 'CP1251',
        'E' => 'ISO-8859-2',
        'T' => 'ISO-8859-9',
        'B' => 'ISO-8859-13',
        'J' => 'SJIS-win',
        'S' => 'CP936',
        'K' => 'CP949',
        'H' => 'CP950'
    );
    $tr_ptrn = array ("/\^d/", "/\^s/", "/\^c/", "/\^a/", "/\^q/", "/\^t/", "/\^l/", "/\^r/", "/\^v/");
    $tr_ptrn_r = array ("\\", "/", ":", "*", "?", "\"", "<", ">", "|");
    $str = preg_replace ($tr_ptrn, $tr_ptrn_r, $str);
    $newstr = $tmp = '';
    $current_cp = 'L';
    $len = strlen ($str);
    for ($i=0; $i<$len; $i++) {
        if ($str[$i] == '^' && isset ($sets[$str[$i+1]]) && $str[$i-1] != "^") {
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        } else if (ord($str[$i]) > 31)
            $tmp .= $str[$i];
    }
    if ($tmp != '')
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
    return str_replace ('^^', '^', $newstr);
}
# Function by D34N0: http://www.lfsforum.net/showthread.php?p=35947#post35947
# Function Edited by Mark 'Dygear' Tomlin;
function format_host_colours($HostName) {
    # Colour code digit => hex colour (without the leading '#').
    $color = array(
        0 => '000000', # Black
        1 => 'FF0000', # Red
        2 => '00FF00', # Green
        3 => 'FFFF00', # Yellow
        4 => '0000FF', # Light Blue
        5 => 'FF00FF', # Light Purple
        6 => '00FFFF', # Turquoise
        7 => 'FFFFFF', # White
        8 => '949494'  # Default (Grey)
    );

    for ($i = 0; $i < strlen($HostName); $i++) {
        if (substr($HostName, $i, 1) == '^') {
            $ColourString = $color[substr($HostName, strpos($HostName, '^', $i) + 1, 1)];

            if ($i == 0) {
                $TmpString = substr($HostName, $i + 2);
                $HostName = "<span style=\"color: #{$ColourString}\">{$TmpString}";
            } else {
                $LTmpString = substr($HostName, 0, $i);
                $RTmpString = substr($HostName, $i + 2);
                $HostName = "{$LTmpString}</span><span style=\"color: #{$ColourString}\">{$RTmpString}";
            }
        }
    }
    $HostName .= '</span>';
    return $HostName;
}
?>

^8 does NOT reset back to the default colour. ^9 does, BUT it does not reset the charset as lfsmanual suggests. ^8 is dark green, which happens to be the default chat message colour.
In other words, ^9 is like color: inherit in CSS, while ^8 explicitly sets it to dark green.
Third change, this time to Victor's function (preg tends to use more memory than the straight str_replace function).


<?php
# Function by Mark 'Dygear' Tomlin;
function hostToHTML($hostName) {
    return format_host_colours(codepage_convert($hostName));
}
# Function By Victor van Vlaardingen: http://www.lfsforum.net/showthread.php?t=36628
function codepage_convert($str, $conv_to = 'UTF-8') {
    $sets = array (
        'L' => 'CP1252',
        'G' => 'ISO-8859-7',
        'C' => 'CP1251',
        'E' => 'ISO-8859-2',
        'T' => 'ISO-8859-9',
        'B' => 'ISO-8859-13',
        'J' => 'SJIS-win',
        'S' => 'CP936',
        'K' => 'CP949',
        'H' => 'CP950'
    );
    # Plain str_replace instead of preg_replace for the special-character escapes.
    $str = str_replace(
        array('^a', '^c', '^d', '^h', '^l', '^q', '^r', '^s', '^t', '^v'),
        array('*', ':', '\\', '#', '<', '?', '>', '/', '"', '|'),
        $str
    );
    $newstr = $tmp = '';
    $current_cp = 'L';
    for ($i = 0, $len = strlen($str); $i < $len; $i++) {
        if ($str[$i] == '^' && isset ($sets[$str[$i+1]]) && $str[$i-1] != '^') {
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        } else if (ord($str[$i]) > 31)
            $tmp .= $str[$i];
    }
    if ($tmp != '')
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
    return str_replace ('^^', '^', $newstr);
}
# Function by D34N0: http://www.lfsforum.net/showthread.php?p=35947#post35947
# Function Edited by Mark 'Dygear' Tomlin;
function format_host_colours($HostName) {
    $color = array(
        0 => '000000', # Black
        1 => 'FF0000', # Red
        2 => '00FF00', # Green
        3 => 'FFFF00', # Yellow
        4 => '0000FF', # Light Blue
        5 => 'FF00FF', # Light Purple
        6 => '00FFFF', # Turquoise
        7 => 'FFFFFF', # White
        8 => '949494'  # Default (Grey)
    );

    for ($i = 0; $i < strlen($HostName); $i++) {
        if (substr($HostName, $i, 1) == '^') {
            $ColourString = $color[substr($HostName, strpos($HostName, '^', $i) + 1, 1)];

            if ($i == 0) {
                $TmpString = substr($HostName, $i + 2);
                $HostName = "<span style=\"color: #{$ColourString}\">{$TmpString}";
            } else {
                $LTmpString = substr($HostName, 0, $i);
                $RTmpString = substr($HostName, $i + 2);
                $HostName = "{$LTmpString}</span><span style=\"color: #{$ColourString}\">{$RTmpString}";
            }
        }
    }
    $HostName .= '</span>';
    return $HostName;
}
?>

Quote from morpha :^8 does NOT reset back to the default colour. ^9 does, BUT it does not reset the charset as lfsmanual suggests. ^8 is dark green, which happens to be the default chat message colour.
In other words, ^9 is like color: inherit in CSS, while ^8 explicitly sets it to dark green.

Quick question then: how is it that ^8 is the same as the default text color? I mean, try this. Go into the game, type something, change the color, then change it back with ^8. I'm pretty sure it's the same as the default color anyway. Do the same with your username: put your clan tag in a color, then put your name after ^8, and it should go to a default neutral color based on the context it is in.

Although this is all based on memory, so I could be wrong ...
^8, when using an InSim app, will show grey and sometimes dark green. I don't have a clue what ^9 does; I've never needed it.
Anyone know of a function that will allow me to replace a string within a string with another string in place? What I mean by that is, that I would like to replace a string that's 2 bytes long, with a string that could be 8 or 16 or any arbitrary length to take over those 2 bytes and expand the string's length. Or the string could be 8 bytes long, and I want to replace it with something that's 2 bytes, then it should also take the 8 bytes replace it with 2 bytes and decrease the total length of the string.

Anyone know of a built-in PHP function that will do this?
-
(Dygear) DELETED by Dygear : I should really look before I ask.
Okay after further testing, I came to the conclusion that ^8 and ^9 do the same thing; no explicit colour.
  • When sending an MST via InSim without /msg, a string coloured ^8 or ^9 will be green.
  • When sending an MST via InSim with /msg, a string coloured ^8 or ^9 will be grey.
  • When sending an MTC via InSim, a string coloured ^8 or ^9 will be grey.
^9 should reset the charset as well according to LFSManual, this was added by Krane in 2007. Whether that's a mistake or really did work back then, I can't say, but it definitely does NOT reset the charset in Z28.

Anyway, here's what I'd do instead of spans with style="color:", a span with class="c0" (c0 through 9), CSS being:
.c0 { color: #000 }
.c1 { color: #F00 }
.c2 { color: #0F0 }
.c3 { color: #FF0 }
.c4 { color: #00F }
.c5 { color: #F0F }
.c6 { color: #0FF }
.c7 { color: #FFF }
.c8, .c9 { color: inherit }

This
  • makes the spans slightly shorter
  • allows caching of the colour code CSS
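A minimal sketch of the class-based output described above, assuming the `.c0` to `.c9` rules are loaded on the page. The function name `lfs_colours_to_classes` is made up for illustration, and the HTML escaping is an addition that assumes the input is already UTF-8 (for example, the output of codepage_convert):

```php
<?php
// Hypothetical helper: wraps each run of text that follows a ^0..^9 code
// in <span class="cN">. Text before the first code is left unwrapped.
function lfs_colours_to_classes($text)
{
    // Split on colour codes; the captured digit of each code is kept,
    // so odd indexes hold digits and even indexes hold text.
    $parts = preg_split('/\^([0-9])/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $html = htmlspecialchars($parts[0]);
    for ($i = 1; $i < count($parts); $i += 2) {
        $run = $parts[$i + 1];
        if ($run !== '') {
            $html .= '<span class="c' . $parts[$i] . '">'
                   . htmlspecialchars($run) . '</span>';
        }
    }
    return $html;
}

echo lfs_colours_to_classes('regular^1R^2G^4B^9normal');
```

Because every span is closed as soon as its run ends, a string with no colour codes comes back without any span at all.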
Dygear: str_replace / str_ireplace, or if you don't want to find something but replace starting at a position, substr_replace. How exactly would you use it? Could you provide some examples with original input and expected output?
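To illustrate the positional case, here is a small sketch of substr_replace growing and shrinking a string. The sample strings are invented for the example:

```php
<?php
// Replace the 2 bytes at offset 0 with a longer string; the result grows.
$longer = substr_replace('^1Red', '<span style="color: #F00;">', 0, 2);
echo $longer, "\n";   // <span style="color: #F00;">Red

// Replace the first 8 bytes with 2 bytes; the result shrinks.
$shorter = substr_replace('12345678tail', 'ab', 0, 8);
echo $shorter, "\n";  // abtail
```

The third argument is the start offset and the fourth is how many bytes of the original to overwrite, so the replacement can be any length.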
Quote from morpha :Okay after further testing, I came to the conclusion that ^8 and ^9 do the same thing; no explicit colour.

[...]

^9 should reset the charset as well according to LFSManual, this was added by Krane in 2007. Whether that's a mistake or really did work back then, I can't say, but it definitely
does NOT reset the charset in Z28.

Interesting. I wonder if any of the devs would care to comment on the ^8 & ^9 thing.

Quote from morpha :Anyway, here's what I'd do instead of spans with style="color:", a span with class="c0" (c0 through 9).

While I understand where that is coming from, I would also like to make the function itself stand alone, without the help of outside files (such as a CSS file for the color classes). I might take your suggestion on the shortened color code, but I think some browsers interpret #F00 as #F00000 and some as #FF0000. (It's the little things that annoy me.)

Quote from morpha :Dygear: str_replace / str_ireplace, or if you don't want to find something but replace starting at a position, substr_replace. How exactly would you use it? Could you provide some examples with original input and expected output?

This is what I came up with. I've not tested any of this code, tho. This is just what I would think should happen.

<?php
$subject = '^1R^2G^4B';
# str_replace implementation.
$search = array('^0','^1','^2','^3','^4','^5','^6','^7','^8','^9');
$replace = array('<span style="color: #000;">',
'<span style="color: #F00;">',
'<span style="color: #0F0;">',
'<span style="color: #FF0;">',
'<span style="color: #00F;">',
'<span style="color: #F0F;">',
'<span style="color: #0FF;">',
'<span style="color: #FFF;">',
'<span style="color: inherit;">',
'<span style="color: inherit;">'
);
$subject = str_replace($search, $replace, $subject, $count);
# $subject = '<span style="color: #F00;">R<span style="color: #0F0;">G<span style="color: #00F;">B';
if ($count) {
  $subject = str_replace('<span ', '</span><span ', $subject);
  # $subject = '</span><span style="color: #F00;">R</span><span style="color: #0F0;">G</span><span style="color: #00F;">B';
  $subject = substr($subject, 6) . '</span>';
  # $subject = '<span style="color: #F00;">R</span><span style="color: #0F0;">G</span><span style="color: #00F;">B</span>';
}
?>

Would return: RGB (each letter in its own colour).
This doesn't work for a string like 'regular^1R^2G^4B^9normal' though. That outputs:

r</span><span style="color: #F00;">R</span><span style="color: #0F0;">G</span><span style="color: #00F;">B</span><span style="color: inherit;">normal</span>

Indeed!


<?php
function lfsStrReplace($subject)
{
    $search = array('^0','^1','^2','^3','^4','^5','^6','^7','^8','^9');
    $replace = array
    (
        '<span style="color: #000;">',
        '<span style="color: #F00;">',
        '<span style="color: #0F0;">',
        '<span style="color: #FF0;">',
        '<span style="color: #00F;">',
        '<span style="color: #F0F;">',
        '<span style="color: #0FF;">',
        '<span style="color: #FFF;">',
        '<span style="color: inherit;">',
        '<span style="color: inherit;">'
    );
    $subject = str_replace($search, $replace, $subject, $count);
    if ($count)
    {
        $subject = str_replace('<span ', '</span><span ', $subject);
        # Add closing span.
        $subject .= '</span>';
        # Remove extra front span
        $strpos = strpos($subject, '</span>');
        $subject = substr($subject, 0, $strpos) . substr($subject, $strpos + 7);
    }
    return $subject;
}

echo lfsStrReplace('regular^1R^2G^4B^9normal');
?>

Still not 100% happy with this as an answer, tho. While it may be quick (and that it is), it's also very, very dirty. I don't recommend this as a robust solution. I'm sure that a recursive function could do a much better job of this with its str_replace implementation, but I have gone another route. (And let's see if I can find that code.)
I thought of a two-pass system that would work pretty well for HTML markup. In this case, it finds the first color and checks whether it's used again later in the string. If it is, then it keeps that color on as long as the string is not reset back to its normal color. Even if another color comes into the string while a color is set, the function just does a quick switch to that color; when that's over, it ends the HTML tag for that color and goes back to the color that is still set. This is very bandwidth efficient, but very processor intensive. Also, I'm thinking that I'll have to make this a recursive function that passes its current length into the string that it's parsing so as not to cause an ∞ loop.
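For comparison, a simpler single-pass scan also avoids the stray leading `</span>` seen with 'regular^1R^2G^4B^9normal'. This is only a sketch under stated assumptions: the name `lfs_colours_inline` and the short colour table are illustrative, and ^8 and ^9 are treated as "no explicit colour", following the testing reported earlier in the thread:

```php
<?php
// Hypothetical single-pass converter: tracks whether a span is open and
// closes it before the next colour code or at the end of the string.
function lfs_colours_inline($text)
{
    $colours = array('000', 'F00', '0F0', 'FF0', '00F', 'F0F', '0FF', 'FFF');
    $html = '';
    $open = false;                       // is a <span> currently open?
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $next = ($i + 1 < $len) ? $text[$i + 1] : '';
        if ($text[$i] === '^' && $next !== '' && ctype_digit($next)) {
            if ($open) {
                $html .= '</span>';
                $open = false;
            }
            $digit = (int) $next;
            if ($digit < 8) {            // ^8 and ^9: leave the text unwrapped
                $html .= '<span style="color: #' . $colours[$digit] . ';">';
                $open = true;
            }
            $i++;                        // skip the digit
        } else {
            $html .= $text[$i];
        }
    }
    return $open ? $html . '</span>' : $html;
}

echo lfs_colours_inline('regular^1R^2G^4B^9normal');
```

The sketch does no HTML escaping itself. Since htmlspecialchars leaves `^` and digits alone, the text can be escaped before it is passed in.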
I'm having trouble converting text from LFS to HTML. I'm using the functions provided here in post #4 (I had to modify D34N0's function because it always adds an </span> at the end even if there's no color codes at all in the original string).

I don't really know what's happening, so I thought that maybe you guys can help me figure it out.

I'm saving text from LFS to a mysql table encoded in UTF-8, then I retrieve this text in PHP and show it on a UTF-8 encoded webpage. An example:

I save a player nickname, say "44# C.Mákaro" in all white. This is saved in my DB as "^744^h C.Mákaro".

Then I retrieve this nickname and I send it to HostToHTML to convert it to HTML. The problem comes in the function "codepage_convert()". I added a few debugging lines to it to see what was happening, and this is what I get:
Original: ^744^h C.Mákaro
^: 94
7: 55
4: 52
4: 52
#: 35
: 32
C: 67
.: 46
M: 77
�: 195
�: 161
k: 107
a: 97
r: 114
o: 111
Tmp: ^744# C.Mákaro
Converted: ^744# C.Mákaro

First is the original string as the function receives it, then the ASCII value of every individual character as in "ord($str{$i})", then the $tmp string just before mb_convert_encoding() and finally the resulting string.

First of all, the '^h' is converted to '#' in the beginning, which is all right, but then the 'á' is interpreted as two characters in the main loop? Why? But then the $tmp string which is passed to mb_convert_encoding() is ok again. But mb_convert_encoding() turns it into a wrong result...

Then again, if I skip the codepage conversion and use only "format_host_colours()" I get the 'á' character right, but I lose the # because '^h' gets stripped out of the string.

What is exactly happening? :-/
I think it might be related to your storing it in a utf-8 table. That's fine of course, but when you fetch that name again from db, do you keep that string utf-8? Or do you convert it back to iso88591? Ie. what are your mysql collation settings?

If you keep the string in utf-8 after fetching it from sql, then the á is no longer in iso88591 encoding, but instead its value has moved to a whole other location within the utf-8 table.
That is why you already see
�: 195
�: 161

(the 195 gives away that it's in utf8 format).
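A short sketch of what those two byte values mean, assuming the mbstring extension is available (the functions above already rely on it). The bytes are written as escapes so the result does not depend on how the script file itself is saved:

```php
<?php
$utf8 = "\xC3\xA1";                       // 'á' encoded as UTF-8
echo strlen($utf8), "\n";                 // 2 bytes
echo ord($utf8[0]), ' ', ord($utf8[1]), "\n";   // 195 161

// The same character in CP1252 is a single byte.
$cp1252 = mb_convert_encoding($utf8, 'CP1252', 'UTF-8');
echo strlen($cp1252), ' ', ord($cp1252), "\n";  // 1 225

// Treating the UTF-8 bytes as CP1252 and converting again double-encodes
// them, which is what happens when UTF-8 text is fed to codepage_convert.
$double = mb_convert_encoding($utf8, 'UTF-8', 'CP1252');
echo bin2hex($double), "\n";              // c383c2a1
```

A byte-by-byte loop like the one in codepage_convert sees each of the two UTF-8 bytes as a separate CP1252 character.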

The cure: encode it to UTF-8 before storing it in the DB.
My MySql collation settings are all set to UTF-8. All the tables are in UTF-8 and global collation setting is UTF-8 as well.

I really don't understand how encoding the string before storing it would change the result, and I wouldn't know how to encode "char *" C strings from my insim application to UTF-8. I wouldn't even know where to start on that...

What I have done is to add this line at the beginning of the function:

<?php 
$str 
mb_convert_encoding ($str'CP1252''UTF-8');
?>

That way the first thing I do is convert the string from UTF-8 to CP1252 and then I process it. This has been a step forward, and the sample string I provided before is now converted successfully. I have tried other sample strings with rare characters from other codepages and some characters still won't show, but most of them do.

How nasty is my solution? I've thought that as long as all my data is stored in UTF-8 format and is retrieved in UTF-8 as well, I'd rather try to change the codepage_convert() function before trying other alternatives.
Your solution of converting from utf-8 to cp1252 before the conversion function isn't that bad. It's just extra work because you convert first from mixed codepages (the LFS string) to utf8 while storing the string in DB, then back from utf8 to cp1252 after retrieving it from db. But it's not that heavy a process, so no one will notice. It might have slight side effects though, because your database will probably not treat your string as cp1252 while converting it to utf-8, but rather as iso-8859-1. Not that much of a difference, but cp1252 is not 100% equal to iso-8859-1.

To further explain why it goes wrong :
The username string you receive from LFS contains only single-byte characters (from single byte character sets http://msdn.microsoft.com/en-us/goglobal/bb964654.aspx).
When you store it in the DB, the DB all of a sudden treats these characters like ISO-8859-1 (probably) and then converts them to utf-8, according to your collation settings. utf-8 is a multi-byte character set and the á in your name gets converted from a single byte character to a two-byte character during this process.

Then when you retrieve the string from DB into your php script, your DB will leave that string untouched (because your collation is utf-8 -> utf-8). It will remain in the utf-8 character set. Hence, your á remains a two-byte character. But, the function above expects single-byte characters. Therefore the output is not what you expected .. because what you put in was already not in the correct format.

I hope that clears things up.
OK, everything clear now. I'll think about some alternatives, like changing the collation setting of that field (which only contains text that comes directly from LFS) to latin1_general_ci (or the MySql collation setting equivalent to cp1252, if there's any). This way I won't have MySql convert the characters to a different charset when storing the text from LFS, but if my general setting is UTF-8 I will still need to convert to cp1252 after retrieving the data.

My knowledge on encoding systems and collation settings in MySql is rather limited, so I'll try to run some tests and read doc and see if I can get everything right.
Quote from Victor :... converts them to utf-8, according to your collation settings.

Collations deal with sorting, not encodings or conversions.

Quote from mysql.com :A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

http://dev.mysql.com/doc/refman/5.0/en/charset.html
really?
well anyway, i mean 'the settings that take care of auto-charset conversions when passing data to and from sql'
Quote from MaKaKaZo :I'm having trouble converting text from LFS to HTML. I'm using the functions provided here in post #4 (I had to modify D34N0's function because it always adds an </span> at the end even if there's no color codes at all in the original string).

That function expects a string in the LFS Format, it can't be UTF-8. The string should be left in the LFS format for the best results at all times.
Quote from Dygear :That function expects a string in the LFS Format, it can't be UTF-8. The string should be left in the LFS format for the best results at all times.

Quite impossible given the settings of my database. At some point the single-byte character string in cp1252 from LFS gets converted into UTF-8. The only way to avoid this that I found is to change the column charset to latin1 and to perform a "SET NAMES latin1" query before storing and retrieving the data and then do a "SET NAMES utf8" to change back. This is a big no-no, so I'm going to stick with my previous solution of using mb_convert_encoding() to encode the UTF-8 stored string into CP1252 before the parsing takes place.

I thought there was a problem with some characters, but it turns out that I didn't have the Asian charsets compatibility pack installed on this Win XP box... just installed it and I see everything OK.

PS: Thx filur for the link, it provided good reference on the topic. I wish I could just do something in the "SELECT" statement that I use to retrieve these data to specify that I want the results in "latin1" without having to mess with connection variables. I tried with CONVERT but it didn't work.
Quote from MaKaKaZo :PS: Thx filur for the link, it provided good reference on the topic. I wish I could just do something in the "SELECT" statement that I use to retrieve these data to specify that I want the results in "latin1" without having to mess with connection variables. I tried with CONVERT but it didn't work.

filur is the man, I've been saying this for a while.
#24 - PoVo
Great code Dygear! Works very well
Quote from PoVo :Great code Dygear! Works very well

Thank you . By chance are you using the class that I attached or one of the functions posted on here?