I need some advice: LFS Strings.
I was working on Krammeh's bug report about MST/MTC/MSX colors breaking in strings that span multiple lines. I'm not one to duck the hard questions, and this one did cost me quite some time to think through. The only conclusion I've come to is that much of PRISM is going to need to be rewritten to handle strings correctly in these cases. I now feel that the best way to handle LFS strings is to make each LFS datatype (Time, String, etc.) into its own class that will be handled and parsed at the packet level, and thus make it available on all levels. This idea was one that I had for version 2.0 of LFSWorldSDK, so I might just prototype the system there and move it into PRISM when it's done. But is there an easier way? Am I missing something that would allow me to do this without having to rewrite much of the packet system?
A specialised LFS String class is the proper way to do it. The magic method __toString can be used to make the class (or rather, instances of it) behave very elegantly within a normal string context.

Dealing with coloured and/or non-default charset strings is tricky though.
Basically what you need to do is strip unnecessary escape sequences (^L, ^8 or ^9 at the very beginning of the string).

If the string still exceeds the maximum length, scan MAXLEN-4 for escape sequences.

If there is a pair (charset AND colour), split at the first (from left to right) sequence and repeat.
Else, scan through the entire chunk (from 0 to MAXLEN) for a charset and/or colour sequence (which you store) and make sure the last char is not part of a multibyte character or special character escape sequence (^v, ^a, ^s ...).
If it is, split before the first byte of that character. Make sure to carry the charset/colour over to the next chunk.
Else, split normally and repeat, carrying over the last charset and colour.
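The steps above could be sketched roughly as follows. This is a hypothetical, simplified helper, not PRISM code: it skips the MAXLEN-4 scan and the multibyte lead-byte check, and only demonstrates not splitting inside a '^' escape pair and carrying the last colour/charset over to the next chunk.

```php
<?php
// Simplified sketch of the splitting steps above (hypothetical helper).
// Multibyte lead-byte handling (isMultiByte) is omitted for brevity.
function lfs_split_sketch($string, $maxlen = 64)
{
    $chunks = array();
    $carry  = '';   // last colour + charset sequences to prepend

    while (strlen($carry . $string) > $maxlen) {
        $str = $carry . $string;
        $at  = $maxlen;
        // never split between '^' and its following byte
        if ($str[$at - 1] === '^') {
            $at--;
        }
        $chunk    = substr($str, 0, $at);
        $chunks[] = $chunk;

        // remember the last colour and charset in effect
        $carry = '';
        if (preg_match_all('/\^([0-9])/', $chunk, $m)) {
            $carry .= '^' . end($m[1]);
        }
        if (preg_match_all('/\^([LGCJETBHSK])/', $chunk, $m)) {
            $carry .= '^' . end($m[1]);
        }
        $string = substr($str, $at);
    }
    $chunks[] = $carry . $string;
    return $chunks;
}
```

A real splitter would additionally back off the split point when it lands on a DBCS lead byte, as described above.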

Perhaps some gifted regular expression expert can find a way to achieve this with preg_split() which would certainly be the most efficient approach.
Quote from morpha :Perhaps some gifted regular expression expert can find a way to achieve this with preg_split() which would certainly be the most efficient approach.

That's what I was thinking, where's filur when you need him?
I've done some prototyping as well, no actual splitting yet though.

<?php 

/**
* Trim reset, default colour and default charset sequences from
* the beginning and end of the string (or strings, if an array is
* passed, in which case it will return an array instead of a string).
* This does not trim whitespaces or any other characters trim() trims.
*/
function trim_sequences($string)
{
    return preg_replace('/^(?:\^[89L])*(.*?)(?:\^[0-9LGCJETBHSK])*$/', '$1', $string);
}

/**
* Strip redundant or chained sequences.
*
* Example:
* '^H^S^4Stuff^8^9Other stuff'
* becomes
* '^S^4Stuff^8Other stuff'
*/
function optimise($string)
{
    $patterns = array(
        '/(\^[012345679])+/',    // colours and colour reset
        '/(\^[LGCJETBHSK])+/',   // charsets

        // full reset (col and cp) sequence ^8
        '/(?:\^[0-9LGCJETBHSK])*(\^8)(?:\^[89L])*/',
        '/\^8(\^[0-7]\^[GCJETBHSK]|\^[GCJETBHSK]\^[0-7])/',
    );

    // strip redundant / chained sequences
    return preg_replace($patterns, '$1', $string);
}

/**
* Determine if the given character is the lead byte of a multibyte
* character in the given codepage, passed as LFS language identifier.
*/
function isMultiByte($char, $cp)
{
    $char = (int)$char;
    switch ($cp)
    {
        case 'J':
            return ($char >= 0xE0 || $char > 0x80 && $char <= 0xA0);

        case 'K':
        case 'H':
        case 'S':
            return ($char > 0x80);

        default:
            return false;
    }
}
?>

trim_sequences passes all the test cases I could think of.

optimise also does what it's supposed to do, but it does not detect redundancy over the full string width; it's limited to a sequence chain. This is because I wouldn't know how to achieve it with regex assertions, and I didn't want to iterate over the string because the actual splitter is going to do that anyway. Put simply, it will correct ^L^B^H^HSomething to ^HSomething, but ^L^B^H^HSome^Hthing will still have that redundant ^H between 'Some' and 'thing', i.e. ^HSome^Hthing

isMultiByte is completely untested, but it should work under the unconfirmed assumption that all multibyte codepages LFS uses do not have characters wider than 2 bytes. If your split point tests positive with isMultiByte, split one earlier and you should be fine.

As I said initially though, all of those are prototypes and haven't undergone any actual testing with PRISM / LFSWorldSDK, or raw captured strings from LFS for that matter.
Quote :UTF-8 encodes each of the 1,11 ... in the Unicode Standard

We really do need to have full support for the UTF-8 standard as this could be used on websites as well as in email, or pretty much any transmission. I still don't know what I am going to do about this myself, and I do appreciate the suggestions so far.
toUTF8 and fromUTF8 are already part of my LFSString class prototype, but that's far from ready. Note, however, that the multibyte charpages LFS uses are not UTF8 or UTF8 related and are indeed very likely to use 2 byte wide chars at most, I just haven't confirmed that yet.
I'm really not sure myself, but I would love Victor to comment on this, as he deals with things like this every day.
Bringing this one back up.

http://msdn.microsoft.com/en-us/goglobal/bb964654
Quote :DBCS (Double Byte Character Set) Codepages

[...]
  • 932 (Japanese Shift-JIS)
  • 936 (Simplified Chinese GBK)
  • 949 (Korean)
  • 950 (Traditional Chinese Big5)

So that confirms it, they are indeed 2-byte wide at most.

Edit: Just so everyone's up to speed, here's a "short" summary of what you should know when dealing with LFS strings:

Escape sequences:
  • All escape sequences start with a circumflex ('^')
  • All escape sequences are 2 bytes wide
Code pages:
  • The codepages J (CP932, Japanese), S (CP936, Simplified Chinese), K (CP949, Korean) and H (CP950, Traditional Chinese) are double-byte character sets. That doesn't mean all their characters occupy 2 bytes, but a large portion of them does.
  • All codepages contain all 128 ASCII characters, but CP932 (J, Japanese) is a bit of a special case, see next list item.
  • CP932 character 0x5C is mapped to UNICODE U+005C, REVERSE SOLIDUS (backslash), but many fonts display a Yen-sign (U+00A5) instead, as does LFS. This can lead to inconsistent results when converting from other character sets because the mapping differs from the displayed character. Dealing with this is difficult because the Yen-sign (U+00A5) is not mapped in any of LFS's CPs, only the full width Yen-sign (U+FFE5). Basically this means:
    • When converting LFS ^J strings to Unicode, convert 0x5C to U+00A5, not U+005C
    • When converting from Unicode to LFS, you cannot use ^J to output a backslash as both 0x5C and the escape sequence ^d will output a Yen-sign
    • When converting from Unicode to LFS, convert U+00A5 to ^J followed by 0x5C
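The three 0x5C rules above, as a minimal sketch. The helper names are hypothetical; a real converter would handle the full CP932 mapping, and only the backslash / Yen-sign cases are shown here.

```php
<?php
// Minimal sketch of the CP932 0x5C special case (hypothetical helpers).
function cp932_byte_to_unicode($byte)
{
    if ($byte === 0x5C) {
        return 0x00A5;   // rule 1: 0x5C in ^J converts to the Yen-sign
    }
    return $byte;        // ASCII passthrough (simplified)
}

function unicode_to_lfs_char($codepoint)
{
    if ($codepoint === 0x00A5) {
        return '^J' . chr(0x5C);   // rule 3: Yen-sign via ^J + 0x5C
    }
    if ($codepoint === 0x005C) {
        // rule 2: never emit a backslash while in ^J; use the '^d'
        // escape, which displays as '\' in every other codepage
        return '^d';
    }
    return chr($codepoint);        // ASCII passthrough (simplified)
}
```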
Feel free to extend the list and add all the "gotchas" and "aha!"s you have
Using UTF-8 strings natively would have killed Scawen?
Back when LFS development started (2001? earlier?), UTF-8 wasn't as widely used and supported as it is today. It was no doubt the better choice, but not yet the obvious one.
Most applications of the time didn't support multiple languages (at the same time) at all, so LFS set a good example in that respect.

You can blame Scawen for making us find solutions to problems we wouldn't be having if he'd made the "more innovative" choice, but LFS's system does its job. And if we're honest, we all enjoy a good programming challenge, don't we?

Perhaps when (if) right-to-left language support is added, he'll consider a complete re-write of the entire text handling system, but that's probably fairly low down on his to-do list.
The question is: would the time needed to convert LFS to Unicode be better spent working on something else, like better physics or more content? Sadly, the answer is yes. It would be great if LFS natively used UTF-8, but I don't think it's worth it now, plus there are libraries available which have solved the problem.
Quote from DarkTimes :there are libraries available which have solved the problem.

"Libraries"? I don't even know of one
pyinsim's strmanip does the conversion stuff very well (not 100% accurate but close, I'll be improving it eventually) and so, I presume, does InSim.NET, but I don't know of any lib that does it literally "to the letter".

Splitting strings intelligently is the more pressing issue though, which, to my knowledge, none of the publicly available libraries are capable of. By "intelligently", I mean maintaining colour and codepages, not splitting escape sequences and double-byte characters, stripping redundant and trailing non-printable characters, etc.
Quote from morpha :By "intelligently", I mean maintaining colour and codepages, not splitting escape sequences and double-byte characters, stripping redundant and trailing non-printable characters, etc.

This is something that I am really hoping to nail before 0.5.0 of PRISM. But I could use some help!
Quote from morpha :"Libraries"? I don't even know of one
pyinsim's strmanip does the conversion stuff very well (not 100% accurate but close, I'll be improving it eventually) and so, I presume, does InSim.NET, but I don't know of any lib that does it literally "to the letter".

Meh, that's fair. I guess no library has it completely solved, so I retract that. There was a time a while back when no InSim library handled strings correctly, but now we have pyinsim and InSim.NET, which do 'good enough' conversions. Frankly, I think 'good enough' is acceptable.

Quote :Splitting strings intelligently is the more pressing issue though, which, to my knowledge, none of the publicly available libraries are capable of. By "intelligently", I mean maintaining colour and codepages, not splitting escape sequences and double-byte characters, stripping redundant and trailing non-printable characters, etc.

If I understand you correctly, then InSim.NET does this (except for colors admittedly), but that's only because all strings are converted to and from unicode when receiving and sending packets. InSim.NET does not give you access to the bytes, you just treat all strings as unicode and the conversion is done behind your back.

Edit: for me a more pressing issue is to figure out a fast way to determine how many bytes a unicode string will be once converted to a LFS string, without actually doing the conversion. This would allow me to do some useful optimisations in InSim.NET.
I don't think there is a way to reliably predict or analyse post-conversion string size that's less expensive than the actual conversion. For characters, it's known that they're at most 2 bytes each, but you won't know how many codepage-switches they'll necessitate without iterating over the Unicode-string and determining the language/cp of each character.
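To illustrate the point: even a rough estimate has to walk the whole string, because codepage switches depend on each character's script. A sketch in PHP, where the block-to-codepage mapping is a deliberately crude assumption (Cyrillic to ^C, everything else non-ASCII to ^J), not LFS's real one:

```php
<?php
// Decode one UTF-8 character to its Unicode codepoint (no mbstring).
function utf8_ord($char)
{
    $b = ord($char[0]);
    if ($b < 0x80) return $b;
    if ($b < 0xE0) return (($b & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    if ($b < 0xF0) return (($b & 0x0F) << 12) | ((ord($char[1]) & 0x3F) << 6)
                        | (ord($char[2]) & 0x3F);
    return (($b & 0x07) << 18) | ((ord($char[1]) & 0x3F) << 12)
         | ((ord($char[2]) & 0x3F) << 6) | (ord($char[3]) & 0x3F);
}

// Estimate the LFS byte length of a UTF-8 string. The codepage guess
// below is a crude assumption for illustration only.
function estimate_lfs_bytes($utf8)
{
    $bytes = 0;
    $cp    = 'L';   // assume the default codepage to start with
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $code = utf8_ord($char);
        if ($code < 0x80) {        // ASCII survives in every codepage
            $bytes += 1;
            continue;
        }
        // crude guess: Cyrillic block -> ^C (1 byte/char),
        // anything else -> ^J (up to 2 bytes/char)
        $want = ($code >= 0x0400 && $code <= 0x04FF) ? 'C' : 'J';
        if ($want !== $cp) {
            $bytes += 2;           // the '^C' / '^J' switch itself
            $cp = $want;
        }
        $bytes += ($want === 'J') ? 2 : 1;
    }
    return $bytes;
}
```

Even this rough version is O(n) over the input, which is essentially the same cost as doing the conversion.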

Anyway, I've now started working on strmanip and pretty much achieved what I set out to achieve, or so I thought. In theory, everything was 100% accurate, but when I tested it, LFS screwed me good.
The problem is that certain characters present in multiple codepages are rendered differently between them. For example, Simplified Chinese (S, CP936) contains 66 characters from the Cyrillic set, but their graphical representation in LFS is not identical to the actual Cyrillic (C, CP1251) one.
Here's a comparison screeny for the string 'ЁёАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя' (all 66 characters present in both C and S):

The first two lines are the S-version (134 bytes including '^S', hence 2 lines), the last line the C-one (68 bytes including '^C').

Obviously that renders strmanip's sort of "lazy" conversion useless, because it not only produces visually incorrect but also significantly larger results than it should. The problem is, while detecting this is relatively easy for L, E, T, B, C and G, it's not at all easy for J, H, S and K because these are unified in the Unicode standard as CJK Unified Ideographs.
I'm now in the process of rewriting strmanip entirely, not only because of the issues above but also because there are some reliability issues with Python 2.7's (haven't checked 3/3.2) native conversion mappings.
Among other things, fromUnicode() will determine the Unicode Block of a character and select the appropriate codepage based on that. It'll also attempt to guess which codepage to use for CJK input based on whether the majority of the string is kana (making it most likely Japanese) or hangul (Korean), otherwise it'll default to Chinese.
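The kana/hangul majority heuristic could look something like this. The Unicode ranges are simplified, and guess_cjk_codepage() is a hypothetical name, not pyinsim's actual API:

```php
<?php
// Sketch of the CJK codepage-guessing heuristic described above.
function guess_cjk_codepage($utf8)
{
    // hiragana + katakana (U+3040..U+30FF)
    $kana   = preg_match_all('/[\x{3040}-\x{30FF}]/u', $utf8, $m);
    // hangul syllables and jamo
    $hangul = preg_match_all('/[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}]/u', $utf8, $m);

    if ($kana > $hangul && $kana > 0) {
        return 'J';   // mostly kana: Japanese (CP932)
    }
    if ($hangul > 0) {
        return 'K';   // hangul present: Korean (CP949)
    }
    return 'S';       // otherwise default to Chinese (CP936)
}
```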

This is some complex stuff and I realise I've taken this a bit far off-topic, if it bothers anyone just split the topic
If all goes according to plan, I'll also be able to provide conversion mappings for PRISM, though obviously not to and from Unicode but rather UTF-8 directly, since PHP doesn't have native Unicode support (yet).
At least it's not like iRacing, where despite apparently using UTF-8 internally, they can't seem to get UTF-8 characters to display on their website, nor on the sim due to the fact "The sim uses pre-rendered fonts, and we include only the first 255 characters." :doh:
Quote from morpha :If all goes according to plan, I'll also be able to provide conversion mappings for PRISM, though obviously not to and from Unicode but rather UTF-8 directly, since PHP doesn't have native Unicode support (yet).

That's great! I'm happy if you're willing to do the work. I wonder why Victor has not chimed in yet, seeing as he deals with this on the LFSWorld site, with hostnames, and usernames.

Quote from boothy :At least it's not like iRacing, where despite apparently using UTF-8 internally, they can't seem to get UTF-8 characters to display on their website, nor on the sim due to the fact "The sim uses pre-rendered fonts, and we include only the first 255 characters." :doh:

This might give me nightmares.
Quote from Dygear :That's great! I'm happy if you're willing to do the work. I wonder why Victor has not chimed in yet, seeing as he deals with this on the LFSWorld site, with hostnames, and usernames.

Probably because he's satisfied with his reasonably accurate solution

I must admit I haven't looked at a piece of actual PRISM code in really quite a while, so I'm not sure if this complies with its general coding standards and conventions, but this is the first take on the Encoding interface:

<?php 
interface Encoding
{
    public static function encode($utf8_encoded_string);
    public static function decode($codepage_encoded_string);
    public static function contains($character, $needle_is_utf8 = false);
}
?>

CP932 and CP936 are already implemented and passed a few test cases, I'll finish up the rest after a few hours of sleep
Obviously this needs at least one additional layer of abstraction to be of any actual use, something similar to pyinsim's strmanip. I also haven't done any performance related testing yet, but being pure PHP, it'll likely lose to mbstring and iconv. The upside is it works standalone and it's the exact mapping LFS uses.
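For illustration, a hypothetical implementation of the interface might delegate to iconv like this. The real CP932/CP936 classes use their own mapping tables instead (that's the whole point of a standalone module); this only shows the intended shape.

```php
<?php
// Hypothetical Encoding implementation backed by iconv, for shape only.
interface Encoding
{
    public static function encode($utf8_encoded_string);
    public static function decode($codepage_encoded_string);
    public static function contains($character, $needle_is_utf8 = false);
}

class CP1251 implements Encoding
{
    public static function encode($utf8_encoded_string)
    {
        // drop characters CP1251 can't represent
        return iconv('UTF-8', 'CP1251//IGNORE', $utf8_encoded_string);
    }

    public static function decode($codepage_encoded_string)
    {
        return iconv('CP1251', 'UTF-8', $codepage_encoded_string);
    }

    public static function contains($character, $needle_is_utf8 = false)
    {
        if ($needle_is_utf8) {
            // iconv returns false (with a notice) on unmappable input
            $character = @iconv('UTF-8', 'CP1251', $character);
        }
        return $character !== false && strlen($character) === 1;
    }
}
```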

Right then -> :sleep2:
That looks good to me.
Quote from Dygear :That's great! I'm happy if you're willing to do the work. I wonder why Victor has not chimed in yet, seeing as he deals with this on the LFSWorld site, with hostnames, and usernames.

I don't really have a need for splitting coloured LFS texts, though I may in the future, who knows.
But as I was reading this thread I was actually thinking "hmm, maybe sync the prism folder and see about doing this" :P But Morpha beat me to it, so I'll leave it alone for now

Quote from morpha :Probably because he's satisfied with his reasonably accurate solution

If you ever see a wrong conversion, let me know. I don't think I've ever seen one. But I'll happily be proven wrong.
Quote from Victor :If you ever see a wrong conversion, let me know. I don't think I've ever seen one. But I'll happily be proven wrong.

I've encountered a few back in DriftWars's prime, ISOs instead of MSCPs messed up a few of the exotic names, but nothing major. I genuinely meant the "reasonably accurate", that wasn't sarcastic


CP932 and CP936 are now 100% complete and produce identical results to mb_convert_encoding() and iconv(). Unfortunately, being pure PHP, it's quite slow, about 50 times slower than iconv()/mb actually. As such, I don't think it's worth finishing, except maybe as a fall-back solution for systems where neither mbstring nor iconv are available. I've attached it though, if anyone wants to have a look.
Attached files
encoding.zip - 349.7 KB - 242 views
Quote from morpha :I've encountered a few back in DriftWars's prime, ISOs instead of MSCPs messed up a few of the exotic names, but nothing major. I genuinely meant the "reasonably accurate", that wasn't sarcastic


CP932 and CP936 are now 100% complete and produce identical results to mb_convert_encoding() and iconv(). Unfortunately, being pure PHP, it's quite slow, about 50 times slower than iconv()/mb actually. As such, I don't think it's worth finishing, except maybe as a fall-back solution for systems where neither mbstring nor iconv are available. I've attached it though, if anyone wants to have a look.

That's fantastic then. When loading up PRISM, I'll check to see if the multibyte module is loaded in PHP, and if it's not, use this. Thanks!
I've now tested it on my laptop and the results are quite different to my home server's. My pure PHP solution (which I shall dub "PRISIM" [PRISM String Internationalisation Module] :razz:) is actually 16% faster than mb_convert_encoding(), but 30% slower than iconv().

Seeing as mbstring doesn't support CP1250, CP1253 and CP1257, and is apparently slower than PRISIM on modern machines, iconv should be used instead. It's the fastest and uses the system's underlying iconv-implementation, which nowadays usually means it'll support pretty much any registered character set you can throw at it.

iconv > PRISIM > mbstring

Alrighty, now knowing that it doesn't perform so poorly after all, I'm back on it
So then: iconv if available, and PRISIM's pure PHP implementation otherwise.
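That fallback order might look something like this in practice. prisim_convert() is a hypothetical stand-in for the attached module's entry point:

```php
<?php
// Sketch of the suggested fallback order: iconv first, then the pure
// PHP PRISIM mapping, then mbstring.
function lfs_convert($string, $from, $to)
{
    if (function_exists('iconv')) {
        return iconv($from, $to . '//IGNORE', $string);
    }
    if (function_exists('prisim_convert')) {
        return prisim_convert($string, $from, $to);   // hypothetical
    }
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($string, $to, $from);
    }
    return $string;   // no converter available; pass through unchanged
}
```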
