LFS Forum - TextStart in IS_MSO is unreliable for non-latin characters

#1 - Bokujishin

Tue 25 Mar 2025, 11:15

When we want to strip the name from an IS_MSO message, we can use TextStart, and this works fine... as long as the sender's name only contains Latin characters. If I use Japanese characters in my name, for instance (it is pretty common for teams and people to use Japanese katakana to stylize text, e.g. ﾁSﾏ for FSR), then TextStart will have a greater value than it should.

Here's an example using "Cyk R", "Cyk ﾏ" (half-width "ma") and "Cyk マ" (full-width "ma") as a nickname and sending a message containing only "test":


full message:       ^7Cyk R ^7: ^8test
to utf16(?):        ^7Cyk R ^7^c ^8test
TextStart =         14
buffer[+TextStart]: R ^7: ^8test
substr(+TextStart): test
custom regex:       test

full message:       ^7Cyk ﾏ ^7: ^8test
to utf16(?):        ^7Cyk ^JÏ ^7^c ^8test
TextStart =         16
buffer[+TextStart]: Ï ^7: ^8test
substr(+TextStart): st
custom regex:       test

full message:       ^7Cyk マ ^7: ^8test
to utf16(?):        ^7Cyk ^J荽 ^7^c ^8test
TextStart =         17
buffer[+TextStart]: } ^7: ^8test
substr(+TextStart): t
custom regex:       test

buffer[+TextStart] is just me cutting the first TextStart bytes and converting again, for reference, while substr(+TextStart) is how I assume TextStart is supposed to be used: by removing the first TextStart characters from the string.

A single Japanese character caused an offset of 2 characters in TextStart, maybe because ^J is ignored? Also worth noting that full-width Japanese characters cause an additional offset in TextStart, as seen in the 3rd test.

I think this means any codepage change in the player's name will cause TextStart to report erroneous values, and full-width Japanese characters (at the very least) also throw TextStart off. A name like 日本人じゃないけど will cut 11 characters from the message.

With that said, it is possible to use a regular expression to fulfill the same purpose without relying on TextStart; that is what the custom regex line shows in the above tests.


\^7%s \^7: \^8  # Replace %s with player name
(?<!\\)\^  # Use this to escape non-escaped carets ^ in the player's name
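For illustration, here is a minimal Python sketch of the regex approach above. The function name strip_sender is hypothetical and not part of any InSim library; it assumes the message has already been converted to a regular string and that the prefix always has the form "^7<name> ^7: ^8". Python's re.escape is used in place of the caret-only escape, since it covers carets and every other regex metacharacter in the name.

```python
import re


def strip_sender(msg: str, player_name: str) -> str:
    """Return msg without the leading '^7<name> ^7: ^8' prefix, if present.

    Hypothetical helper: assumes msg is already a decoded string and that
    the prefix layout matches the examples in this thread.
    """
    # re.escape neutralises carets and any other regex metacharacters
    # that may appear in the player's name.
    prefix = re.compile(r"\^7" + re.escape(player_name) + r" \^7: \^8")
    match = prefix.match(msg)
    # Leave the message untouched when the prefix is not found.
    return msg[match.end():] if match else msg


if __name__ == "__main__":
    for name in ("Cyk R", "Cyk ﾏ", "Cyk マ"):
        print(strip_sender(f"^7{name} ^7: ^8test", name))
```

Because the match is done on characters of the converted string, it is unaffected by how many bytes the name took in the original packet.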

#2 - Scawen

Sun 12 Oct 2025, 15:13

I can't see or reproduce any bug. Something appears to be wrong in your code or analysis.

The TextStart byte refers to the actual offset in the original binary representation of the string in the packet, but you seem to be comparing it with some interpreted or converted strings.

Your "to utf16(?)" strings don't mean much to me. For instance, in the original output packet there is no ^c - there is just a plain colon.

To deal with your examples:

1) If you restore the ^c to : then the offset of "test" is 14 as expected.

2) Single-width katakana (1 byte for this katakana): again, the offset is correct if you replace ^c with :

3) Double-width character (2 bytes), so the offset is one more than in example (2).

Basically if you look at the original bytes in the packet before doing any conversion into other text encoding systems, you should find the offset points to the first character of the message. I've verified this in my own InSim packet checker.
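As a rough Python sketch of the point above: apply TextStart to the packet bytes first and convert afterwards. The byte strings below are a reconstruction of the three examples in this thread, not captured packets. They assume that ^J is followed by the Japanese codepage bytes (0xCF for the half-width katakana, 0x83 0x7D for the full-width one), which matches the reported TextStart values of 14, 16 and 17. The helper name split_at_text_start is made up for this example.

```python
from typing import Tuple

# Reconstructed raw message bytes and the TextStart values reported above.
# These are assumptions based on the thread, not real packet captures.
EXAMPLES = [
    (b"^7Cyk R ^7: ^8test", 14),
    (b"^7Cyk ^J\xcf ^7: ^8test", 16),      # half-width katakana: 1 byte
    (b"^7Cyk ^J\x83\x7d ^7: ^8test", 17),  # full-width katakana: 2 bytes
]


def split_at_text_start(raw: bytes, text_start: int) -> Tuple[bytes, bytes]:
    """Split the raw message bytes into (name prefix, message body)."""
    # TextStart is a byte offset, so slice before any text conversion.
    return raw[:text_start], raw[text_start:]


if __name__ == "__main__":
    for raw, text_start in EXAMPLES:
        prefix, body = split_at_text_start(raw, text_start)
        # The body is plain ASCII in these examples; a real message would
        # go through the library's usual codepage conversion instead.
        print(body.decode("ascii"))
```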

#3 - Bokujishin

** Best answer **

Sun 12 Oct 2025, 19:10

Looking back, I think the issue stems from how Godot InSim handles text in the various InSim packets (and, I believe, many other InSim libraries do the same). Basically, codepages are a pain to deal with, so we store everything as UTF8 and only present UTF8 to the user, for both input and output. Because of this, Godot InSim gives access to the InSimMSOPacket.msg property as a UTF8-encoded string only, and using TextStart to offset this string (or the "UTF16" intermediate string) doesn't work in the scenarios I presented: TextStart counts bytes in the original packet, not characters in the converted string.
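The size of the mismatch can be estimated with a short Python sketch. It rests on assumptions: the name is sent with a single ^J escape and no switch back, the Japanese codepage corresponds to Python's cp932 codec, and the converted string no longer contains the ^J escape. The function name text_start_drift is hypothetical.

```python
def text_start_drift(name: str) -> int:
    """Estimate how far a byte-based TextStart overshoots when it is
    applied to the converted string as a character offset.

    Assumes a single "^J" escape and the cp932 codec (see lead-in).
    """
    if name.isascii():
        # No codepage switch needed: bytes and characters line up.
        return 0
    extra_bytes = len(name.encode("cp932")) - len(name)
    # +2 for the "^J" escape, present in the packet but not in the string.
    return extra_bytes + 2


if __name__ == "__main__":
    for name in ("Cyk R", "Cyk ﾏ", "Cyk マ", "日本人じゃないけど"):
        print(name, text_start_drift(name))
```

Under these assumptions the results agree with the tests in the first post: 2 characters lost for the half-width katakana, 3 for the full-width one, and 11 for 日本人じゃないけど.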

I believe you are most likely right that there is in fact no issue here, and I was just not using TextStart as intended. I will, however, keep calling this property "unreliable" in Godot InSim, because it actually is, as a result of the text handling. I ended up adding utility functions to retrieve a message's author, and the message's contents, based on the regex shown above, which is a workaround I'm okay with.