Skip to content

Improved representation of strings #1160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
myreen opened this issue Apr 21, 2025 · 0 comments
Open

Improved representation of strings #1160

myreen opened this issue Apr 21, 2025 · 0 comments

Comments

@myreen
Copy link
Contributor

myreen commented Apr 21, 2025

In CakeML strings are immutable vectors of characters. At present they are always represented at runtime as a pointer to an immutable byte array. For example, the string "hello" is represented by

ptr --> [header: immutable byte array (length 5);
         payload: 0x68 0x65 0x6C 0x6C 0x6F]

This issue is about making the compiler generate code with a more space efficient (and faster?) representation for short strings. The idea is to represent short strings as numbers directly in the register (or memory slot) where the pointer to the byte array would otherwise be. The string "hello" would be represented as:

0x568656C6C6F00000

which is 0x5 and then 0x68 0x65 0x6C 0x6C 0x6F followed by zeros. This encoding can represent strings up to length 7 directly as an integer with least significant bit 0 indicating to the CakeML GC that it's a number.

With this representation the empty string is number 0. A single character string such as "a" is 0x16100000000000 where 0x61 is the hex for character a.

This string representation pads zeros to the right in order to make the following string comparison fast:

fast_str_le s t =
  if strlen s = strlen t then s <= t else strlen s <= strlen t 

This can be implemented as follows:

  • When computing this for two short strings, we can implement this with a single machine instruction for 64-bit integer comparison.
  • When comparing a short and a long string, we can immediately tell which the result based on the least significant bit of the pointer / integer, since pointers in CakeML always have least significant bit 1.
  • When comparing two long strings, we first compare the headers. If they differ, then an integer comparison of the headers gives the result. If the headers are the same, then we need to walk the payload.

Using the space optimiation and fast_str_le in balanced binary maps where (mostly short) strings are keys should improve both space usage and speed.


For 32-bit architectures, the length of short strings would be at most 3, which is such a low limit that the optimised representation is probably less useful, but still good to have for 32-bit architectures as long as it can be turned off if users find it's harmful in their application.


To implement this in the CakeML compiler, one needs to:

  1. make BVL, BVI, DataLang aware (both in the compiler state and semantics) of whether we are running in 64-bit mode.
  2. update clos_to_bvl to map short ByteVectors to Number when possible
  3. update data_to_word for this new representation of strings; the implementation of some string operations might need to move to data_to_word from bvl_to_bvi or be supported by new primitive ops that allows them to be implemented in bvl_to_bvi.
  4. add fast_str_le to the source semantics and implement it in data_to_word
  5. add a new compiler flag that can be used to disable this optimisation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant