Have you considered implementing an ALTREP string class? I think that, done properly, it would give a large increase in performance across the board. There are several reasons why:
- Simpler data structures compared to R's heavy CHARSXP and R's global string cache
- Short string optimization
- The possibility of true multithreading (you can't multithread R internals)
If there's interest, I'd be happy to develop and work on it.
To flesh it out a bit, I think you could use an ALTREP class that's backed by standard STL structures.
You don't need to keep track of encoding if you can assume UTF-8.
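A minimal sketch of what such an STL-backed store could look like, assuming every element is UTF-8 and NA is tracked in a separate mask. The type `Utf8StringVector` and its members are hypothetical names used for illustration; they are not part of R or stringi, and the ALTREP class registration itself is left out.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backing store for an ALTREP string vector (not R's or stringi's API).
// Every element is assumed to be UTF-8, so no per-string encoding tag is kept.
struct Utf8StringVector {
    std::vector<std::string> values;  // string payloads, one per element
    std::vector<bool> na;             // true where the element is NA

    explicit Utf8StringVector(std::size_t n) : values(n), na(n, false) {}

    std::size_t size() const { return values.size(); }
    bool is_na(std::size_t i) const { return na[i]; }

    // Store a string and clear any NA mark on that slot.
    void set(std::size_t i, std::string s) {
        values[i] = std::move(s);
        na[i] = false;
    }

    // Mark a slot as NA; its payload is emptied.
    void set_na(std::size_t i) {
        values[i].clear();
        na[i] = true;
    }

    // Caller is expected to check is_na(i) first; an NA slot holds an empty string.
    const std::string& get(std::size_t i) const { return values[i]; }
};
```

Because each element owns its bytes, short strings can benefit from whatever small-string optimization the standard library's `std::string` provides, and nothing here touches R's global string cache.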
You'd probably want some global configuration parameter.
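One way such a parameter could be held on the C++ side is a process-wide flag that defaults to off, so existing behaviour is unchanged unless it is switched on. The namespace and function names below are assumptions made for this sketch; how the flag would be exposed to R (for example through an R option) is left open.

```cpp
#include <atomic>

namespace altstr_sketch {

// Hypothetical process-wide switch, off by default so existing code paths are unchanged.
inline std::atomic<bool>& flag() {
    static std::atomic<bool> value(false);
    return value;
}

// Turn the alternative string representation on or off.
inline void enable_alt_strings(bool on) { flag().store(on); }

// Query the current setting; safe to call from several threads.
inline bool alt_strings_enabled() { return flag().load(); }

}  // namespace altstr_sketch
```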
You'd have to replace every interaction with R string memory with a conditional:
- `CHAR`
- `SET_STRING_ELT`
- `STRING_ELT`
- `mkChar*`
- `Rf_allocVector(STRSXP,...)`
You'd also have to replace any comparison of addresses used to test string equality (I'm not sure whether stringi does this).
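The reason address comparison stops working can be shown in a few lines. With an interning cache, equal strings share one address, so comparing pointers is a valid shortcut; with independently stored `std::string` values, equal contents live at different addresses. Both helper names are hypothetical.

```cpp
#include <string>

// Pointer comparison: only meaningful when both strings come from one interning cache.
inline bool same_by_address(const char* a, const char* b) { return a == b; }

// Content comparison: valid for independently stored UTF-8 strings.
inline bool same_by_content(const std::string& a, const std::string& b) { return a == b; }
```

For two separate `std::string` objects that both hold `"abc"`, `same_by_address(a.c_str(), b.c_str())` is false while `same_by_content(a, b)` is true, so any pointer-equality shortcut would need to fall back to a content comparison on the alternative path.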
There are probably things I'm forgetting, and it's a lot of work, but I think the work is clearly defined.