Converting C string to CFStringRef

none

2005-06-30 21:37:48 UTC

I have a std::string (and thus a null-terminated C string) and need to
create a CFStringRef. There's the CFSTR() macro, but it only applies to
constant strings, not dynamic run-time strings. I have the following
code with some questions:

std::string strTemp;
// .
// fill in strTemp
// .
// .
CFStringRef cfStringRef;
cfStringRef = CFStringCreateWithCString(kCFAllocatorDefault,
                                        strTemp.c_str(),
                                        kCFStringEncodingMacRoman);
// .
// do something with cfStringRef
// .
// .
CFRelease(cfStringRef);

First, I'm using CFStringCreateWithCString() to create the CFStringRef.
Is there a better API to convert the C string to a CFStringRef?

Second, I'm passing kCFStringEncodingMacRoman for the CFStringEncoding
because, well, other code I've seen pass kCFStringEncodingMacRoman.
Will this hold up on non-English systems? What if I run my app on a
non-English system?

Thanks.

Chris Baum

2005-06-30 22:40:11 UTC

Post by none
First, I'm using CFStringCreateWithCString() to create the CFStringRef.
Is there a better API to convert the C string to a CFStringRef?

No.

Post by none
Second, I'm passing kCFStringEncodingMacRoman for the CFStringEncoding
because, well, other code I've seen pass kCFStringEncodingMacRoman.
Will this hold up on non-English systems? What if I run my app on a
non-English system?

No, it will be fine. Many languages are roman. Your c-string can't
possibly be Japanese/Arabic/etc. with 8-bit chars -- MacRoman is
correct when you don't have anything else to consider.

David Phillip Oster

2005-07-01 03:12:47 UTC

Post by Chris Baum

Post by none
First, I'm using CFStringCreateWithCString() to create the CFStringRef.
Is there a better API to convert the C string to a CFStringRef?

No.

You are, unfortunately, correct.

.c_str() often makes a copy into an internal buffer, since C++ standard
strings aren't necessarily null terminated. However, there is no call to
take an array of 8-bit characters and a length, so this is the best you
can do.

Post by Chris Baum

No, it will be fine. Many languages are roman. Your c-string can't
possibly be Japanese/Arabic/etc. with 8-bit chars -- MacRoman is
correct when you don't have anything else to consider.

You are, unfortunately, not correct. Just look at the encodings listed
in CFString.h : kCFStringEncodingWindowsLatin1,
kCFStringEncodingISOLatin1, kCFStringEncodingUTF8, and
kCFStringEncodingMacRoman all interpret high-bit-on characters
differently. If the string came from an old SimpleText file, in
Japanese, then it will have yet a different encoding.
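
A minimal sketch of that point, assuming nothing beyond standard C++: the single byte 0xE9 decodes to a different character depending on which encoding is assumed. The function names are hypothetical, and the MacRoman table is deliberately cut down to the two bytes discussed here rather than the full 128 high-bit entries.

```cpp
// ISO Latin-1 maps every byte straight to the code point with the same value.
inline char32_t DecodeLatin1(unsigned char byte) {
    return static_cast<char32_t>(byte);
}

// MacRoman agrees with ASCII below 0x80 but uses its own table above it.
// Only two high-bit entries are listed here (illustration only).
inline char32_t DecodeMacRomanSample(unsigned char byte) {
    if (byte < 0x80) return static_cast<char32_t>(byte);
    switch (byte) {
        case 0x8E: return 0x00E9;  // LATIN SMALL LETTER E WITH ACUTE
        case 0xE9: return 0x00C8;  // LATIN CAPITAL LETTER E WITH GRAVE
        default:   return 0xFFFD;  // not covered by this reduced table
    }
}

// In UTF-8 a byte with the high bit set is never a complete character:
// it is a lead byte or a continuation byte of a multi-byte sequence.
inline bool IsCompleteUtf8Character(unsigned char byte) {
    return byte < 0x80;
}
```

So 0xE9 is U+00E9 under ISO Latin-1, U+00C8 under MacRoman, and only a fragment of a character under UTF-8.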

The Unix system call interface to the Macintosh HFS+ file system uses
the UTF-8 encoding of Unicode, which turns any sequence of Unicode
characters into a sequence of 8-bit characters with no nulls, which goes
very nicely into an std::string.

Since UTF-8 has no nulls, and no endian issues, it works very nicely
with printf and C++ << operations.

To turn std::strings into CFStrings correctly, you have to know which
encoding the bytes are in.

--
David Phillip Oster

Chris Baum

2005-07-01 03:44:36 UTC

Post by David Phillip Oster
You are, unfortunately, not correct. Just look at the encodings listed
in CFString.h : kCFStringEncodingWindowsLatin1,
kCFStringEncodingISOLatin1, kCFStringEncodingUTF8, and
kCFStringEncodingMacRoman all interpret high-bit-on characters
differently. If the string came from an old SimpleText file, in
Japanese, then it will have yet a different encoding.

Thank you for the correction. Not sure why I assumed otherwise.

none

2005-07-06 04:48:04 UTC

Here's an update. Thanks for all of the great feedback everyone, here's
where I'm at now.

First, yes, I do know a little about Unicode but I'm a bit rusty, and
judging by the feedback this thread has generated, I'm not the only one.
Looking back at my original reply, I'm kicking myself... of course
kCFStringEncodingMacRoman isn't encoding independent (!!).

Next, someone mentioned std::string's c_str() method and
performance. Yes, it's left up to the STL implementation you are using,
but I seem to remember years back when I first looked at MW's STL
implementation that internally, the strings were null terminated and
calling c_str() didn't alloc a second buffer.

Anyway, I took a look at what I'm doing and what I need, and I probably
won't need to make this conversion after all (more on that later) but in
the meantime, some people asked about the use and where I was getting
the string from, and here's what I am (was) doing:

1. I need to get some strings from the system, and I'm getting a
   CFStringRef string from an Apple API.
2. I then convert the CFStringRef to a std::string (via
   CFStringGetCString()). Why? Well, I originally decided to convert the
   strings into C-style strings for a number of reasons:
   a. Although this project is Mac-only, I'm a Mac/Win cross-platform
      developer and prefer to develop using ANSI, platform-independent
      data types.
   b. Debugging. While stepping through code, you can see what the
      string is if it's a std::string but not a CFStringRef (at least
      under CW 8.3).
   c. I needed to combine two strings obtained from Apple APIs (and thus
      CFStringRefs) into a different string, and at the time, it seemed
      easier to keep everything as a std::string and manipulate the
      std::strings (but now I'm just going to use
      CFStringCreateWithFormat()).
3. Later on, when I call the function which needs these strings (an
   Apple API), I'd generate the new string:

CFStringRef cfDestination =
::CFStringCreateWithCString(kCFAllocatorDefault, strMyString.c_str(),
::GetApplicationTextEncoding());

But taking a look at what I really need the strings for, I've decided
to keep everything as a CFStringRef and use CFStringCreateWithFormat()
to manipulate the strings.

Thanks,

-Jay

Reinder Verlinde

2005-07-06 06:29:01 UTC

Post by none
Next, someone mentioned std::string's c_str() method and
performance. Yes, it's left up to the STL implementation you are using,
but I seem to remember years back when I first looked at MW's STL
implementation that internally, the strings were null terminated and
calling c_str() didn't alloc a second buffer.

I think you should be using string::data() and string::size() instead of
string::c_str(). If you use string::c_str:

- the implementation must add the zero terminator,
if it is not already there. In edge cases, this might take
considerable time (reallocating the data to fit in the zero)
- the CFString 'constructor' must walk the entire string
to find that zero
- your std::string can not contain zeroes
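
The last two points can be sketched in portable C++. This is only an illustration with hypothetical helper names: an API that scans for the terminating zero sees a shorter string than one that is handed data() and size().

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by an API that walks the buffer looking for the zero.
inline std::size_t LengthSeenThroughCStr(const std::string& s) {
    return std::strlen(s.c_str());
}

// Length as seen by an API that is given data() and size().
inline std::size_t LengthSeenThroughSize(const std::string& s) {
    return s.size();
}

// Builds a five-byte string with a zero byte in the middle.
inline std::string MakeStringWithEmbeddedZero() {
    return std::string("ab\0cd", 5);
}
```

For the string built by MakeStringWithEmbeddedZero(), the first helper reports 2 and the second reports 5, so anything after the embedded zero is silently dropped on the c_str() path.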

Reinder

Michael Ash

2005-07-06 10:18:57 UTC

Post by none
b. Debugging. While stepping through code, you can see what the
string is if it's a std::string but not a CFStringRef (at least under CW
8.3).

I seem to recall that the CW debugger is based on gdb. If that's true, and
*if* it gives you access to the gdb command line, then you can print
CFStringRefs by doing "print-object myStr". If it doesn't, then I'm sure
it has some mechanism for invoking a function from the debugger, and you can
tell it to invoke "CFShow(myStr)".

Post by none
3. Later on, when I call the function which needs these strings (an
Apple API), I'd generate the new string:
CFStringRef cfDestination =
::CFStringCreateWithCString(kCFAllocatorDefault, strMyString.c_str(),
::GetApplicationTextEncoding());
But taking a look at what I really need the strings for, I've decided
to keep everything as a CFStringRef and use CFStringCreateWithFormat()
to manipulate the strings.

If you were to stick with your old approach, then you really should use
UTF-8 instead of the application text encoding. UTF-8 is the only
C-string-compatible encoding which will be able to represent every
character that a CFString can contain. Of course, if you aren't doing
this, then there's no need.

And I forgot to paste one of my favorite links earlier in the thread. For
anybody who's following along, scratching their heads, and saying things
like, "Unicode? Isn't that really wasteful, and won't it break all of my
existing code? ASCII is the One True Encoding anyway, and supporting other
languages is too hard," this page is great reading:

http://www.joelonsoftware.com/articles/Unicode.html

Chris Baum

2005-07-06 18:49:35 UTC

Post by Michael Ash
I seem to recall that the CW debugger is based on gdb. If that's true, and
*if* it gives you access to the gdb command line, then you can print
CFStringRefs by doing "print-object myStr". If it doesn't, then I'm sure
it has some mechanism for invoking a function from the debugger, and you can
tell it to invoke "CFShow(myStr)".

Add DataViewer_MSL_Mach-O.lib to your project to view CFStrings (and
other opaque types) in the debugger. See the DataViewer_Notes.txt
release notes for instructions.

Ben Artin

2005-07-01 07:35:43 UTC

Post by David Phillip Oster

Post by none
First, I'm using CFStringCreateWithCString() to create the CFStringRef.
Is there a better API to convert the C string to a CFStringRef?

No.

You are, unfortunately, correct.
.c_str() often makes a copy into an internal buffer, since C++ standard
strings aren't necessarily null terminated. However, there is no call to
take an array of 8-bit characters and a length, so this is the best you
can do.

In a decent STL implementation, c_str() does not make a copy into a new buffer.
I think you should look at a few STL implementations and I believe you will find
that you were wrong when you said that c_str() often copies.

Also, CFStringCreateWithCString is a bad way to convert a std::string into a
CFStringRef because it does not handle embedded NULs correctly. I strongly
recommend using CFStringCreateWithBytes instead, passing string.c_str() and
string.size(). This will correctly handle embedded NULs and otherwise be
equivalent to CFStringCreateWithCString.
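
A rough sketch of that recommendation, not a definitive implementation. It assumes the std::string holds UTF-8, which is an assumption the caller has to verify. ByteRange, BytesOf and CreateCFStringFromUTF8 are hypothetical names; CFStringCreateWithBytes itself is declared in CFString.h. The CoreFoundation part is guarded so the portable helper still compiles elsewhere.

```cpp
#include <cstddef>
#include <string>

// Pointer/length pair describing the bytes held by a std::string.
struct ByteRange {
    const unsigned char* bytes;
    std::size_t length;
};

// Uses data() and size(), so embedded NULs are counted rather than
// treated as the end of the string.
inline ByteRange BytesOf(const std::string& s) {
    return ByteRange{reinterpret_cast<const unsigned char*>(s.data()), s.size()};
}

#if defined(__APPLE__)
#include <CoreFoundation/CoreFoundation.h>

// Assumes s holds UTF-8. The result may be NULL if the bytes cannot be
// decoded; otherwise the caller owns it and must CFRelease() it.
inline CFStringRef CreateCFStringFromUTF8(const std::string& s) {
    const ByteRange range = BytesOf(s);
    return CFStringCreateWithBytes(kCFAllocatorDefault,
                                   range.bytes,
                                   static_cast<CFIndex>(range.length),
                                   kCFStringEncodingUTF8,
                                   false);  // bytes carry no byte order mark
}
#endif
```

Passing the length explicitly is what lets a string such as "a", NUL, "b" arrive as three characters instead of one.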

Post by David Phillip Oster
The Unix system call interface to the Macintosh HFS+ file system uses
the UTF-8 encoding of Unicode, which turns any sequence of Unicode
characters into a sequence of 8-bit characters with no nulls, which goes
very nicely into an std::string.
Since UTF-8 has no nulls, and no endian issues, it works very nicely
with printf and C++ << operations.

UTF-8 has a NUL character, as does every other Unicode encoding, and as far as I
know a NUL character is valid in an HFS+ name, although it would be somewhat
ill-advised as I am sure many UNIX APIs would not handle that case well.
However, I have run into other cases where preserving embedded NULs was
important -- hence my recommendation to use CreateWithBytes.

(Also, minor nit: NUL is a character, NULL and nil are pointers.)
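
To make the NUL point concrete, here is a small UTF-8 encoder sketch (EncodeUtf8 is a hypothetical name, and this is an illustration rather than production code). Every code point other than U+0000 encodes to bytes that are all nonzero, while U+0000 itself encodes to a single zero byte.

```cpp
#include <string>

// Encodes one Unicode scalar value (U+0000..U+10FFFF, surrogates excluded)
// as UTF-8. Returns an empty string for values outside that range.
inline std::string EncodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                       // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));         // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));       // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;       // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));        // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));        // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because lead bytes always have the top bit set and continuation bytes always start with binary 10, a zero byte can only come from U+0000 itself.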

Ben

David Phillip Oster

2005-07-01 15:44:04 UTC

Post by Ben Artin
Also, CFStringCreateWithCString is a bad way to convert a std::string into a
CFStringRef because it does not handle embedded NULs correctly. I strongly
recommend using CFStringCreateWithBytes instead, passing string.c_str() and
string.size(). This will correctly handle embedded NULs and otherwise be
equivalent to CFStringCreateWithCString.

Thank you. I looked for that API, but missed it.

Post by Ben Artin

UTF-8 has a NUL character, as does every other Unicode encoding, and as far as I
know a NUL character is valid in an HFS+ name, although it would be somewhat
ill-advised as I am sure many UNIX APIs would not handle that case well.
However, I have run into other cases where preserving embedded NULs was
important -- hence my recommendation to use CreateWithBytes.
(Also, minor nit: NUL is a character, NULL and nil are pointers.)

I appreciate the corrections. Thanks.

--
David Phillip Oster

Jøhnny Fävòrítê (it means "A Device Which Is Meowing")

2005-07-01 03:37:41 UTC

nope, probably not. and in fact, given the way you've worded the question, it
makes me think you've got a lot to learn about character sets and encodings.

everybody has an opinion, but mine is that i think the best approach is to
treat all strings as some form of unicode, wherever possible. macosx's
keyboard drivers generate unicode. CFString and NSString can take many forms
internally, but the docs encourage you to think of them as containing a
sequence of unicode characters. (macosx appears to use 16-bit unicode
characters.) if you want to create a folder or file with non-ascii characters
in its name, you can pass a utf-8 string (unicode encoded in 8-bit characters)
to the unix file functions. macosx's xml files are usually encoded in utf-8.
and so on.

this is far too big of a topic to cover in a single reply, but since almost
nobody seems willing to invest the time to learn, the best advice seems to be:
when in doubt, use unicode. mac roman is old-skule macos9 thinking, i.e.,
dead. here's a good starting point, to learn more:

http://www.joelonsoftware.com/articles/Unicode.html

Mark Hamilton

2005-07-01 05:17:02 UTC

You might try this function:

GetApplicationTextEncoding()

I'm new to this myself, and ran across this in one of the Apple examples.

Michael Ash

2005-07-01 09:12:04 UTC

Post by Mark Hamilton

GetApplicationTextEncoding()
I'm new to this myself, and ran across this in one of the Apple examples.

The docs say:

Your application needs to use the application text encoding when
it creates a CFStringRef from text stored in Resource Manager
resources. Typically the text uses a Mac encoding such as MacRoman
or MacJapanese.

This is unlikely to be appropriate.

On OS X, it's a good idea to just assume that everything is Unicode of one
sort or another. When dealing with 8-bit strings, that means UTF-8. This
assumption will be correct for standard input for command-line utilities
(usually), for paths, and for many other things, whereas MacRoman and
MacJapanese have an approximately zero chance of being correct unless
you're dealing with legacy data such as Resource Manager text.
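
If the UTF-8 assumption needs checking before bytes are handed to a UTF-8 API, a validity test along these lines is one option. This is a sketch with a hypothetical function name, not something taken from CoreFoundation. Legacy MacRoman or Latin-1 text containing high-bit characters will usually fail it.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8: no stray or missing continuation
// bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
inline bool IsValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;   // continuation bytes expected after the lead
        char32_t cp;         // code point being assembled
        char32_t minimum;    // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minimum = 0x10000; }
        else return false;                       // stray continuation or bad lead
        if (extra > n - i - 1) return false;     // sequence cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minimum) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

For example, "caf" followed by the bytes C3 A9 passes, while "caf" followed by the MacRoman byte 8E or the Latin-1 byte E9 does not.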

Reinder Verlinde

2005-07-01 17:29:17 UTC

Post by none
I have a std::string (and thus a null-terminated C string) and need to
create a CFStringRef. There's the CFSTR() macro, but it only applies to
constant strings, not dynamic run-time strings. I have the following
[...]
Second, I'm passing kCFStringEncodingMacRoman for the CFStringEncoding
because, well, other code I've seen pass kCFStringEncodingMacRoman.
Will this hold up on non-English systems? What if I run my app on a
non-English system?

std::string is almost encoding-agnostic. It can hold any encoding that
uses 8-bit code units and never needs a NUL byte (and even those rules
have more or less functional workarounds). You are the only one who can
tell what the proper encoding is. Where did you get the string data?
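
One consequence of that agnosticism, sketched under the assumption that the string holds valid UTF-8 (CountUtf8CodePoints is a hypothetical helper): size() reports bytes, not characters, so the two numbers diverge as soon as the text leaves ASCII.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8: every byte
// that is not a continuation byte (binary 10xxxxxx) starts a new one.
inline std::size_t CountUtf8CodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For "caf" followed by the two bytes C3 A9, size() is 5 while the helper returns 4.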

Reinder