I was just wondering why _bstr_t strings always use 2 bytes per
character. Are they always Unicode compatible?

Thank you.

Re: 2 bytes per character by Brian

Brian
Thu Oct 26 17:06:26 CDT 2006

More than compatible. They are UNICODE.

Brian

<mike7411@gmail.com> wrote in message
news:1161895882.706780.69540@b28g2000cwb.googlegroups.com...
>I was just wondering why _bstr_t strings always use 2 bytes per
> character. Are they always Unicode compatible?
>
> Thank you.
>



Re: 2 bytes per character by David

David
Thu Oct 26 17:09:28 CDT 2006

mike7411@gmail.com wrote:

> I was just wondering why _bstr_t strings always use 2 bytes per
> character. Are they always Unicode compatible?
>
> Thank you.
>

mike:

Yes. _bstr_t wraps a BSTR, which is a wide character string with the
length pre-pended. BSTR is designed to be used with COM, which has to
work with VB, which always uses 16-bit strings.

_bstr_t has constructor and conversion operator for const char* (using
the current ANSI code page), which makes it easy to use (or abuse) in an
ANSI build application.
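
As a rough illustration of the "length prepended" layout, here is a portable sketch that does not use the real BSTR API. `ModelBstr` is a hypothetical stand-in, and it assumes the documented layout of a 4-byte byte count, then the 16-bit code units, then a 2-byte terminator:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical model of a BSTR allocation (not the real API):
// [4-byte byte count][UTF-16 code units][2-byte zero terminator]
struct ModelBstr {
    std::vector<std::uint8_t> storage;

    explicit ModelBstr(const std::u16string& text) {
        const std::uint32_t byte_len =
            static_cast<std::uint32_t>(text.size() * sizeof(char16_t));
        // Zero-filled, so the trailing terminator is already in place.
        storage.resize(sizeof(byte_len) + byte_len + sizeof(char16_t), 0);
        std::memcpy(storage.data(), &byte_len, sizeof(byte_len));
        std::memcpy(storage.data() + sizeof(byte_len), text.data(), byte_len);
    }

    // Length in bytes, read back from the prefix rather than by scanning.
    std::uint32_t byte_length() const {
        std::uint32_t n = 0;
        std::memcpy(&n, storage.data(), sizeof(n));
        return n;
    }

    // Length in 16-bit code units, which is what a wide-string API would report.
    std::size_t unit_length() const {
        return byte_length() / sizeof(char16_t);
    }
};
```

With `ModelBstr b(u"abc")`, the prefix holds 6 (bytes) and the unit length is 3; the whole allocation is 12 bytes.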

David Wilkinson

Re: 2 bytes per character by Brian

Brian
Thu Oct 26 17:12:16 CDT 2006

Actually, there are exceptions that are worth mentioning as an addendum. It
is possible to use BSTRs as a container for other information, including
arbitrary binary data. But the original concept of BSTRs was to house
UNICODE strings.

Brian




Re: 2 bytes per character by peter

peter
Fri Oct 27 08:51:16 CDT 2006


Brian Muth wrote:
> More than compatible. They are UNICODE.
They could also be Unicode with eight-bit bytes.
>
> Brian
>
> <mike7411@gmail.com> wrote in message
> news:1161895882.706780.69540@b28g2000cwb.googlegroups.com...
> >I was just wondering why _bstr_t strings always use 2 bytes per
> > character. Are they always Unicode compatible?

More precisely, I believe a _bstr_t is encoded as UTF-16. So depending
on what character you "put in", it will consume two or four bytes.
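
The "two or four bytes" point can be sketched in portable C++. The function name `encode_utf16` is made up for this illustration, and it assumes the input is a valid Unicode scalar value (no range checking):

```cpp
#include <string>

// Encode one code point as UTF-16.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not itself a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit (2 bytes).
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary plane: a surrogate pair, two 16-bit units (4 bytes).
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+0041 comes out as the single unit 0x0041, while U+1F600 comes out as the pair 0xD83D 0xDE00.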

> >
> > Thank you.
> >


/Peter


Re: 2 bytes per character by Alex

Alex
Fri Oct 27 09:49:43 CDT 2006

<peter.koch.larsen@gmail.com> wrote:
>>> I was just wondering why _bstr_t strings always use 2
>>> bytes per
>>> character. Are they always Unicode compatible?
>
> More precisely, I believe a _bstr_t is encoded as UTF-16.
> So depending
> on what character you "put in", it will consume two or
> four bytes.


I never heard of a character in BSTR that consumes 4 bytes.
Moreover, it would break a lot of existing code if it was
so.



Re: 2 bytes per character by Igor

Igor
Fri Oct 27 10:05:02 CDT 2006

Alex Blekhman <xfkt@oohay.moc> wrote:
> <peter.koch.larsen@gmail.com> wrote:
>> More precisely, I believe a _bstr_t is encoded as UTF-16.
>> So depending
>> on what character you "put in", it will consume two or
>> four bytes.
>
> I never heard of a character in BSTR that consumes 4 bytes.
> Moreover, it would break a lot of existing code if it was
> so.

Read about surrogate pairs: http://en.wikipedia.org/wiki/UTF-16
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925



Re: 2 bytes per character by Alex

Alex
Fri Oct 27 10:49:16 CDT 2006

"Igor Tandetnik" wrote:
>> I never heard of a character in BSTR that consumes 4
>> bytes.
>> Moreover, it would break a lot of existing code if it was
>> so.
>
> Read about surrogate pairs:
> http://en.wikipedia.org/wiki/UTF-16


I'm aware of surrogate pairs. However, I was under the
impression that while the OS supports (to some extent)
surrogate pairs, BSTR is locked to UCS-2.



Re: 2 bytes per character by Igor

Igor
Fri Oct 27 11:11:48 CDT 2006

Alex Blekhman <xfkt@oohay.moc> wrote:
> "Igor Tandetnik" wrote:
>>> I never heard of a character in BSTR that consumes 4
>>> bytes.
>>> Moreover, it would break a lot of existing code if it was
>>> so.
>>
>> Read about surrogate pairs:
>> http://en.wikipedia.org/wiki/UTF-16
>
>
> I'm aware of surrogate pairs. However, I was under the
> impression that while the OS supports (to some extent)
> surrogate pairs, BSTR is locked to UCS-2.

What precisely would stop one from putting a surrogate pair into a BSTR?
It just carries bytes around. In what sense may BSTR "not support"
surrogate pairs?
--
With best wishes,
Igor Tandetnik




Re: 2 bytes per character by Alex

Alex
Fri Oct 27 12:04:47 CDT 2006

"Igor Tandetnik" wrote:
>> I'm aware of surrogate pairs. However, I was under the
>> impression that while the OS supports (to some extent)
>> surrogate pairs, BSTR is locked to UCS-2.
>
> What precisely would stop one from putting a surrogate
> pair into a BSTR? It just carries bytes around. In what
> sense may BSTR "not support" surrogate pairs?


Of course, nothing can stop a developer from putting a
surrogate pair into a BSTR. The problem with "support" can
arise from other components that work with BSTRs. (Not to
mention being prohibited by the OLE specification.) For
example, SysStringLen will return an incorrect length,
marshalling such a BSTR will corrupt its content, etc. I'm
concerned about support for surrogate pairs because, when one
writes COM-related code, many of the facilities that make it
work are provided by the system. I'm not sure that in all
those places no assumptions were made about the content of a
BSTR.

I can give an example of such assumptions. One day we tried
to transmit binary data via a BSTR. Such usage is allowed by
OLE Automation and described in the SysAllocStringByteLen
documentation. So, embedded zeros should be "supported" by
BSTR. However, the simplest example proved that the DCOM
marshalling code (or whatever else) assumed that a BSTR won't
contain a zero in the middle. Therefore, the binary data was
truncated at the first zero encountered.


Re: 2 bytes per character by Igor

Igor
Fri Oct 27 12:30:32 CDT 2006

Alex Blekhman <xfkt@oohay.moc> wrote:
> "Igor Tandetnik" wrote:
>> What precisely would stop one from putting a surrogate
>> pair into a BSTR? It just carries bytes around. In what
>> sense may BSTR "not support" surrogate pairs?
>
>
> Of course, nothing can stop a developer from putting a
> surrogate pair into a BSTR. The problem with "support" can
> arise from other components that work with BSTRs. (Not to
> mention being prohibited by the OLE specification.) For
> example, SysStringLen will return an incorrect length

It will return the length in wchar_t's, consistent with all the other
Win32 APIs.
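
To make the distinction concrete, here is a portable sketch (no Windows headers; `count_code_points` is a hypothetical helper) that compares the length in 16-bit units, which is what a wchar_t-based length reports, with the number of code points:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point;
// an unpaired surrogate is counted as one on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) {
            ++i;  // skip the second half of the pair
        }
        ++n;
    }
    return n;
}
```

For `u"a\U0001F600"`, `size()` is 3 (code units) while `count_code_points` gives 2. Neither number is "incorrect"; they answer different questions.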

> marshalling such a BSTR will
> corrupt its content

Are you sure? Corrupt in what way? As far as I can tell, BSTR is treated
as a binary buffer for marshalling purposes. It may even have an odd
number of bytes.

> I'm not sure that in all those places no assumptions
> were made about the content of a BSTR.

I don't think there are any assumptions about contents of BSTRs in COM
runtime - witness SysAllocStringByteLen et al.

> I can give an example of such assumptions. One day we tried
> to transmit binary data via a BSTR. Such usage is allowed by
> OLE Automation and described in the SysAllocStringByteLen
> documentation. So, embedded zeros should be "supported" by
> BSTR. However, the simplest example proved that the DCOM
> marshalling code (or whatever else) assumed that a BSTR won't
> contain a zero in the middle.

Can you show a repro? I've dealt with BSTRs containing embedded NULs on
a few occasions, and I believe COM runtime handles these just fine. Is
it possible you were using some library on top of raw COM that choked on
these? E.g. most methods of ATL's CComBSTR would truncate a string with
embedded NULs.
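
The kind of truncation a wrapper can introduce is easy to show without COM at all. This is only a portable sketch with hypothetical helper names; it models "counted" versus "NUL-terminated" handling and does not call any real BSTR or ATL function:

```cpp
#include <cstddef>
#include <string>

// Length as a counted string sees it: embedded NULs are ordinary data.
std::size_t counted_length(const std::u16string& s) {
    return s.size();
}

// Length as NUL-terminated code sees it: scanning stops at the first zero unit.
std::size_t terminated_length(const std::u16string& s) {
    return std::char_traits<char16_t>::length(s.c_str());
}
```

For `std::u16string s(u"ab\0cd", 5)`, the counted length is 5 but the terminated length is 2. Any layer that re-measures the string by scanning for a terminator will silently drop everything after the first NUL.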
--
With best wishes,
Igor Tandetnik




Re: 2 bytes per character by Alex

Alex
Fri Oct 27 17:31:16 CDT 2006

"Igor Tandetnik" wrote:
> I don't think there are any assumptions about contents of
> BSTRs in COM runtime - witness SysAllocStringByteLen et
> al.

I agree with you in general. I don't believe either that any
assumptions about BSTR content were made deliberately. What
I'm concerned about is that support for surrogate pairs is
kind of a gray area. The Platform SDK itself has reservations
when it speaks about surrogate pairs:

"Surrogates and Supplementary Characters"
http://windowssdk.msdn.microsoft.com/en-us/library/ms776414.aspx

<quote>
Windows 2000 introduced support for basic input, output, and
simple sorting of supplementary characters. However, not all
system components are compatible with supplementary
characters.
</quote>

>> I can give an example of such assumptions. One day we
>> tried to transmit binary data via a BSTR. Such usage is
>> allowed by OLE Automation and described in the
>> SysAllocStringByteLen documentation. So, embedded zeros
>> should be "supported" by BSTR. However, the simplest
>> example proved that the DCOM marshalling code (or
>> whatever else) assumed that a BSTR won't contain a zero
>> in the middle.
>
> Can you show a repro? I've dealt with BSTRs containing
> embedded NULs on a few occasions, and I believe COM
> runtime handles these just fine. Is it possible you were
> using some library on top of raw COM that choked on these?
> E.g. most methods of ATL's CComBSTR would truncate a
> string with embedded NULs.

Yes, all popular BSTR wrappers fail to recognize embedded
NULs, so no wrappers were used. I can't vouch that the
problem was inside the COM runtime, DCOM, the network driver,
or something else. It was enough to discover that the data
didn't come through in its entirety.

I'll try to reproduce it with Windows XP; I just don't have a
Windows 2000 box available anymore. The original system was
running Win2K as the workstation and some version of
Novell as the domain server.


Re: 2 bytes per character by Eugene

Eugene
Sat Oct 28 00:49:54 CDT 2006

Alex Blekhman wrote:
> What I'm
> concerned about is that support for surrogate pairs is kind
> of a gray area.

This is because "support" is a meaningless term here.

As with any multi-unit encoding, as long as you treat the string as an
indivisible unit (marshalling, concatenating, copying, etc.), you don't need
to worry about multi-unit entries. If you need to look at individual units
(display in a GUI, split on a character boundary, search, etc.), then you *may*
have problems. Again, this applies to anything that has multiple units:
Shift-JIS, UTF-8, or UTF-16.
For surrogate pairs in UTF-16 you get a stronger guarantee because of the
way it is designed. Either member of a surrogate pair is in a range that
doesn't overlap "normal" single-unit characters. Which means that you can
safely search for a single-unit wchar_t or split a string on a single-unit
boundary.
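
A small portable sketch of that guarantee (all helper names here are made up for illustration): splitting on a single-unit delimiter can never land inside a surrogate pair, because no ordinary character shares a value with either half.

```cpp
#include <cstddef>
#include <string>
#include <utility>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// True if every surrogate in s is part of a properly ordered pair.
bool is_well_formed(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low half we just validated
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate with no preceding high surrogate
        }
    }
    return true;
}

// Split at the first occurrence of a single-unit (non-surrogate) delimiter.
// If the delimiter is absent, the second part is empty.
std::pair<std::u16string, std::u16string>
split_at(const std::u16string& s, char16_t delim) {
    const std::size_t pos = s.find(delim);
    if (pos == std::u16string::npos) return {s, std::u16string()};
    return {s.substr(0, pos), s.substr(pos + 1)};
}
```

Splitting `u"\U0001F600,\U0001F601"` at `u','` yields two well-formed halves of two units each. Cutting at an arbitrary index, such as `substr(0, 1)`, is what can leave a dangling surrogate.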

One final issue that sometimes arises with surrogate pairs is how to convert
them to UTF-8 (which is often used on the Internet). The canonical way is to
combine the pair into the corresponding Unicode code point and then encode
that code point as UTF-8. Software that is not aware of surrogate pairs will
instead treat each element as an individual Unicode character and convert it
in turn (creating two 3-byte sequences that are not legal UTF-8). Some systems
like Java even do it by design, which of course can lead to interoperability
headaches. This issue too is just a special case of the problems that arise
when you have to look at the individual units.
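
The two conversions can be compared side by side. This is a minimal portable sketch with hypothetical function names; it does no validation, and an unpaired surrogate is simply passed through as a 3-byte sequence:

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of cp (no validation of cp).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Surrogate-aware: combine each pair into one code point, then encode it.
std::string utf16_to_utf8(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        append_utf8(out, cp);
    }
    return out;
}

// Surrogate-unaware: encode every 16-bit unit on its own.
std::string per_unit_to_utf8(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) append_utf8(out, u);
    return out;
}
```

For U+1F600 (units 0xD83D 0xDE00), the surrogate-aware path gives the 4 bytes F0 9F 98 80, while the per-unit path gives the 6 bytes ED A0 BD ED B8 80, which a strict UTF-8 decoder will reject.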

Regarding Windows components/APIs the above should immediately give a sense
of which should work regardless of surrogate pairs and which may or may not
work.

With some exceptions most 3rd party software these days is not aware of
surrogate pairs. Which is IMHO completely reasonable.

--
Eugene