I am using Visual C++ 2005 Express Edition.
Here are the contents of a C++ file:

#include <stdio.h>

int main(void)
{
    printf( "1. size : %d\n", sizeof('A'));
    printf( "2. size : %d\n", sizeof('\x41'));
    printf( "3. size : %d\n", sizeof('\u00F3'));
    printf( "4. size : %d\n", sizeof('\u01F3'));
    printf( "5. size : %d\n", sizeof('\uFB94'));
    printf( "6. size : %d\n", sizeof('\U0001FB94'));
    return 0;
}

And we get the following results:

1. size : 1
2. size : 1
3. size : 1
4. size : 1
5. size : 1
6. size : 4

Is it correct to have sizeof('\U0001FB94') == 4? '\U0001FB94' is an
ordinary character literal and it has the char type whose size is
always 1.

Re: sizeof peculiarity ? by Alex

Alex
Sun Jun 25 05:37:49 CDT 2006

cristi wrote:
> Is it correct to have sizeof('\U0001FB94') == 4?
> '\U0001FB94' is an ordinary character literal and it
> has the char type whose size is always 1.

You should get compiler warnings when compiling this code.
With the exception of 0x24 and 0x40, characters in the range
of 0 to 0x20 and 0x7f to 0x9f cannot be represented with a
universal character name.

What happened is that the preprocessor (not the compiler)
converted the above literals into characters that the
compiler does not accept. The '\U0001FB94' literal was
converted into an int. If you want to store Unicode
characters, then use wide characters:

L'\x00F3', '\x01F3', etc..

The literal '\U0001FB94' won't fit into a single wide
character. You will need a wide string literal holding its
UTF-16 surrogate pair:

L"\xD83E\xDF94"


HTH
Alex



Re: sizeof peculiarity ? by Alex

Alex
Sun Jun 25 05:43:40 CDT 2006

> [...] If you want to store Unicode
> characters, then use widechars:
> L'\x00F3', L'\x01F3', etc..
             ^

Sorry for the typo.



Re: sizeof peculiarity ? by cristi

cristi
Sun Jun 25 08:21:01 CDT 2006

> > Is it correct to have sizeof('\U0001FB94') == 4?
> > '\U0001FB94' is an ordinary character literal and it
> > has the char type whose size is always 1.
>
> You should get compiler warnings when compiling this code.
Yes. I get compiler warnings for the following lines:

printf( "4. size : %d\n", sizeof('\u01F3'));
printf( "5. size : %d\n", sizeof('\uFB94'));
printf( "6. size : %d\n", sizeof('\U0001FB94'));

and here are the warnings:

warning C4566: character represented by universal-character-name '\u01F3'
cannot be represented in the current code page (1250)
warning C4566: character represented by universal-character-name '\uFB94'
cannot be represented in the current code page (1250)
warning C4566: character represented by universal-character-name
'\UD83EDF94' cannot be represented in the current code page (1250)

We get the same warning for all the above lines. I am a Unicode novice and
playing with the compiler. My intention was *not* to use wide character
literals. The standard mentions that ordinary character literals containing a
single c-char are of type char. For the first two lines the behaviour is
standard. Even if the characters specified by the universal character names
do not fit in a char, the standard behaviour is preserved - that is, the type
is char - and we get a size of 1 (of course, the value is implementation
defined). But for the last line the type is changed. It seems very strange to
me.

The section in the standard (ISO/IEC 14882:1998 or ISO/IEC 14882:2003) I am
referencing is 2.13.2.

thx



Re: sizeof peculiarity ? by Alex

Alex
Sun Jun 25 10:19:12 CDT 2006

cristi wrote:
> I get compiler warnings for the following lines:
>
> printf( "4. size : %d\n", sizeof('\u01F3'));
> printf( "5. size : %d\n", sizeof('\uFB94'));
> printf( "6. size : %d\n", sizeof('\U0001FB94'));
>
> and here are the warnings:
>
> warning C4566: character represented by
> universal-character-name '\u01F3' cannot be represented
> in the current code page (1250)
> warning C4566: character represented by
> universal-character-name '\uFB94' cannot be represented
> in the current code page (1250)
> warning C4566: character represented by
> universal-character-name '\UD83EDF94' cannot be
> represented in the current code page (1250)
>
> We get the same warning for all the above lines. I am a
> Unicode novice and playing with the compiler. My
> intention was *not* to use wide character literals. The
> standard mentions that ordinary character literals
> containing a single c-char are of type char. For the
> first two lines the behaviour is standard. Even if the
> characters specified by the universal character names do
> not fit in a char, the standard behaviour is preserved -
> that is, the type is char - and we get a size of 1 (of
> course, the value is implementation defined). But for the
> last line the type is changed. It seems very strange to
> me.

It's implementation defined. Here's what paragraph 2.13.2/5
says:

"A universal-character-name is translated to the encoding,
in the execution character set, of the character named. If
there is no such encoding, the universal-character-name is
translated to an implementation defined encoding."

The execution character set for the MS compiler is the ASCII
character set (with the current system locale's code page for
extended characters).
See here for more info:

"Phases of Translation"
http://msdn2.microsoft.com/en-us/library/bxss3ska.aspx

Why the MS compiler translates \uNNNN into a single-byte
character and \UNNNNNNNN into an integer is beyond me.



Re: sizeof peculiarity ? by Tim

Tim
Sun Jun 25 20:19:23 CDT 2006

cristi <cristi@discussions.microsoft.com> wrote:

>Is it correct to have sizeof('\U0001FB94') == 4? '\U0001FB94' is an
>ordinary character literal and it has the char type whose size is
>always 1.

Microsoft's compilers have always supported 2-byte and 4-byte
(multicharacter) literals. The standard permits these, with type int and an
implementation-defined value:

unsigned short ab = 'AB'; // hex value 4142
unsigned int abcd = 'ABCD'; // hex value 41424344

I don't think you can complain about the compiler's behavior in this case.
0001FB94 is not a valid Unicode code point, so there is no way to
determine whether it maps to one or more characters in the current
character set.

It's interesting to look at the hex values generated from those constants.
Respectively:

C:\tmp>x
1. size : 1
2. size : 1
3. size : 1
4. size : 1
5. size : 1
6. size : 1
1. char : 41
2. char : 41
3. char : f3
4. char : 3f
5. char : 3f
6. char : 3f3f
C:\tmp>

The unknowns in 4 and 5 translate as ?, and the unknown in 6 translates as
??.
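
The '?' substitution can be illustrated with a small sketch. This is not the
compiler's actual conversion routine: narrow_to_latin1 is a hypothetical
helper, and it assumes ISO 8859-1 instead of code page 1250 so that the
mapping stays a one-liner. It also needs a C++11 compiler for constexpr, and
it does not model case 6, where each half of the surrogate pair became its
own '?'.

```cpp
// Hypothetical helper, NOT the compiler's real algorithm: narrow a Unicode
// code point to one byte of ISO 8859-1 (an assumption; the thread uses code
// page 1250). Code points with no encoding in the target set become '?'.
constexpr unsigned char narrow_to_latin1(unsigned long code_point) {
    return code_point <= 0xFF ? static_cast<unsigned char>(code_point)
                              : static_cast<unsigned char>(0x3F);  // '?'
}

// Cases 3, 4 and 5 from the listing above.
constexpr unsigned char case3 = narrow_to_latin1(0x00F3);  // representable
constexpr unsigned char case4 = narrow_to_latin1(0x01F3);  // falls back
constexpr unsigned char case5 = narrow_to_latin1(0xFB94);  // falls back
```

Under these assumptions case 3 keeps its value while cases 4 and 5 collapse
to the same fallback byte, which is the pattern in the hex dump.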
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Re: sizeof peculiarity ? by cristi

cristi
Mon Jun 26 09:41:02 CDT 2006

> Microsoft's compilers have always supported an extension allowing for
> 2-byte and 4-byte character literals:
>
> unsigned short ab = 'AB'; // hex value 4241
> unsigned int abcd = 'ABCD'; // hex value 44434241
>
> I don't think you can complain about the compiler's behavior in this case.
> 0001FB94 is not a valid Unicode code point, so there is no way to
> determine whether it maps to one or more characters in the current
> character set.

It sounds very interesting and useful to me to hear that.

Why is \U0001FB94 not a valid Unicode code point? It identifies a
Unicode character outside the BMP and it is also defined by the C++
standard. Probably Visual C++ does not consider it a valid one because
wchar_t is only 2 bytes long.


I thought it was OK to use such a universal character name in a wide
string literal. The following piece of code:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    printf( "length: %d\n", wcslen(L"A\U0001FB94") );
    return 0;
}

shows a length of 3. I didn't look at the encoding. But, having read
other documents specifying that Win32 is UTF-16 (and seeing that we
can use Japanese/Chinese characters), I thought the compiler encodes
every wide string literal in UTF-16.

Then, yesterday I read section 2.13.4/5 of the standard about the
length of a wide string literal. The standard's definition, "The
size of a wide string literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for the
terminating L'\0'", does not match the result of the above
piece of code. So your explanation probably clarifies why the above
piece of code reports a length of 3.
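
For comparison, here is a sketch of the same count that does not depend on
the size of wchar_t. It assumes a C++11 compiler (the char16_t and char32_t
literals used here did not exist in Visual C++ 2005), and code_units is a
hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// In UTF-16 the literal is 'A' plus a surrogate pair;
// in UTF-32 it is 'A' plus a single 32-bit unit.
constexpr std::size_t utf16_units = code_units(u"A\U0001FB94");
constexpr std::size_t utf32_units = code_units(U"A\U0001FB94");
```

The UTF-16 count is 3 and the UTF-32 count is 2, so the length depends on
the width of the code unit, not only on the number of
universal-character-names written in the source.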

Does all this mean that we cannot use Japanese/Chinese characters (or
characters outside the BMP) in wide string/character literals?





Re: sizeof peculiarity ? by Tim

Tim
Wed Jun 28 00:10:22 CDT 2006

cristi <cristi@discussions.microsoft.com> wrote:

>> Microsoft's compilers have always supported an extension allowing for
>> 2-byte and 4-byte character literals:
>>
>> unsigned short ab = 'AB'; // hex value 4241
>> unsigned int abcd = 'ABCD'; // hex value 44434241
>>
>> I don't think you can complain about the compiler's behavior in this case.
>> 0001FB94 is not a valid Unicode code point, so there is no way to
>> determine whether it maps to one or more characters in the current
>> character set.
>
>It sounds very interesting and useful to me to hear that.
>
>Why is \U0001FB94 not a valid Unicode code point?

Because it does not represent any character. Not every random bit sequence
actually represents a Unicode code point. As near as I can tell, none of
the code points in 1FBxx are defined.

An ANSI string cannot contain Unicode characters. When you embed a Unicode
character in a non-Unicode string, as you have done, the compiler has to
translate that to ANSI in some way, based on the code page currently in use
(I believe). With some code pages, Japanese characters CAN be represented
in an 8-bit string, but only by using multibyte (lead byte plus trail byte)
sequences. Thus, a single Unicode escape sequence in an ANSI string might
map to more than one byte.

In order for the compiler to know that, it has to know exactly which real
character the Unicode code point represents. \U0001FB94 does not map to
any real character, so there is no criterion the compiler can use to decide
how wide the equivalent 8-bit representation would be. Hence, you get
garbage.

>It identifies a
>unicode character outside BMP and it is also defined by the C++
>standard. Probably Visual C does not consider it a valid one because
>wchar_t is only 2 bytes long.

The FORMAT of the escape sequence is defined. The MEANING of escape
sequences that are not part of Unicode 4.0 is not defined.

Also, remember that you were NOT defining a wide character constant. You
defined a NARROW (8-bit) constant, using a Unicode escape sequence. If you
had defined them as Unicode constants, you would have received very
different results. You had:
printf( "6. size : %d\n", sizeof('\U0001FB94') );
if you had tried this:
printf( "6. size : %d\n", sizeof(L'\U0001FB94') );
you would have seen that ALL of the constants were 2 bytes in size (which,
it occurs to me, is incorrect in the 1FB94 case).

>I thought it is ok to use in a wide string literal such a universal
>character name. The following piece of code:
>
>int main(void)
>{
> printf( "length: %d\n", wcslen(L"A\U0001FB94") );
> return 0;
>}
>
>shows a length of 3. I didn't look to the encoding. But, having read
>other documents specifying that Win32 is UTF-16 (and seeing that we
>can use japanese/chinese characters) I thought the compiler encodes
>all the wide string literal in UTF-16.

Yes. The compiler encodes your string as the three UTF-16 words 0x0041
0xD83E 0xDF94. The range D800 through DFFF is reserved for UTF-16
surrogates, which represent code points above U+FFFF. The D83E/DF94 pair
is the UTF-16 representation of 0001FB94. In this case, the compiler has
done exactly the right thing: the string L"A\U0001FB94" DOES contain three
characters (16-bit code units) in UTF-16.
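
The surrogate arithmetic can be sketched as follows. The function names are
hypothetical, the code assumes a C++11 compiler for constexpr, and it
assumes the input is a code point in the range U+10000 to U+10FFFF (no
validation is done).

```cpp
// Sketch of UTF-16 surrogate-pair encoding for a code point above U+FFFF.
// Subtract 0x10000 to get a 20-bit value, then split it into two 10-bit
// halves. Assumes 0x10000 <= code_point <= 0x10FFFF.
constexpr unsigned high_surrogate(unsigned long code_point) {
    return 0xD800u + static_cast<unsigned>((code_point - 0x10000) >> 10);
}

constexpr unsigned low_surrogate(unsigned long code_point) {
    return 0xDC00u + static_cast<unsigned>((code_point - 0x10000) & 0x3FF);
}

// The code point from this thread.
constexpr unsigned hi_1fb94 = high_surrogate(0x1FB94);  // 0xD83E
constexpr unsigned lo_1fb94 = low_surrogate(0x1FB94);   // 0xDF94
```

For 0x1FB94 the 20-bit value is 0xFB94; its top ten bits are 0x3E and its
bottom ten bits are 0x394, which gives the pair shown in the comments.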

>All this means that we cannot use japanese/chinese (or characters
>outside BMP) characters in wide string/character literals?

No, no, no!! Remember that you were *NOT* defining a wide character
literal! You were using a Unicode escape sequence in a NARROW character
literal. 'A' is a narrow character literal. L'A' is a wide character
literal.
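
The distinction can be checked at compile time. This is only a sketch: it
assumes a C++11 compiler (decltype and constexpr are not available in
Visual C++ 2005), and same_type is a hypothetical helper.

```cpp
#include <type_traits>

// Hypothetical helper: true when T and U are the same type.
template <typename T, typename U>
constexpr bool same_type() {
    return std::is_same<T, U>::value;
}

// In C++, a narrow literal with a single c-char has type char,
// and a literal with the L prefix has type wchar_t.
constexpr bool narrow_is_char    = same_type<decltype('A'), char>();
constexpr bool wide_is_wchar_t   = same_type<decltype(L'A'), wchar_t>();
constexpr bool narrow_size_is_one = (sizeof('A') == 1);
```

The size of wchar_t itself is implementation-defined (2 bytes with
Microsoft's compilers), so the sketch checks only the types, not
sizeof(L'A').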

Further, you were using an undefined character. If you used a Japanese
character outside of BMP that had an encoding in your current code page,
the compiler would properly produce a narrow character literal for it.

Re: sizeof peculiarity ? by Kevin

Kevin
Thu Jun 29 15:05:39 CDT 2006

The implementation of Unicode is to blame. Most people think it means two-byte
characters, whereas it is actually variable-length, up to six bytes.


--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up

Re: sizeof peculiarity ? by Tim

Tim
Thu Jun 29 23:51:34 CDT 2006

Kevin D. Quitt <KQuitt@IEEInc.com> wrote:
>
>The implementation of Unicode is to blame. Most people think it means two-byte
>characters, whereas it is actually variable-length, up to six bytes.

No, it is not. Unicode does not have a length. Unicode just defines the
characters; its code space runs from U+0000 to U+10FFFF, so a code point
needs at most 21 bits. It is the encoding that defines the sizing. What
Microsoft calls Unicode is the UTF-16 encoding, where each code point is
represented by one or two 16-bit words.

What you're thinking of is UTF-8. The original UTF-8 design allowed a
single code point to reach 6 bytes; since RFC 3629 the limit is 4.
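
How the encoding, and not Unicode itself, determines the size can be
sketched like this. The function names are hypothetical, the code assumes a
C++11 compiler, and it assumes the modern definition of UTF-8 (RFC 3629),
which stops at four bytes because the code space ends at U+10FFFF. No
validation of the input is done.

```cpp
// Number of 8-bit code units UTF-8 (as defined by RFC 3629) needs for a
// code point. Assumes code_point <= 0x10FFFF.
constexpr int utf8_length(unsigned long code_point) {
    return code_point < 0x80    ? 1
         : code_point < 0x800   ? 2
         : code_point < 0x10000 ? 3
                                : 4;
}

// Number of 16-bit code units UTF-16 needs for the same code point.
constexpr int utf16_length(unsigned long code_point) {
    return code_point < 0x10000 ? 1 : 2;
}
```

For the literals in this thread, U+0041 takes one byte in UTF-8, U+00F3 and
U+01F3 take two, U+FB94 takes three, and U+1FB94 takes four bytes in UTF-8
or two 16-bit words in UTF-16.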

Re: sizeof peculiarity ? by Kevin

Kevin
Sun Jul 02 18:25:08 CDT 2006

On Thu, 29 Jun 2006 21:51:34 -0700, Tim Roberts <timr@probo.com> wrote:
>What you're thinking of is UTF-8. That's where a single code point can
>reach 6 bytes.

Thanks for the clarification. I knew that of course. *sigh*

