Hi.
I am trying to read Unicode-encoded files.
The file starts with the Unicode BOM (0xFEFF).

Here is the file content:

FF FE 42 00 45 00 47 00 49 00 4E 00

And I use the following code:

if ((fd = _open(buffer, _O_RDONLY)) != -1)
{
    while (_read(fd, &mem, 1) == 1)
        ;
    _close(fd);
}

When I check the contents of 'mem', they are:

42 45 47 49 4E

Why does this happen?

How can I read the Unicode BOM (0xFEFF)?

Re: How to read Unicode encoded files? by James

James
Thu Jul 26 01:20:29 CDT 2007


'buffer' is a terrible name for a filename variable.

How are 'fd' and 'mem' declared?


--
James Brown
Microsoft MVP - Windows SDK
www.catch22.net
Free Win32 Tutorials and Sourcecode


Re: How to read Unicode encoded files? by Ulrich

Ulrich
Thu Jul 26 02:29:29 CDT 2007

Anddy wrote:
> I try to read unicode encoded files.

Unicode (capital 'U') is not an encoding but a whole standard that defines
several encodings. Keep that in mind!

> File starts with unicode BOM (0xFEFF).
>
> Here's the file content.
>
> FF FE 42 00 45 00 47 00 49 00 4E 00

Okay, this looks like little-endian UTF-16 or UCS-2, both defined by the
Unicode standard. If I had the choice, I would prefer UTF-8 though.
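
As an illustration of the point above (not part of the original reply), the first bytes of a file can be compared against the well-known BOM byte sequences to guess the encoding. This is only a sketch: the `bom_kind` enum and the `detect_bom` function are hypothetical names, and only the UTF-8 and UTF-16 BOMs are covered, so a UTF-32 little-endian file (which starts with FF FE 00 00) would be reported as UTF-16 LE here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE
} bom_kind;

/* Inspect the first bytes of a buffer and report which BOM, if any, is present. */
static bom_kind detect_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16_LE;   /* U+FEFF stored low byte first */
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16_BE;   /* U+FEFF stored high byte first */
    return BOM_NONE;
}
```

For the bytes shown in the question (FF FE 42 00 ...), this would report little-endian UTF-16, provided the buffer holds the untranslated file contents.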

> And I use following code.
>
> if ((fd = _open(buffer, _O_RDONLY)) != -1)
> {
> while (_read(fd,&mem, 1) == 1)
> ;
> _close(fd);
> }
>
> When I check the contents of 'mem'.
>
> The contents of 'mem' are
>
> 42 45 47 49 4E
>
> Why this happen?
>
> How can I read Unicode BOM (0xFEFF)?

It might be that you are being bitten by a locale-specific conversion
performed by the C runtime implementation of VC8[1]. It sees that there is
a BOM and then transparently transcodes the file to the internally used
charset. In that case, I suggest that you use the 'binary' flag when opening
the file (which you should do anyway) and maybe invoke
'setlocale(LC_ALL, "C");' or something like that to set the locale to neutral.

Uli

[1] I hope I got this point right; I don't remember exactly what conversions
took place and where.


Re: How to read Unicode encoded files? by Nemanja

Nemanja
Thu Jul 26 14:33:59 CDT 2007

On Jul 26, 1:04 am, Anddy <An...@discussions.microsoft.com> wrote:
>
> if ((fd = _open(buffer, _O_RDONLY)) != -1)
> {
> while (_read(fd,&mem, 1) == 1)
> ;
> _close(fd);
> }
>

Why don't you read it in binary mode (_O_BINARY)? That way all
translations should be skipped.
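
For comparison, here is a minimal sketch of reading a file's raw bytes using only standard C, where the "b" in the mode string requests binary mode so that the runtime applies no text-mode translation. The helper name `read_raw_bytes` is hypothetical and not from the thread; with the Microsoft CRT, passing `_O_BINARY` to `_open` is the low-level counterpart of the "b" flag.

```c
#include <assert.h>
#include <stdio.h>

/* Read up to 'cap' raw bytes from 'path' into 'buf'.
   Returns the number of bytes read, or 0 if the file cannot be opened. */
static size_t read_raw_bytes(const char *path, unsigned char *buf, size_t cap)
{
    FILE *fp;
    size_t n;

    assert(path != NULL && buf != NULL);

    fp = fopen(path, "rb");   /* "b": binary mode, no text-mode translation */
    if (fp == NULL)
        return 0;

    n = fread(buf, 1, cap, fp);
    fclose(fp);
    return n;
}
```

If the returned buffer starts with FF FE, the BOM survived the read; if it does not, the file was most likely not opened in binary mode after all.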


Re: How to read Unicode encoded files? by Anddy

Anddy
Fri Jul 27 00:02:03 CDT 2007



"Nemanja Trifunovic" wrote:

> Why don't you read it in binary mode (_O_BINARY)? That way all
> translations should be skipped.
>
>
_O_RDONLY | _O_BINARY doesn't help with this problem.

That only affects the translation of "0D 0A" into "0A". That's all.

I already tried it.

Re: How to read Unicode encoded files? by Anddy

Anddy
Fri Jul 27 00:08:02 CDT 2007



"James Brown" wrote:

> 'buffer' is a terrible name for a filename variable.
>
> How are 'fd' and 'mem' declared?

Forget about the variable names. The code is just an example.

And I think you don't know anything about Unicode.


Re: How to read Unicode encoded files? by Anddy

Anddy
Fri Jul 27 00:18:02 CDT 2007



"Ulrich Eckhardt" wrote:

> It might be that you are being bitten by a locale-specific conversion
> performed by the C runtime implementation of VC8[1]. It sees that there is
> a BOM and then transparently transcodes the file to the internally used
> charset. In that case, I suggest that you use the 'binary' flag when opening
> the file (which you should do anyway) and maybe invoke
> 'setlocale(LC_ALL, "C");' or something like that to set the locale to neutral.
>
> Uli
>
> [1] I hope I get this point right, I don't exactly remember what and where
> these conversions took place.
>
>

I think your answer will help with my question.

I tried to trace into the "_read" function.

"_read" calls "ReadFile", and I can't trace into that function.

The conversions seem to take place inside that function.

I think "_read" or "_open" checks for the Unicode BOM (FF FE).

If the file is encoded as Unicode, '_read' translates "42 00 45 00" into "42 45".

I also tried _O_BINARY | _O_RDONLY.

That only affects the translation of "0D 0A" into "0A".

I tried 'fopen(..., "rb")' with 'fread()'; it didn't help.

I tried '_wopen' and '_wread'; that didn't help either.

So I will check 'setlocale()'.
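
To make the byte layout discussed here concrete: in little-endian UTF-16 each 16-bit code unit is stored low byte first, so "42 00" is the code unit 0x0042 ('B') and "FF FE" is the BOM 0xFEFF. The following sketch is not from the thread; it assumes the bytes were read without any translation, `utf16le_to_units` is a hypothetical name, and surrogate pairs are simply left as two separate code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine pairs of bytes (low byte first) into 16-bit code units.
   Writes at most 'cap' units and returns how many were written.
   A trailing odd byte, if any, is ignored. */
static size_t utf16le_to_units(const unsigned char *bytes, size_t len,
                               uint16_t *units, size_t cap)
{
    size_t i, n = 0;

    assert(bytes != NULL && units != NULL);

    for (i = 0; i + 1 < len && n < cap; i += 2)
        units[n++] = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));
    return n;
}
```

Applied to the twelve bytes from the original question, this yields six code units: the BOM followed by the letters B, E, G, I, N.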



Re: How to read Unicode encoded files? by Alex

Alex
Fri Jul 27 05:22:06 CDT 2007

"Anddy" wrote:
>> Why don't you read it in binary mode (_O_BINARY)? That way all
>> translations should be skipped.
>>
>>
> _O_RDONLY | _O_BINARY doesn't help with this problem.
>
> That only affects the translation of "0D 0A" into "0A". That's all.
>
> I already tried it.

Show us your code. `_O_BINARY' is exactly what you need. I
tried the following code and it works perfectly:

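#include <fcntl.h>  /* _O_RDONLY, _O_BINARY */
#include <io.h>     /* _open, _lseek, _read, _close */
#include <stdio.h>  /* SEEK_SET, SEEK_END */

int main(void)
{
    const char* buffer = "C:\\Temp\\test.txt";
    __int8 mem[100] = { 0 };
    int fd = -1;

    if ((fd = _open(buffer, _O_RDONLY | _O_BINARY)) != -1)
    {
        long len = _lseek(fd, 0L, SEEK_END);
        _lseek(fd, 0L, SEEK_SET);

        /* Do not read more bytes than 'mem' can hold. */
        if (len > (long)sizeof(mem))
            len = (long)sizeof(mem);

        if (len > 0)
            _read(fd, mem, (unsigned int)len);

        // ...

        _close(fd);
    }
    return 0;
}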


Alex
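
One way to check what actually ended up in a buffer such as `mem` is to print it as hex bytes in the same form used earlier in the thread. The sketch below is an illustration only; `format_hex` is a hypothetical helper, not something the posters used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format 'len' bytes as space-separated uppercase hex into 'out'.
   Returns the number of bytes formatted, which is smaller than 'len'
   if 'out' is too small to hold all of them. */
static size_t format_hex(const unsigned char *data, size_t len,
                         char *out, size_t out_size)
{
    size_t i, pos = 0;

    assert(data != NULL && out != NULL && out_size > 0);

    out[0] = '\0';
    for (i = 0; i < len; ++i) {
        /* Two hex digits, a separating space after the first byte,
           and room for the terminating NUL. */
        size_t need = (i == 0) ? 3 : 4;
        if (out_size - pos < need)
            break;
        pos += (size_t)sprintf(out + pos, i == 0 ? "%02X" : " %02X",
                               (unsigned int)data[i]);
    }
    return i;
}
```

Formatting the buffer right after the read makes it easy to see whether the FF FE prefix and the 00 bytes are present, or whether only 42 45 47 49 4E arrived.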