Re: How to read Unicode encoded files? by Anddy
Anddy
Fri Jul 27 00:18:02 CDT 2007
"Ulrich Eckhardt" wrote:
> Anddy wrote:
> > I try to read unicode encoded files.
>
> Unicode (capital 'U') is not an encoding but a whole standard that defines
> several encodings. Keep that in mind!
>
> > File starts with unicode BOM (0xFEFF).
> >
> > Here's the file content.
> >
> > FF FE 42 00 45 00 47 00 49 00 4E 00
>
> Okay, this looks like little-endian UTF-16 or UCS2, both defined by the
> Unicode standard. If I had the choice, I would prefer UTF-8 though.
>
> > And I use following code.
> >
> > if ((fd = _open(buffer, _O_RDONLY)) != -1)
> > {
> > while (_read(fd,&mem, 1) == 1)
> > ;
> > _close(fd);
> > }
> >
> > When I check 'mem', its contents are:
> >
> > 42 45 47 49 4E
> >
> > Why does this happen?
> >
> > How can I read Unicode BOM (0xFEFF)?
>
> It might be the case that you are screwed by a locale-specific conversion
> performed by the C implementations of VC8[1]. It sees that there is a BOM
> and then transparently transcodes the file to the internally used charset.
> In that case, I suggest that you use the 'binary' flag when opening the
> file (which you should do anyway) and maybe invoke a 'setlocale(LC_ALL, "C");' or
> something like that to set the locale to neutral.
>
> Uli
>
> [1] I hope I get this point right, I don't exactly remember what and where
> these conversions took place.
>
>
I think your answer will help with my question.
I tried to trace into the "_read" function.
"_read" calls "ReadFile", and I can't trace into that function.
I suppose the conversion takes place in there.
I think "_read" or "_open" checks for the Unicode BOM (FF FE).
If the file is encoded as Unicode, "_read" translates "42 00 45 00" into "42 45".
I also tried _O_BINARY | _O_RDONLY.
That only seems to affect the translation of "0D 0A" into "0A".
I tried 'fopen(..., "rb")' with 'fread()'; it didn't help.
I tried '_wopen' and '_wread'; that didn't help either.
So I will look into 'setlocale()' next.
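
For comparison, here is a minimal sketch of the byte handling one would expect for the file dumped above, assuming it is little-endian UTF-16 with an FF FE BOM. It uses only standard 'fopen'/'fread' in binary mode, so it does not explain the VC8 behaviour described in this thread; it only shows what an untranslated read should produce. The function name 'read_utf16le' is hypothetical and not part of any library.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not a CRT function): reads up to max_units
 * UTF-16LE code units from 'path' into 'out'.
 * Returns the number of code units stored, or -1 if the file cannot
 * be opened or does not start with the FF FE byte order mark. */
long read_utf16le(const char *path, unsigned short *out, size_t max_units)
{
    unsigned char pair[2];
    size_t n = 0;
    FILE *fp = fopen(path, "rb");   /* "b": ask for no text-mode translation */

    if (fp == NULL)
        return -1;

    /* U+FEFF is stored as the bytes FF FE in little-endian order. */
    if (fread(pair, 1, 2, fp) != 2 || pair[0] != 0xFF || pair[1] != 0xFE) {
        fclose(fp);
        return -1;
    }

    /* Combine each low byte / high byte pair into one 16-bit code unit. */
    while (n < max_units && fread(pair, 1, 2, fp) == 2) {
        out[n++] = (unsigned short)(pair[0] | (pair[1] << 8));
    }

    fclose(fp);
    return (long)n;
}
```

With the bytes "FF FE 42 00 45 00 47 00 49 00 4E 00" this should return 5 and store 0x0042, 0x0045, 0x0047, 0x0049, 0x004E. If a read in binary mode yields "42 45 47 49 4E" instead, something between the file and the buffer is still converting the data.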