Hi all,
I have the following problem: several thousand HTML files were encoded
as binary, so they are not properly accessible.
I wrote (or rather cut & pasted) a simple VBS that opens each HTM file
in a folder as a binary stream and rewrites it as text.
This works fine on the "problem" files.
Now the new problem is: how can I tell whether an HTML file is binary or
not? Since the files in my folders are mixed, some good HTML and some
not (binary), I obviously can't go by the extension. Is there
something that would work this way:
for each file in my directory, check if it's binary; if so, go on with
the cool stuff, else move to the next.
I'm going to read more, but if someone could help, or even suggest
another approach, I'd really appreciate it.

Thanks!

Re: How to batch convert binary files to "Text" by mayayana

mayayana
Fri Apr 06 09:30:59 CDT 2007

Do you maybe mean unicode rather than binary?
All files are "binary" in the sense that they're
composed of a series of bytes. A text file is just
one where all the bytes correspond to characters.
Ascii text uses 1 byte per character (at least in
the US and Europe) while unicode text uses 2
bytes. For example, if you look at a text file in a hex
editor that starts with the word "file", an ascii
version will start with the bytes 102-105-108-101
or (hex) 66-69-6C-65. In English those correspond
to f-i-l-e. The unicode version would be:
66-00-69-00-6C-00-65-00

Unicode is representing each character as a 16-bit
numeric value rather than an 8-bit value.

If you need to convert unicode to ascii you can do it
by opening and resaving the file using Textstream.
(Note the extra parameters in the Textstream
methods that allow you to choose between unicode
and ascii.)
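
A minimal sketch of the open-and-resave idea above, assuming the source
file really is Unicode (UTF-16) as Windows writes it. The two paths are
hypothetical placeholders. Note that the write may fail with an error if
the text contains characters that the ANSI code page cannot represent.
---------------------------
Option Explicit
Const ForReading = 1
Const TristateTrue = -1   ' open the file as Unicode

' Hypothetical paths, for illustration only.
Const SRC_PATH = "C:\temp\page.htm"
Const DST_PATH = "C:\temp\page_ansi.htm"

Dim fso, ts, s
Set fso = CreateObject("Scripting.FileSystemObject")

' Read the whole file, decoding it as Unicode.
s = ""
Set ts = fso.OpenTextFile(SRC_PATH, ForReading, False, TristateTrue)
If Not ts.AtEndOfStream Then s = ts.ReadAll
ts.Close

' CreateTextFile(path, overwrite, unicode): unicode = False writes ANSI.
Set ts = fso.CreateTextFile(DST_PATH, True, False)
ts.Write s
ts.Close
Set ts = Nothing
Set fso = Nothing
---------------------------
Writing to a separate output file keeps the original intact until the
result has been checked.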

If you're not talking about unicode then maybe it's
some sort of encryption? In that case you'd need to
figure out what sort of encryption.

Another possibility: If you download from a Unix server
you might find that you have an HTML file with some
squares in it but no carriage returns. In that case it's
because Unix uses a bare linefeed as its line ending. You can fix
that with the following. (Paste this into Notepad, save it as
a VBS file, and drop your distorted file onto it):
---------------------------
Dim fso, ts, s, arg, s1
Set fso = CreateObject("Scripting.FileSystemObject")
If WScript.Arguments.Count = 0 Then
  ' A string literal cannot span lines in VBScript, so join the pieces.
  arg = InputBox("This script will correct web server text that " & _
      "lacks carriage returns. Enter path of file.", "Fix File", _
      "C:\Windows\Desktop\")
Else
  arg = WScript.Arguments.Item(0)
End If
If fso.FileExists(arg) = False Then
  MsgBox "Wrong path", 64, "No such file"
  WScript.Quit
End If
'-------- got the file. read it into s. --------

Set ts = fso.OpenTextFile(arg, 1, False)
s = ts.ReadAll
ts.Close
Set ts = Nothing

'-------- normalize every line ending (CrLf, Lf or Cr) to vbCrLf --------
' Collapse CrLf and Lf to Cr first so that existing CrLf pairs
' are not doubled, then expand every Cr to CrLf.

s1 = Replace(s, vbCrLf, vbCr, 1, -1, 0)
s1 = Replace(s1, vbLf, vbCr, 1, -1, 0)
s1 = Replace(s1, vbCr, vbCrLf, 1, -1, 0)

'-------- write file, replacing the original --------
If fso.FileExists(arg) = True Then
  fso.DeleteFile arg, True
End If
Set ts = fso.CreateTextFile(arg, True)
ts.Write s1
ts.Close
Set ts = Nothing
Set fso = Nothing
MsgBox "All done", 64, "File fixed"
--------------------

If it's carriage return problems that you're talking
about, you could check for a distorted file by
opening the file, reading it as a string, and checking
to see whether it contains CrLf combinations.
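
A rough sketch of that check, written as a drop-target script like the
one above. It is a slight variation on the idea: instead of only looking
for CrLf, it strips the CrLf pairs and then looks for any leftover bare
Cr or Lf. The function name NeedsLineEndingFix is made up for this
example.
---------------------------
Option Explicit

' Returns True when the text contains a Cr or Lf that is not
' part of a CrLf pair.
Function NeedsLineEndingFix(s)
  Dim stripped
  stripped = Replace(s, vbCrLf, "")
  NeedsLineEndingFix = (InStr(stripped, vbLf) > 0) Or _
                       (InStr(stripped, vbCr) > 0)
End Function

Dim fso, ts, s
If WScript.Arguments.Count = 0 Then WScript.Quit

Set fso = CreateObject("Scripting.FileSystemObject")
' WScript.Arguments.Item(0) is the file dropped onto the script.
s = ""
Set ts = fso.OpenTextFile(WScript.Arguments.Item(0), 1, False)
If Not ts.AtEndOfStream Then s = ts.ReadAll
ts.Close

If NeedsLineEndingFix(s) Then
  MsgBox "File has bare Cr or Lf line endings", 64, "Needs fixing"
Else
  MsgBox "Line endings look fine", 64, "OK"
End If
---------------------------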





Re: How to batch convert binary files to "Text" by Paul

Paul
Fri Apr 06 10:39:24 CDT 2007


"mayayana" <mayaXXyana1a@mindXXspring.com> wrote in message
news:DQsRh.20087$tD2.3100@newsread1.news.pas.earthlink.net...
> Do you maybe mean unicode rather than binary?
> All files are "binary" in the sense that they're
> composed of a series of bytes. A text file is just
> one where all the bytes correspond to characters.
> Ascii text uses 1 byte per character (at least in
> the US and Europe) while unicode text uses 2
> bytes. For example, if you look at a text file in a hex
> editor that starts with the word "file", an an ascii
> version will start with the bytes 102-105-108-101
> or (hex) 66-69-6C-65. In English those correspond
> to f-i-l-e. The unicode version would be:
> 66-00-69-00-6C-00-65-00

Actually, the Unicode file version would start with a BOM (Byte Order Mark),
hex FFFE, so the file would contain:
FF-FE-66-00-69-00-6C-00-65-00

Html files are often encoded UTF-8 which looks strange in a text editor
because each UTF-8 character is represented by one or more bytes. Html
files tell you what encoding is used with a tag like this typically within
the <head> ... </head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">. The
Charset parameter tells the browser how to interpret the HTML stream.
Charset and encoding mean the same thing.

-Paul Randall
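
For the original question (picking out the problem files in a mixed
folder), the BOM gives one possible test. The following is only a sketch
under a few assumptions: the folder path is a hypothetical placeholder,
the function name BomEncoding is made up, and it relies on the
ADODB.Stream object being available, which is the case on a typical
Windows install. A file with no BOM can still be Unicode, so an empty
result does not prove a file is plain ANSI.
---------------------------
Option Explicit
Const adTypeBinary = 1

' Hypothetical folder, for illustration only.
Const FOLDER_PATH = "C:\temp\html"

' Returns "UTF-16LE", "UTF-16BE", "UTF-8" or "" (no BOM found).
Function BomEncoding(path)
  Dim stm, bytes, b1, b2, b3
  BomEncoding = ""
  Set stm = CreateObject("ADODB.Stream")
  stm.Type = adTypeBinary
  stm.Open
  stm.LoadFromFile path
  If stm.Size >= 2 Then
    bytes = stm.Read(3)   ' returns only 2 bytes for a 2-byte file
    b1 = AscB(MidB(bytes, 1, 1))
    b2 = AscB(MidB(bytes, 2, 1))
    b3 = -1
    If LenB(bytes) >= 3 Then b3 = AscB(MidB(bytes, 3, 1))
    If b1 = &HFF And b2 = &HFE Then
      BomEncoding = "UTF-16LE"
    ElseIf b1 = &HFE And b2 = &HFF Then
      BomEncoding = "UTF-16BE"
    ElseIf b1 = &HEF And b2 = &HBB And b3 = &HBF Then
      BomEncoding = "UTF-8"
    End If
  End If
  stm.Close
End Function

Dim fso, f, ext, enc
Set fso = CreateObject("Scripting.FileSystemObject")
For Each f In fso.GetFolder(FOLDER_PATH).Files
  ext = LCase(fso.GetExtensionName(f.Name))
  If ext = "htm" Or ext = "html" Then
    enc = BomEncoding(f.Path)
    If enc = "UTF-16LE" Or enc = "UTF-16BE" Then
      WScript.Echo f.Path & " : " & enc & " (candidate for conversion)"
    End If
  End If
Next
---------------------------
Run it with cscript.exe rather than wscript.exe, otherwise every
WScript.Echo line pops up as its own message box.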




Re: How to batch convert binary files to "Text" by mayayana

mayayana
Fri Apr 06 14:13:35 CDT 2007

> > bytes. For example, if you look at a text file in a hex
> > editor that starts with the word "file", an an ascii
> > version will start with the bytes 102-105-108-101
> > or (hex) 66-69-6C-65. In English those correspond
> > to f-i-l-e. The unicode version would be:
> > 66-00-69-00-6C-00-65-00
>
> Actually, the Unicode file version would start with a BOM (Byte Order Mark),
> hex FFFE, so the file would contain:
> FF-FE-66-00-69-00-6C-00-65-00
>
Thanks. I didn't know that. I assume that's
indicating Little Endian and that a similar
file on an older Mac would start with FE-FF?
Or is that a Windows-only thing?

> Html files are often encoded UTF-8 which looks strange in a text editor
> because each UTF-8 character is represented by one or more bytes.

So there's still another possibility of what the
OP means by "binary", assuming that he/she
is not defining HTML tags as "binary data". :)




Re: How to batch convert binary files to "Text" by Paul

Paul
Fri Apr 06 20:55:57 CDT 2007


"mayayana" <mayaXXyana1a@mindXXspring.com> wrote in message
news:zZwRh.211$3P3.57@newsread3.news.pas.earthlink.net...
>> > bytes. For example, if you look at a text file in a hex
>> > editor that starts with the word "file", an an ascii
>> > version will start with the bytes 102-105-108-101
>> > or (hex) 66-69-6C-65. In English those correspond
>> > to f-i-l-e. The unicode version would be:
>> > 66-00-69-00-6C-00-65-00
>>
>> Actually, the Unicode file version would start with a BOM (Byte Order Mark),
>> hex FFFE, so the file would contain:
>> FF-FE-66-00-69-00-6C-00-65-00
>>
> Thanks. I didn't know that. I assume that's
> indicating Little Endian and that a similar
> file on a older Mac would start with FE-FF?
> Or is that a Windows-only thing?

As I recall, older Macs and PCs have opposite byte orders for 16-bit values,
but I can never remember which one is Big Endian and which is Little Endian.
I don't know if it is just a Microsoft standard to start the file with a
BOM, but there has to be something to tell the OS whether the file is Ansi
or Unicode. You can prove that Windows puts in the BOM on WXP with Notepad.
Create a small text file with Notepad and save it both as Ansi and as
Unicode, then view the files with a hex editor to see the difference.
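
If no hex editor is handy, a small drop-target script can show the
leading bytes instead. This is only a sketch: the function name
HexPrefix and the 16-byte limit are arbitrary choices for the example,
and it assumes the ADODB.Stream object is available.
---------------------------
Option Explicit
Const adTypeBinary = 1
Const MAX_BYTES = 16   ' how many leading bytes to show (arbitrary)

' Returns the first bytes of the file as dash-separated hex pairs.
Function HexPrefix(path)
  Dim stm, bytes, i, out
  out = ""
  Set stm = CreateObject("ADODB.Stream")
  stm.Type = adTypeBinary
  stm.Open
  stm.LoadFromFile path
  If stm.Size > 0 Then
    bytes = stm.Read(MAX_BYTES)
    For i = 1 To LenB(bytes)
      If i > 1 Then out = out & "-"
      ' Pad single-digit hex values with a leading zero.
      out = out & Right("0" & Hex(AscB(MidB(bytes, i, 1))), 2)
    Next
  End If
  stm.Close
  HexPrefix = out
End Function

If WScript.Arguments.Count > 0 Then
  MsgBox HexPrefix(WScript.Arguments.Item(0)), 64, "First bytes"
End If
---------------------------
Dropping the Unicode copy of a Notepad file containing the word "file"
onto this script should show the byte sequence quoted earlier in the
thread, FF-FE-66-00-69-00-6C-00-65-00.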

-Paul Randall

>> Html files are often encoded UTF-8 which looks strange in a text editor
>> because each UTF-8 character is represented by one or more bytes.
>
> So there's still another possibility of what the
> OP means by "binary", assuming that he/she
> is not defining HTML tags as "binary data". :)