Re: Extracting text from PDF documents for virtual printer driver by Chris
Chris
Fri Mar 28 08:29:57 CDT 2008
On Mar 27, 11:21 pm, Tim Roberts <t...@probo.com> wrote:
> Chris <christopher.bu...@gmail.com> wrote:
>
> >I'm developing a virtual printer driver, and it handles most documents
> >well. Where it fails is with PDF documents containing East Asian
> >characters. I've found some other posts which mention that Adobe
> >provides glyph indices, rather than Unicode characters. How do I get
> >the actual text in this situation?
>
> It depends on the font. Non-TrueType fonts, and some TrueType fonts, do
> not use Unicode encoding. You need to chase down the font in use to learn
> what encoding it uses.
> --
> Tim Roberts, t...@probo.com
> Providenza & Boekelheide, Inc.
Thanks, Tim. Actually, I just realized that Adobe provides glyph
indices for all fonts, so my current solution is really just a hack.
Are you saying there's no font-agnostic way to extract text from a PDF
file?
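
For reference, here is a rough sketch of what "chasing down the font"
could look like once the font's character map is in hand. It is only an
illustration, not a known working solution: the cmap dictionary (Unicode
code point -> glyph index) is a hypothetical input that would have to be
read from the font itself, and the function names are made up for this
example. The idea is to invert that map and look each glyph index up in
it. Glyphs with no entry (ligatures, or glyphs in a subset font that
lacks a usable map) come back as U+FFFD, and when several code points
share one glyph the sketch simply picks the lowest one, which may not be
the character the document intended.

```python
from collections import defaultdict


def invert_cmap(cmap):
    """Invert a code point -> glyph index map into glyph index -> code points.

    `cmap` is a hypothetical dict assumed to come from the font in use.
    Several code points may share one glyph, so each value is a sorted list.
    """
    reverse = defaultdict(list)
    for codepoint, glyph in cmap.items():
        reverse[glyph].append(codepoint)
    for codepoints in reverse.values():
        codepoints.sort()
    return dict(reverse)


def glyphs_to_text(glyphs, reverse, unknown="\ufffd"):
    """Best-effort conversion of glyph indices back to text.

    Picks the lowest code point when a glyph is ambiguous and emits
    `unknown` for glyphs that have no entry in the inverted map.
    """
    out = []
    for glyph in glyphs:
        codepoints = reverse.get(glyph)
        out.append(chr(codepoints[0]) if codepoints else unknown)
    return "".join(out)
```

Because the lookup depends entirely on the map of the specific font,
this is per-font rather than font-agnostic.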