[plug] [OT] PDF Conversions

Mark O'Shea mark at musicalstoat.co.uk
Thu Apr 7 18:48:18 WST 2005


On Thu, 2005-04-07 at 10:28 +0800, Timothy White wrote:
> I'm doing some experiments with PS and PDF. Knowing that PS is text I
> can edit different aspects of the documents by hand or through scripts
> without having to change formats. But PDF is binary (encapsulated?) so I
> can't do the same. If I convert the PS to PDF then the text remains text
> and I can select it and extract it with pdftotext (which is xpdf doing
> the work.) But if I try and convert the document back to PS the text is
> no longer text and the file becomes like an image. The same happens for
> documents that start as PDF with extractable text.
> What is worse is I can extract the text from the PDF but the text isn't
> retained if I convert from PDF to PS.
> So:
> $pdftotext document.pdf
> Gives the contents in ascii
> $pdftops document.pdf
> Gives a plain 'image'
> $pdf2ps document.pdf
> Also gives a plain image (note that pdf2ps is Ghost Script while pdftops
> is Xpdf)
> 
Hi Tim,

Well, a PDF isn't entirely binary, but some parts of it can be.  If you
open one in a text editor you will probably see something like:
17 0 obj
<</Length 18 0 R/Filter /FlateDecode>>
stream
*****binary data here
endstream
endobj

This means that the data in that object is a stream which has to be
decoded using the filter called FlateDecode, which is just zlib
compression, so inflate() from zlib will decode it.  (There are a few
other filters, but I think this one is popular because, unlike LZW, it
isn't encumbered by patents.)  So you can actually write a parser that
extracts the text from a PDF.
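
As a rough illustration of that idea, here is a minimal sketch in Python
using its standard zlib module.  The language, the function name
inflate_streams and the regular expression are my own assumptions for the
example, not anything from xpdf or Ghostscript.  It does not parse the
PDF properly: it ignores /Length and the cross-reference table, and it
simply tries to inflate every stream and skips the ones that fail.

```python
import re
import zlib

# Simplified pattern: everything between "stream<EOL>" and "endstream".
# A real parser would honour the /Length entry instead of searching.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated contents of every stream that zlib can decode."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj() tolerates the end-of-line bytes that sit
            # between the compressed data and the "endstream" keyword.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (uncompressed, or a different filter): skip it.
            continue
        if data:
            decoded.append(data)
    return decoded
```

Calling inflate_streams(open("document.pdf", "rb").read()) should give
you the page content streams among the results.  In those, the text
usually shows up as strings in parentheses in front of operators such
as Tj, although with some fonts the strings are glyph codes rather than
readable ASCII.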

But you could also use the ready-made utilities that you mention to
convert it to text or PostScript.  My experience is that the utility
from Xpdf (pdftops) creates a PostScript file with the text in it.  I
haven't seen it produce images, but that only means it never has for
me, not that it can't.  Have you checked the man page to see whether
you need any extra options for your purposes?  The utility that relies
on Ghostscript (pdf2ps) does, in my experience, produce images, so I
probably wouldn't use it for this.

Does that give you any useful information?  Check out the Adobe website
for more information on the PDF specification (1.whatever).

Regards,
-- 
Mark O'Shea



