[plug] [OT] PDF Conversions

Fri Apr 8 03:06:25 WST 2005

Timothy White wrote:

> Ok, my mistake. pdftops does give the correct ps. Unfortunately it still
> doesn't help. It appears that the text is still in some obscure format.

It's more likely that it's simply filled with lots of positional markup. 
PostScript, and PDF even more so, are strongly presentation-oriented 
formats (though anybody who's seen the "game of life" PostScript program 
knows which is more flexible). If a word needs to be split into a 
series of separate characters to get the positioning right, say because 
of custom kerning, it will be. You can almost never rely on finding a 
simple string in PostScript or (suitably decoded) PDF. Even if you go 
from PS -> PDF -> PS, chances are the final PS will be more convoluted 
than the original, and there isn't much you can do about this.
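
To make the positioning point concrete, here's a minimal Python sketch. 
It is not a PostScript or PDF parser; it assumes a made-up model in which 
each text-drawing operation has already been reduced to an (x, y, string) 
tuple. The coordinates, the sample strings and both function names are 
hypothetical, chosen only for illustration.

```python
# Hypothetical model: each tuple is one text-drawing operation,
# (x, y, string), roughly what a "move here, show this string" pair does.
# The numbers are invented for the example.
fragments = [
    (72.0, 700.0, "PDF "),
    (98.4, 700.0, "Con"),
    (117.1, 700.0, "v"),        # positioned on its own, e.g. for kerning
    (122.6, 700.0, "ersions"),
]


def found_in_any_fragment(fragments, needle):
    """Naive search: look for the needle inside each drawn string."""
    return any(needle in text for _, _, text in fragments)


def found_after_joining(fragments, needle):
    """Join the strings in drawing order, then search the result."""
    return needle in "".join(text for _, _, text in fragments)


print(found_in_any_fragment(fragments, "Conversions"))  # False
print(found_after_joining(fragments, "Conversions"))    # True
```

Joining only works here because the fragments happen to be in reading 
order and carry their own spaces. Real output need not do either.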

Here's my non-expert and oversimplified attempt at an explanation. 
Bear in mind that PDF is (mostly) a strictly linear format, unlike 
PostScript. If you have PostScript input like:

   This is one column of           This is a second column of text.
   text. In PostScript it          it can also often be described
   can often be described          PostScript as a largely separate
   as a continuous body            data stream.
   with formatting info.

then the PostScript is likely to contain all the first column's text 
then all the second column's text, much like you'd read them. This is 
dependent on the application generating the PostScript, of course, but 
is reasonably likely. If you then make a PDF of that PostScript, you'll 
usually get the two columns interleaved line-by-line to preserve 
linearity, e.g.:

   This is one column of
   This is a second column of text.
   text. In PostScript it
   it can also often be described
   can often be described
   PostScript as a largely separate
   as a continuous body
   data stream.
   with formatting info.

making a right mess of things. It'd be hard enough to match "column of 
text" reliably in the first version, when you consider the insertion of 
various markup etc. In the second one, it'd require some *really* funky 
logic to reliably get right.
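
As a rough illustration of the interleaving, here's a Python sketch that 
takes the two example columns as plain lists of lines, emits them once 
column-by-column and once line-by-line, and searches both for a phrase 
that spans a line break in the first column. The function names and the 
chosen phrase are my own, and real converters are far messier than this.

```python
from itertools import zip_longest

# The two example columns, one list entry per line.
col1 = [
    "This is one column of",
    "text. In PostScript it",
    "can often be described",
    "as a continuous body",
    "with formatting info.",
]
col2 = [
    "This is a second column of text.",
    "it can also often be described",
    "PostScript as a largely separate",
    "data stream.",
]


def column_order(left, right):
    """All of the first column, then all of the second."""
    return " ".join(left + right)


def interleaved_order(left, right):
    """Alternate lines from each column, top to bottom."""
    lines = []
    for l, r in zip_longest(left, right):
        lines.extend(x for x in (l, r) if x is not None)
    return " ".join(lines)


phrase = "continuous body with formatting"
print(phrase in column_order(col1, col2))       # True
print(phrase in interleaved_order(col1, col2))  # False
```

The phrase survives in column order, but once the lines are interleaved 
the second column's "data stream." lands in the middle of it.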

Now, if you convert that PDF back to PostScript, the converter can't 
tell how the columns were laid out (barring the use of PDF articles, 
which the PDF->PS conversion is unlikely to use anyway, and which only 
address this particular problem, not the general class of problems with 
PS->PDF->PS conversion). So you'll probably get mangled-looking, overly 
convoluted PostScript with things in a different order from the original.

This is only one of the *many* issues involved.

The only tools I know of that can really significantly edit PDF are 
Adobe Acrobat Professional ($$), which is still quite limited, and 
PitStop Professional, which can perform major PDF surgery and make major 
changes relatively easily, but is even more expensive than Acrobat and 
works as a plugin for Acrobat (so you need both). I'm not aware of any 
decent PDF editing tools that can be downloaded for free, let alone any 
open source ones (though I'd be glad to be informed if there are any I 
don't know of).

I'm a little confused as to why you want to do this - is there no 
possibility of simply modifying the data before export/conversion to PDF?

My personal advice - and remember that I'm no expert in this myself - 
would be to find a simpler, nicer way if at all possible. Failing that, 
you may want to spend some time looking over the (extremely extensive) 
PDF and PostScript documentation. Who knows, perhaps you can come up 
with a new and useful OSS tool for manipulating PDF...

--
Craig Ringer