Terminally Incoherent

Utterly random, incoherent and disjointed rants and ramblings...

Saturday, April 22, 2006

Text Dumping PDF files

The other day I got a request to convert a PDF file into a text file or something that could be imported into Excel. The file was essentially some big accounting mumbo-jumbo full of numbers arranged in columns with fancy headings. There were over 200 pages of it.

Now the easiest thing to do was to use the Windows version of Adobe Acrobat and simply save the file as .txt. But of course, that knocked out all the white space. All the columns ran into each other and the file looked like crap. There was no way you could do anything useful with it.

Of course my Linux PDF reader (acroread) did not have the "Save as Text" option, so the first place I turned to was the nifty Linux app pdftotext.

pdftotext bigstupidfile.pdf

This gives you a quick text dump which is roughly equivalent to the built-in Acrobat save behavior. But fortunately pdftotext has all kinds of nifty features. If you want to preserve the whitespace and layout details you should do:

pdftotext -layout -eol dos bigstupidfile.pdf

The -eol dos bit is there to specify the end-of-line style. Remember, I'm on a Unix box converting this file for a Windows dude who will want to import this stuff into Excel.

Needless to say, the trick worked perfectly. The columns were preserved and the file looked great. So whenever you need to convert some PDF data into text I highly recommend using the -layout option.
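If the text dump is headed for Excel anyway, a small script can turn the -layout output into CSV. The Python below is only a rough sketch, not something pdftotext does for you: the function names are made up for illustration, and it assumes that columns are separated by runs of two or more spaces, which is a heuristic that will misfire when a cell is empty or when two columns sit only one space apart.

```python
import csv
import io
import re

# Assumption: in -layout output, columns are separated by two or more spaces.
# This is a heuristic for illustration, not a guarantee made by pdftotext.
COLUMN_GAP = re.compile(r" {2,}")


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells (hypothetical helper)."""
    rows = []
    for line in text.splitlines():  # also splits on the form feeds between pages
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Render rows as CSV text with DOS line endings, ready for Excel."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\r\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample data standing in for a real text dump.
    sample = (
        "Account Name      Debit      Credit\r\n"
        "Office Supplies   120.50       0.00\r\n"
    )
    print(rows_to_csv(layout_text_to_rows(sample)), end="")
```

Single spaces inside a cell (like "Office Supplies") survive because only wider gaps count as column breaks. For a real report it is worth eyeballing a few pages of the result, since an empty cell collapses into the neighboring gap and shifts the remaining columns left.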
