Innovative Web Design and Application Development News

September 15th, 2010

PDF Repair

The completely manual, last-ditch option for recovering a PDF

I was recently given a damaged PDF file consisting of scanned pages. Commercial repair tools recovered only the first page. I managed to find one way to get a little further. It is aimed at advanced users and is presented here in case it helps others.

First of all, I have dealt with damaged PDFs a few times, and if you need the text content of a PDF, this method will not work: grab one of the many commercial PDF repair tools and try that. I have a few I like to use, and in general they recover a lot, but no automated approach is going to get you 100% of the way there...

Tools Needed

Photoshop (or perhaps another image editing program)
TextPad (or another good text/hex editor)

When you scan an image into a PDF, most of the time it is stored as a JPEG within the PDF. Knowing that, I figured I would start by changing the file extension from .pdf to .jpg and see if the file opened in Photoshop... It did, but only the first page. So now the tricky bit...

I opened the PDF in TextPad and started scanning down the file, searching for "endobj". Basically, you will see constructs that look like:

13 0 obj
<< /Type /XObject /Subtype /Image /Width 1275 /Height 1649
/BitsPerComponent 8 /ColorSpace /DeviceRGB
/Filter /DCTDecode /Length 239648 >>

followed by a bunch of binary data. The /Filter /DCTDecode entry means the data is JPEG-compressed, and if you look closely at the first line of the binary data you'll see "JFIF" (the identifier in a JPEG/JFIF file header).
The endobj marks the end of the previous object, and the 13 0 obj starts a new one.
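
If you would rather not hunt for these objects by eye, a short script can list them. The following is a minimal Python sketch, not part of the procedure I actually used. The names find_jpeg_objects and _int_after are made up for illustration, and it assumes simple, non-nested dictionaries in which /Width, /Height and /Length are written as plain numbers, as in the example above.

```python
import re

# Matches an object header such as "13 0 obj" followed by its << ... >> dictionary.
# Assumption: the dictionary contains no nested << >> pairs.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*<<(.*?)>>", re.DOTALL)


def _int_after(body: bytes, key: bytes):
    """Return the integer following /key in a dictionary body, or None."""
    hit = re.search(rb"/" + key + rb"\s+(\d+)", body)
    return int(hit.group(1)) if hit else None


def find_jpeg_objects(data: bytes):
    """Return (object number, width, height, length) for each DCTDecode object."""
    found = []
    for m in OBJ_RE.finditer(data):
        body = m.group(3)
        if b"/DCTDecode" not in body:
            continue  # not a JPEG-compressed stream
        found.append((
            int(m.group(1)),
            _int_after(body, b"Width"),
            _int_after(body, b"Height"),
            # Assumption: /Length is a direct number, not a reference like "14 0 R".
            _int_after(body, b"Length"),
        ))
    return found
```

Read the file in binary mode (for example with open(path, "rb").read()) and pass the bytes to find_jpeg_objects; each tuple tells you which object holds a page image and how large it claims to be.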
So the method goes like this:

Working on a copy of the file, remove everything from the top of the file down to and including the next endobj, then save. Open the result in Photoshop (hopefully getting the next page) and save it as a JPG. Repeat for each subsequent image.

In the case of my file I only got one and a half more pages before the file abruptly ended, but that was better than nothing. I truly hope this helps others. Granted, it's an edge case for repairing a PDF file and will only get the images out of a PDF, but as it was for me, something is better than nothing.
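
The same idea can be scripted instead of repeating the cut-and-save cycle by hand. This is only a rough Python sketch under stated assumptions, not the method I used: the names carve_jpegs and dump_jpegs are invented here, it looks for the standard JPEG start marker (bytes FF D8 FF) rather than for endobj, and it assumes the images carry no embedded JPEG thumbnails and that only plain PDF syntax sits between one image and the next.

```python
from pathlib import Path

SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker plus first byte of the next marker
EOI = b"\xff\xd9"      # JPEG end-of-image marker


def carve_jpegs(data: bytes):
    """Return one byte string per JPEG stream found in data."""
    images = []
    pos = data.find(SOI)
    while pos != -1:
        nxt = data.find(SOI, pos + len(SOI))
        # Everything from this start marker up to the next one (or end of file).
        segment = data[pos:nxt] if nxt != -1 else data[pos:]
        end = segment.rfind(EOI)
        if end != -1:
            # Trim the trailing PDF syntax after the last end-of-image marker.
            images.append(segment[:end + len(EOI)])
        else:
            # No end marker: probably a truncated page, keep what is there.
            images.append(segment)
        pos = nxt
    return images


def dump_jpegs(pdf_path: str, out_dir: str = ".") -> int:
    """Write each carved image to out_dir as page_001.jpg, page_002.jpg, ..."""
    images = carve_jpegs(Path(pdf_path).read_bytes())
    for i, img in enumerate(images, start=1):
        Path(out_dir, f"page_{i:03d}.jpg").write_bytes(img)
    return len(images)
```

Calling dump_jpegs on a copy of the damaged file writes one .jpg per image found. A page that was cut off mid-stream is still written out, so an image editor may be able to show the part that survived.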
