alchemist
 Total Messages 5
|
In general I've been quite successful in extracting images by navigating the iText (http://itextpdf.com/) data structure, identifying the encoding, and applying the appropriate decoder to the PdfStream.
I have a slightly more complex situation where it appears the images are in a PDF form encoded with FlateDecode.
So I decode the FlateDecode data and end up with the following:
/GS5 gs
q
0 0 1 1 re
W n
/GS4 gs
0 J 0 j 4 M []0 d 0.0284 w
0 0 0 0.7 K
0.9539 0.5332 m
0.9891 0.5332 l
0.9891 0.5896 l
0.9697 0.6397 l
0.4126 0.6397 l
0.4126 0.337 l
0.9697 0.337 l
0.9912 0.3718 l
0.9912 0.4193 l
0.6447 0.4193 l
0.6792 0.5332 l
0.7494 0.5332 l
0.7494 0.4957 l
0.7814 0.4957 l
0.7814 0.545 l
0.8219 0.545 l
0.8219 0.4994 l
0.8557 0.4994 l
0.8557 0.5488 l
0.8982 0.5488 l
0.8982 0.4945 l
0.9315 0.4945 l
0.931 0.5327 l
s
0.7 0 0 0.45 k
0.0118 0.1781 m
0.0757 0.0868 0.1632 0.0305 0.2595 0.0305 c
0.356 0.0305 0.4433 0.0868 0.5073 0.1779 c
0.5073 0.8219 l
0.4433 0.9132 0.3559 0.9695 0.2595 0.9695 c
0.1631 0.9695 0.0757 0.9132 0.0118 0.822 c
0.0118 0.1842 l
f
0.0379 w
0 0 0 1 K
0.0118 0.1781 m
0.0757 0.0868 0.1632 0.0305 0.2595 0.0305 c
0.356 0.0305 0.4433 0.0868 0.5073 0.1779 c
0.5073 0.8219 l
0.4433 0.9132 0.3559 0.9695 0.2595 0.9695 c
0.1631 0.9695 0.0757 0.9132 0.0118 0.822 c
0.0118 0.1842 l
s
0 0 0 0.7 K
0.9776 0.5242 m
0.9776 0.4343 l
S
0 0 0 1 k
0.3677 0.5177 m
0.3677 0.3633 0.3193 0.2382 0.2597 0.2382 c
0.2001 0.2382 0.1517 0.3633 0.1517 0.5177 c
0.1517 0.672 0.2001 0.7972 0.2597 0.7972 c
0.3193 0.7972 0.3677 0.672 0.3677 0.5177 c
f
0.0265 w
0 0 0 0 K
0.3677 0.5177 m
0.3677 0.3633 0.3193 0.2382 0.2597 0.2382 c
0.2001 0.2382 0.1517 0.3633 0.1517 0.5177 c
0.1517 0.672 0.2001 0.7972 0.2597 0.7972 c
0.3193 0.7972 0.3677 0.672 0.3677 0.5177 c
s
0 0 0 0 k
0.2107 0.4173 m
0.2107 0.5479 l
0.3071 0.5479 l
0.3071 0.4173 l
0.3071 0.4152 l
0.3071 0.355 0.2855 0.3061 0.2589 0.3061 c
0.2323 0.3061 0.2107 0.355 0.2107 0.4152 c
f
0.231 0.5074 m
0.231 0.6125 l
0.231 0.6569 0.2449 0.693 0.2621 0.693 c
0.2793 0.693 0.2932 0.6569 0.2932 0.6125 c
0.3054 0.6125 l
0.3054 0.6743 0.286 0.7244 0.2621 0.7244 c
0.2382 0.7244 0.2188 0.6743 0.2188 0.6125 c
0.2188 0.5074 l
f
0 0 0 1 k
0.2733 0.4696 m
0.2733 0.4487 0.2667 0.4319 0.2587 0.4319 c
0.2506 0.4319 0.2441 0.4487 0.2441 0.4696 c
0.2441 0.4905 0.2506 0.5074 0.2587 0.5074 c
0.2667 0.5074 0.2733 0.4905 0.2733 0.4696 c
f
0.2695 0.3512 m
0.2478 0.3512 l
0.2516 0.4706 l
0.2664 0.4706 l
f
Q
1. Exactly what format is the above? Is it raw Ghostscript? I can't seem to read it with anything (GSview, Acrobat, etc.).
2. If it is an EPS (or maybe it's something else?), what do I need to do to turn this into a viewable image?
| Posted: 15 Nov 2010 11:04 PM |
|
|
| |
aandi
 Total Messages 17064
|
Please see the PDF Reference. You cannot do this stuff by guessing!!
| Posted: 15 Nov 2010 11:05 PM |
|
|
| |
alchemist
 Total Messages 5
|
Can you please be a little more verbose? I do not understand.
| Posted: 15 Nov 2010 11:06 PM |
|
|
| |
aandi
 Total Messages 17064
|
This information is all clearly explained in the PDF Reference. Your question suggests you have not read this document. But it is absolutely necessary for your task to read the first chapters (as far as the chapter "Graphics").
| Posted: 15 Nov 2010 11:12 PM |
|
|
| |
aandi
 Total Messages 17064
|
If I am wrong, and you have read the document, perhaps we can help you find out where you have gone wrong, because this is not an image. What makes you believe it is an image?
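One rough way to check a decoded stream is to tally its operators and see which families they fall into. The sketch below is only an illustration: the operator groupings follow the PDF Reference, but the tokenizer is deliberately naive (it splits on whitespace and ignores strings, comments and inline image data), and the names `tally_operators`, `summarize` and `SAMPLE` are made up for this example.

```python
from collections import Counter

# Operator names as listed in the PDF Reference.
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re"}            # path construction
PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}  # path painting
XOBJECT_OPS = {"Do", "BI", "ID", "EI"}  # paint an XObject / inline image

# A few lines copied from the decoded stream in the first post.
SAMPLE = """/GS5 gs
q
0 0 1 1 re
W n
0 0 0 0.7 K
0.9539 0.5332 m
0.9891 0.5332 l
s
Q"""


def is_number(token):
    """Return True if the token parses as a numeric operand."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def tally_operators(content):
    """Count operator tokens, skipping numbers, /Names and array operands."""
    counts = Counter()
    for token in content.split():
        if is_number(token) or token.startswith(("/", "[")):
            continue
        counts[token] += 1
    return counts


def summarize(counts):
    """Group the operator counts into the three families above."""
    return {
        "path": sum(counts[op] for op in PATH_OPS),
        "paint": sum(counts[op] for op in PAINT_OPS),
        "xobject_or_inline_image": sum(counts[op] for op in XOBJECT_OPS),
    }


if __name__ == "__main__":
    print(summarize(tally_operators(SAMPLE)))
```

A stream made up only of path construction and painting operators, with no `Do` and no `BI`/`ID`/`EI`, is vector drawing rather than sampled image data. A `Do` would still need a lookup in the resources, since it can paint either an image or another form XObject.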
| Posted: 15 Nov 2010 11:17 PM |
|
|
| |
alchemist
 Total Messages 5
|
Okay, thanks... you are absolutely correct, I haven't read this document. Where can I download or get a copy of the "PDF Reference"? I assume it's produced by Adobe?
| Posted: 15 Nov 2010 11:20 PM |
|
|
| |
alchemist
 Total Messages 5
|
...is this the one?
http://www.adobe.com/devnet/pdf/pdf_reference.html
| Posted: 15 Nov 2010 11:22 PM |
|
|
| |
aandi
 Total Messages 17064
|
Yes, that's it. It's now an ISO standard, but you can get it free from Adobe.
(This is very helpful of Adobe. Usually these open standards are expensive: 380 Swiss francs for ISO 32000-1:2008.)
| Posted: 15 Nov 2010 11:45 PM |
|
|
| |
alchemist
 Total Messages 5
|
It's also likely that I'm going completely down the wrong path too, so I'll explain my problem in more detail.
I am trying to extract the FPO low-res images that appear in a PDF document (those appearing with OPI comments).
For this particular document (I chose this one at random from the internet) I have problems:
http://media.wiley.com/product_data/excerpt/18/EHEP0000/EHEP000018-1.pdf
Now, I know the image exists because I can see it on the page, so the low-res image is there somewhere. Note that the F PdfName refers to the high-res image (which I don't care about for now).
This is the object which is identified (for page 3) as the OPI reference for the image. Normally it would have a reference to an XObject (which is the low-res image), but I can't find it, so I assumed it was in a form connected to this obj?
25 0 obj<>>>/Subtype/Form/Length 70/Filter/FlateDecode/Name/Fm20/Matrix[1 0 0 1 0 0]/Resources<>/ProcSet[/PDF]/ExtGState<>>>/Type/XObject/BBox[0 0 1332.0 783.0]/FormType 1>>stream
| Posted: 15 Nov 2010 11:54 PM (Originally Posted: 15 Nov 2010 11:50 PM) |
|
|
| |
aandi
 Total Messages 17064
|
It's easy to miss that an OPI entry can appear in either an image dictionary (what you expect) or a form XObject (what you find).
A form XObject is a general collection of page marking operators, the same ones found on the page itself, with its own resources. These resources can contain images, which will be rendered along with the vector art. They can contain nested form XObjects too, and the nesting can be deep. The nested forms can in turn carry their own OPI entries, and you may well find this case.
In general, then, an OPI dictionary is associated with some drawing on the page, which might be an image, or might use the full range of vector+image+text available in PDF. No image data is stored for this case (it isn't needed).
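The nesting described above can be walked recursively. The following is a minimal sketch, assuming the PDF objects have already been resolved into plain Python dicts keyed by name. That dict model, the function name `walk_xobjects` and the sample data are all hypothetical and are not the API of iText or any other library.

```python
def walk_xobjects(xobject, path=(), seen=None):
    """Yield (path, subtype, has_opi) for an XObject and everything nested in it.

    `xobject` is a plain dict standing in for a resolved XObject dictionary
    (an assumption of this sketch, not a real library type).
    """
    if seen is None:
        seen = set()
    if id(xobject) in seen:
        # Guard against reference cycles; an object that is shared between
        # several forms is therefore reported only once.
        return
    seen.add(id(xobject))

    subtype = xobject.get("Subtype")
    yield path, subtype, "OPI" in xobject

    if subtype == "Form":
        # A form XObject has its own resources, which may hold images
        # and further form XObjects.
        children = xobject.get("Resources", {}).get("XObject", {})
        for name, child in children.items():
            yield from walk_xobjects(child, path + (name,), seen)


if __name__ == "__main__":
    # Hypothetical structure: a form with an OPI entry that contains one
    # image and one nested (empty) form. The OPI value is a placeholder.
    form = {
        "Subtype": "Form",
        "OPI": {"1.3": {}},
        "Resources": {
            "XObject": {
                "Im0": {"Subtype": "Image", "Width": 100, "Height": 50},
                "Fm1": {"Subtype": "Form", "Resources": {}},
            }
        },
    }
    for entry in walk_xobjects(form):
        print(entry)
```

If a walk like this reports a form with an OPI entry but no image anywhere beneath it, the proxy is drawn purely with vector operators, which matches the case in this thread.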
| Posted: 16 Nov 2010 01:34 AM |
|
|
| |
|
|