PDFsharp - moved to http://forum.pdfsharp.net/

peteratoce · Joined: 20 Feb 2007 Posts: 5

It would be most welcome if the library could compress images (as opposed to reducing their resolution, which is only sometimes appropriate).
Here are the results of some tests I did:
I started with a 100 page TIF file (A4, resolution 1200 dpi). BTW, such high resolution is absolutely necessary when a scanned document is to be printed on an offset press.

First, I opened the TIF in Acrobat (V 7) and saved as PDF. The file size barely grew, from 30.507.933 Bytes to 30.561.981 Bytes.

Then I used PDFsharp to do the equivalent (TIF acquired through System.Drawing.Image.FromFile, each page passed to PDFsharp through XImage.FromGdiPlusImage and then inserted in the output PDF with XGraphics.DrawImage). The conversion took about four times as long, and the resultant file size was 100.594.087 Bytes, i.e. more than three times as much.

Another consideration is the amount of memory needed during conversion. My understanding is that all newly created PDF pages have to be kept in memory by PDFsharp, until they are finally saved to file. My first test, done with a similar TIF file, but with 1012 pages in it, ran into an OutOfMemoryException. I expect that pages with compressed images on them would need far less memory during processing.
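As a rough back-of-the-envelope illustration of the memory concern (not a measurement of PDFsharp), the Python sketch below estimates the uncompressed size of a single A4 page at 1200 dpi. It assumes the scan is decoded to a raw bitmap, either at 1 bit per pixel or expanded to 24-bit RGB. The function name and the byte-aligned row padding are assumptions made for this example only.

```python
import math

MM_PER_INCH = 25.4

def raw_bitmap_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Estimate the uncompressed size of a page bitmap in bytes.

    Assumption: each row is padded to a whole number of bytes; real
    decoders may pad rows differently (e.g. to 4-byte boundaries).
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px

# A4 is 210 mm x 297 mm.
bilevel = raw_bitmap_bytes(210, 297, 1200, 1)
rgb = raw_bitmap_bytes(210, 297, 1200, 24)

print(f"1 bpp : {bilevel / 1e6:.1f} MB per page")
print(f"24 bpp: {rgb / 1e6:.1f} MB per page")
```

Multiplied by 1012 pages, even the 1 bpp figure comes to many gigabytes if nothing is compressed, so how much of each image is retained in memory, and in what form, matters a great deal.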

Thomas Hoevel · Joined: 16 Oct 2006 Posts: 387 Location: Cologne, Germany

I cannot explain why the file is so much bigger.

Have you tried a release build? The debug build by default produces "verbose" PDF files that are bigger.

Images in the PDF file use lossless LZ compression (except for JPEG images - those are copied byte by byte into the PDF file).
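To illustrate what "lossless" means here, a minimal Python sketch using the standard zlib module (Deflate, an LZ77-based scheme). That the "LZ compression" mentioned above corresponds to Deflate is an assumption of this example, and the scanline data is synthetic, not taken from a real scan.

```python
import zlib

# Synthetic 1-bit-per-pixel "page": mostly white rows (0x00 bytes)
# with a short black run (0xFF bytes) in every tenth row.
ROW_BYTES = 1241   # hypothetical row width in bytes
ROWS = 1000        # hypothetical number of rows

rows = []
for y in range(ROWS):
    row = bytearray(ROW_BYTES)
    if y % 10 == 0:
        row[100:140] = b"\xff" * 40
    rows.append(bytes(row))
raw = b"".join(rows)

compressed = zlib.compress(raw, 9)
restored = zlib.decompress(compressed)

# Lossless: decompressing yields exactly the original bytes.
assert restored == raw
print(len(raw), "->", len(compressed), "bytes")
```

How well a general-purpose scheme like this does on real scanned pages depends heavily on the content; the synthetic data above is far more regular than an actual scan.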

Not sure if the verbose mode can account for a factor of 3 - I don't expect that.

I'd like to know which image format and compression were used for the TIFF file. If it was JPEG or CCITT/FAX, then this could be the reason - PDFsharp uses the standard LZ compression, but other methods may be better for your scanned image.
Or maybe the image got converted to 24 bit RGB - this could explain the factor of 3.

PDFsharp does not read the image files itself - it relies on GDI+ to read them; the 8-bit-to-24-bit conversion could occur here.

Long story short: we do compress image data. I'd like to know what happens there.

BTW: all pages are kept in memory. With 1000 scanned pages this really could be a problem, but for most applications this approach is appropriate.
_________________
Regards
Thomas Hoevel
PDFsharp Team

peteratoce · Joined: 20 Feb 2007 Posts: 5

Differences in file size really seem to be caused by differing compression schemes:
A 100 page TIF (CCITT G4): 30.507.933 Bytes,
the same TIF (LZW): 100.349.200 Bytes.

PDFsharp created a file of size 100.521.556 Bytes from the G4, so the result is consistent.
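The comparison can be checked with a few lines of Python. This is only arithmetic on the sizes quoted in this post; the helper name `parse_bytes` is made up for the example.

```python
def parse_bytes(text):
    """Parse a byte count written with '.' as thousands separator."""
    return int(text.replace(".", ""))

g4_tiff = parse_bytes("30.507.933")        # TIF, CCITT G4
lzw_tiff = parse_bytes("100.349.200")      # same TIF, LZW
pdfsharp_pdf = parse_bytes("100.521.556")  # PDF created by PDFsharp

lzw_vs_g4 = lzw_tiff / g4_tiff
pdf_vs_lzw = pdfsharp_pdf / lzw_tiff

print(f"LZW TIFF / G4 TIFF     : {lzw_vs_g4:.2f}")
print(f"PDFsharp PDF / LZW TIFF: {pdf_vs_lzw:.3f}")
```

The first ratio is a little over 3, while the second is very close to 1, which is what "consistent" refers to: the PDFsharp output is about the size of the LZW-compressed TIF.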

I wish somebody (perhaps a knowledgeable user?) would turn his/her attention to image import and export in the library, including questions of different (= optimal) compression schemes for differing content types! From my experience I can say that GDI+ as an intermediate would have to go, though...

And, it would be nice to have more control over memory allocation, creation of temporary files or whatever is necessary to successfully process really large files.

Peter