PDFsharp - moved to http://forum.pdfsharp.net/ Forum Index
Please visit the new PDFsharp forum at http://forum.pdfsharp.net/
 

Split PDF based on text within the PDF

 
jigsaw



Joined: 02 Nov 2006
Posts: 1
Location: Australia

Posted: Thu Nov 02, 2006 9:06 pm    Post subject: Split PDF based on text within the PDF

Hi,
I was wondering if the PDFsharp object could be used to find text within a PDF file and retrieve the page number it was on. What I need to do is split a PDF file based upon finding some text.

For example, the top of every page has the text
[Customer:xxxxxxxx]
where xxxxxxxx is the customer name. When xxxxxxxx changes, I need to split the PDF. So a single PDF with 10 pages, 4 of which are for Customer X, 3 for Customer Y and 3 for Customer Z, would need to produce 3 files:
one for Customer X of four pages
one for Customer Y of three pages
one for Customer Z of three pages

It would also be great if the text search could use a regular expression.

Is this possible with PDFsharp?
Stefan Lange



Joined: 12 Oct 2006
Posts: 47
Location: Cologne, Germany

Posted: Thu Nov 02, 2006 10:04 pm    Post subject:

Hello,

The content of a PDF page is a sequence of bytes that represents graphical commands. These bytes are called the "content stream" of the page. You can get the uncompressed bytes with this code:
Code:
byte[] content = page.Contents.CreateSingleContent().Stream.UnfilteredValue;


A "Hello, World" page may look like this:
Code:
1 0 0 1 0 841.8898 cm
1 0 0 -1 0 0 cm
BT
-100 Tz
/F0 -10 Tf
1 0 0 1 70.8661 80.9199 Tm
-10 TL
(Hello)Tj
/F0 -10 Tf
1 0 0 1 99.4111 80.9199 Tm
(World!)Tj
ET


You can find
Code:
(Hello)
and
Code:
(World!)
as strings.

You should find
Code:
([Customer:xxxxxxxx])
in your PDF file. This is easy to parse. Try PdfSharp Explorer to analyse your PDF.
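
To illustrate how such a string could be picked out of the uncompressed content stream, here is a minimal sketch. It is not part of PDFsharp: the class CustomerTagFinder and its methods Decode, ShownStrings and FindCustomer are hypothetical names. It assumes the text is written as plain literal strings with the Tj operator and that the bytes can be read as ISO-8859-1; hex strings and custom font encodings are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class CustomerTagFinder
{
    // A PDF literal string directly followed by the Tj operator, e.g. "(Hello)Tj".
    // Escaped characters such as \( and \) are accepted; nested unescaped
    // parentheses and hex strings are not handled (assumption of this sketch).
    private static readonly Regex ShowText =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj");

    // The marker from the question; group 1 captures the customer name.
    private static readonly Regex CustomerTag =
        new Regex(@"\[Customer:([^\]]*)\]");

    // Turns the uncompressed content stream bytes into searchable text.
    public static string Decode(byte[] contentStream)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(contentStream);
    }

    // Returns the raw (still escaped) text of every string shown with Tj.
    public static List<string> ShownStrings(string content)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(content))
            result.Add(m.Groups[1].Value);
        return result;
    }

    // Returns the customer name of the first tag found, or null if there is none.
    public static string FindCustomer(string content)
    {
        foreach (string shown in ShownStrings(content))
        {
            Match tag = CustomerTag.Match(shown);
            if (tag.Success)
                return tag.Groups[1].Value;
        }
        return null;
    }
}
```

Because the tag is matched with a regular expression, the CustomerTag pattern can be replaced by any other expression that fits the real page header.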

But depending on the PDF producer application, you may find this instead:
Code:
[(H)42(e)32(l)37(l)33(o)]TJ

The numbers between the characters are kerning information (distance adjustments). Adobe Acrobat never creates this if you use a fixed-width font like Courier. Unfortunately, tools like FreePDF always create distance information, even if it is superfluous.
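
One way to cope with such kerned text is to glue the string fragments of each TJ array back together and ignore the numbers. The following is only a sketch under that assumption; the class KernedText and its method JoinedStrings are hypothetical names, not PDFsharp API. Large adjustments that a producer uses in place of a space character are not turned into spaces here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class KernedText
{
    // A literal string such as (H); escaped characters like \) are accepted.
    private const string Literal = @"\((?:\\.|[^\\()])*\)";

    // A number such as 42, -7 or 0.5; the atomic group avoids needless backtracking.
    private const string Number = @"(?>[-+]?(?:\d+\.?\d*|\.\d+))";

    // An array made only of strings and numbers, followed by the TJ operator.
    private static readonly Regex ShowArray =
        new Regex(@"\[(?:\s*(?:" + Literal + "|" + Number + @"))*\s*\]\s*TJ");

    // Captures the text between the parentheses of one literal string.
    private static readonly Regex LiteralBody =
        new Regex(@"\(((?:\\.|[^\\()])*)\)");

    // Returns one joined string per TJ array, with the kerning numbers dropped.
    public static List<string> JoinedStrings(string content)
    {
        var result = new List<string>();
        foreach (Match array in ShowArray.Matches(content))
        {
            var text = new StringBuilder();
            foreach (Match part in LiteralBody.Matches(array.Value))
                text.Append(part.Groups[1].Value);
            result.Add(text.ToString());
        }
        return result;
    }
}
```

For example, KernedText.JoinedStrings("[(H)42(e)32(l)37(l)33(o)]TJ") yields a single entry, "Hello", which can then be searched like an unkerned string.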

We at empira currently have the same problem: identifying address information in PDF files and splitting the files into single documents. We recommend using Adobe Acrobat as the producer and Courier New as the font for the information text.
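
Once a customer name is known for each page, the split itself comes down to grouping consecutive pages with the same name. Here is a minimal sketch of that grouping step. PageRun and SplitPlanner are hypothetical names, and treating a page without a tag (null) as a continuation of the previous customer is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: one output file, described as a run of consecutive pages.
public sealed class PageRun
{
    public string Customer;
    public int FirstPage;   // zero-based index into the source document
    public int PageCount;
}

// Hypothetical helper, not part of PDFsharp.
public static class SplitPlanner
{
    // customerPerPage[i] is the customer name found on page i, or null if none was found.
    public static List<PageRun> Plan(IList<string> customerPerPage)
    {
        var runs = new List<PageRun>();
        for (int i = 0; i < customerPerPage.Count; i++)
        {
            string customer = customerPerPage[i];
            PageRun last = runs.Count > 0 ? runs[runs.Count - 1] : null;

            // Assumption: a page without a tag stays with the current run.
            if (last != null && (customer == null || customer == last.Customer))
                last.PageCount++;
            else
                runs.Add(new PageRun { Customer = customer, FirstPage = i, PageCount = 1 });
        }
        return runs;
    }
}
```

For a 10-page document with the names X, X, X, X, Y, Y, Y, Z, Z, Z this gives three runs of 4, 3 and 3 pages. Each run can then be written to its own file by copying the pages FirstPage to FirstPage + PageCount - 1 into a new document.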

Furthermore, I wrote the class PdfSharp.Pdf.Content.ContentReader to convert a content stream into a sequence of operations (it is in the current source code). Maybe this reader helps you find your text.

I will publish our solution if we have one (currently we are working on other things).

Regards
Stefan Lange
All times are GMT