PDFsharp - moved to http://forum.pdfsharp.net/ Forum Index
Please visit the new PDFsharp forum at http://forum.pdfsharp.net/
 

accessing text in a pdf document

 
gary



Joined: 25 Feb 2007
Posts: 1

Posted: Mon Feb 26, 2007 10:00 pm    Post subject: accessing text in a pdf document

Hi there all... I've tried creating a few test applications to answer this question, but cannot figure it out!

Can someone give me a simple example showing how to extract all the text in a pdf document into a single string? I would *greatly* appreciate any help you can provide!
aknuth



Joined: 23 Mar 2007
Posts: 16
Location: Berlin

Posted: Wed May 16, 2007 5:46 pm

Hello,
This is a very dirty solution, but it shows one way to get what you want. You do have to handle the encoding properly: the example assumes that the PDF text is encoded in the system's default encoding.

It extracts text from the first page only.
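To make the encoding caveat concrete, here is a minimal sketch that decodes the same bytes with two different encodings. The class name `EncodingDemo`, the helper `Decode`, and the sample bytes are invented for illustration. Note also that in a real PDF the string bytes are character codes interpreted through the font's encoding, so no single system encoding is guaranteed to be right.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Decodes raw bytes with an explicitly named encoding instead of
    // relying on Encoding.Default, which differs between systems.
    public static string Decode(byte[] raw, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }

    static void Main()
    {
        // Hypothetical bytes: "(André)" as stored in ISO-8859-1.
        byte[] raw = { 0x28, 0x41, 0x6E, 0x64, 0x72, 0xE9, 0x29 };

        // Decoded with the encoding the bytes were written in.
        Console.WriteLine(Decode(raw, "iso-8859-1"));

        // Decoded as UTF-8: the byte 0xE9 is not valid here, so the
        // accented character does not survive.
        Console.WriteLine(Decode(raw, "utf-8"));
    }
}
```

If the output of the extraction code contains wrong or missing accented characters, the choice of encoding in the `GetString` call is the first thing to check.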

Code:
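using System.Diagnostics;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

// Matches a text operator (Tw, Td, Tc, Tm or T*) followed by a "(...)"
// literal string and the Tj (show text) operator on the same line.
// Only the "(...)" form is captured in the "text" group.
string pdfTextRegexp = @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj";
Regex textRegex = new Regex(pdfTextRegexp, RegexOptions.Compiled);

// "file" is the path to the PDF document.
PdfDocument r = PdfReader.Open(file);
PdfContents contents = r.Pages[0].Contents;   // first page only
foreach (PdfReference o in contents.Elements) {
   PdfContent c = o.Value as PdfContent;
   if (c != null) {
      // Assumes the stream bytes are in the default system encoding.
      string content = Encoding.Default.GetString(c.Stream.Value);
      using (StringReader sr = new StringReader(content)) {
         string line;
         while ((line = sr.ReadLine()) != null) {
            Match m = textRegex.Match(line);
            if (m.Success) {
               Debug.WriteLine(m.Groups["text"].Value);
            }
         }
      }
   }
}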
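To show what the pattern does and does not pick up, and how to collect the matches into a single string as asked in the first post, here is a self-contained sketch that runs the same regular expression over an in-memory string. It does not touch the PDFsharp API; the class name `TextOperatorDemo`, the method `ExtractText`, and the sample content stream are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class TextOperatorDemo
{
    // Same pattern as in the post: a text operator (Tw, Td, Tc, Tm or T*),
    // then a "(...)" literal string captured as "text", then the Tj operator.
    static readonly Regex TextRegex = new Regex(
        @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj");

    // Scans the content line by line and joins the first match of each
    // matching line into one space-separated string.
    public static string ExtractText(string content)
    {
        var result = new StringBuilder();
        using (var reader = new StringReader(content))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Match m = TextRegex.Match(line);
                if (m.Success)
                {
                    if (result.Length > 0) result.Append(' ');
                    result.Append(m.Groups["text"].Value);
                }
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        // Hypothetical content stream fragment, invented for illustration.
        string sample =
            "BT\n" +
            "/F1 12 Tf\n" +
            "72 700 Td (Hello) Tj\n" +
            "0 -14 Td (World) Tj\n" +
            "(skipped) Tj\n" +
            "ET\n";

        Console.WriteLine(ExtractText(sample)); // prints "Hello World"
    }
}
```

The line `(skipped) Tj` is not matched because the pattern requires one of the operators Tw, Td, Tc, Tm or T* before the string on the same line. Content streams that put each operator on its own line, or that show text with other operators, will therefore yield less text than expected.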


If anyone has a better solution, ideally one using the PDFsharp API, please contribute.
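As one small step toward something more robust, still without the PDFsharp API: PDF literal strings may contain backslash escapes such as `\(`, `\)`, `\\` and octal codes, and the pattern above cannot capture in full a string that contains an escaped parenthesis. The sketch below uses a slightly different pattern and resolves the escapes. The names `LiteralStringDemo`, `Unescape` and `FirstText` and the sample line are invented for illustration; unescaped nested parentheses are still not handled, and octal codes are mapped to characters as if they were Latin-1, which is an assumption.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class LiteralStringDemo
{
    // A "(...)" literal followed by Tj. The "\\." branch keeps escaped
    // characters such as "\)" inside the capture.
    static readonly Regex ShowText = new Regex(
        @"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Singleline);

    // Resolves the backslash escapes of a PDF literal string.
    public static string Unescape(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c != '\\' || i + 1 >= s.Length) { sb.Append(c); continue; }
            char e = s[++i];
            if (e >= '0' && e <= '7')
            {
                // Octal character code: one to three digits.
                int code = e - '0';
                int digits = 1;
                while (digits < 3 && i + 1 < s.Length && s[i + 1] >= '0' && s[i + 1] <= '7')
                {
                    code = code * 8 + (s[++i] - '0');
                    digits++;
                }
                // Assumption: treat the byte value as a Latin-1 code point.
                sb.Append((char)(code & 0xFF));
                continue;
            }
            switch (e)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\n': break; // backslash at end of line: continuation
                case '\r': if (i + 1 < s.Length && s[i + 1] == '\n') i++; break;
                default: sb.Append(e); break; // "\(", "\)", "\\", unknown escapes
            }
        }
        return sb.ToString();
    }

    // Returns the unescaped text of the first "(...) Tj" in the line, or null.
    public static string FirstText(string line)
    {
        Match m = ShowText.Match(line);
        return m.Success ? Unescape(m.Groups["text"].Value) : null;
    }

    static void Main()
    {
        // Hypothetical line, invented for illustration; \053 is octal for '+'.
        Console.WriteLine(FirstText(@"72 700 Td (f\(x\) = 1\0532) Tj")); // prints "f(x) = 1+2"
    }
}
```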

Regards,
André
All times are GMT