gasi
Joined: 07 Oct 2008 Posts: 1
Posted: Tue Oct 07, 2008 2:36 pm Post subject: Read Text of a PDF-File
Hi,
I've been using PDFsharp for a short time. I'm trying to read all the text of a PDF file, for example headlines and body text, but I haven't found a way to do this.
I tried using PdfDictionary to navigate some objects (e.g. "/MediaBox", "/XObject"), but without success.
Can somebody give me some advice, for example which classes and methods I should use?
Thanks.
PeterGillespie
Joined: 14 Oct 2008 Posts: 8 Location: England
Posted: Fri Nov 07, 2008 10:36 am
You should probably look at MigraDoc to accomplish this. I would imagine the steps are:
Load your PDF into a MigraDoc Document object.
You can then iterate through each section within it. (I have not tried importing a pre-created PDF file into MigraDoc, so I'm not sure how this works.)
Assuming you get this far, you can then iterate through each element within the section, which would look something like:
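Code:
// "Section" is assumed to refer to a MigraDoc section that has already been obtained.
List<string> allText = new List<string>();
foreach (DocumentObject element in Section.Elements)
{
    // Keep only the Text elements and collect their content.
    if (element is MigraDoc.DocumentObjectModel.Text)
    {
        MigraDoc.DocumentObjectModel.Text textObj =
            (MigraDoc.DocumentObjectModel.Text)element;
        allText.Add(textObj.Content);
    }
}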
marihanzo
Joined: 17 Mar 2009 Posts: 2
Posted: Tue Mar 17, 2009 4:26 pm
Unfortunately, I wasn't able to apply this solution in my context, so I implemented another one that parses the PDF content at a low level.
My solution has been posted here:
http://pdfsharp.s3.bizhat.com/viewtopic.php?p=1603#1603
I hope this helps.
Enjoy it!
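For readers who cannot follow the link, here is a minimal sketch of what low-level parsing of PDF content can look like. It is not the code from the linked post. It assumes the page's content stream has already been obtained and decompressed into a string by some other means, and it only collects literal strings shown with the Tj and TJ operators. The names ContentTextSketch, ExtractText and ReadLiteralString are made up for illustration and are not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of PDFsharp or the linked post.
public static class ContentTextSketch
{
    // Returns one entry per Tj/TJ operator found in an uncompressed content stream.
    public static List<string> ExtractText(string content)
    {
        var shown = new List<string>();
        var pending = new StringBuilder(); // string operands seen since the last operator
        int i = 0;
        while (i < content.Length)
        {
            char c = content[i];
            if (c == '(')
            {
                pending.Append(ReadLiteralString(content, ref i));
            }
            else if (c == '/')
            {
                // Skip a name operand such as /F1 so it is not mistaken for an operator.
                i++;
                while (i < content.Length && !char.IsWhiteSpace(content[i])
                       && "()<>[]{}/%".IndexOf(content[i]) < 0) i++;
            }
            else if (c == '<')
            {
                // Hex strings and inline dictionaries are skipped in this sketch.
                while (i < content.Length && content[i] != '>') i++;
            }
            else if (char.IsLetter(c))
            {
                int start = i;
                while (i < content.Length && char.IsLetter(content[i])) i++;
                string op = content.Substring(start, i - start);
                if ((op == "Tj" || op == "TJ") && pending.Length > 0)
                    shown.Add(pending.ToString());
                pending.Clear(); // operands belong to the operator that follows them
            }
            else
            {
                i++; // numbers, brackets, whitespace: not needed here
            }
        }
        return shown;
    }

    // Reads a literal string starting at s[i] == '(' and leaves i just past its closing ')'.
    private static string ReadLiteralString(string s, ref int i)
    {
        var sb = new StringBuilder();
        int depth = 1;
        i++; // skip the opening parenthesis
        while (i < s.Length)
        {
            char c = s[i++];
            if (c == '\\' && i < s.Length)
            {
                char e = s[i++];
                switch (e)
                {
                    case 'n': sb.Append('\n'); break;
                    case 'r': sb.Append('\r'); break;
                    case 't': sb.Append('\t'); break;
                    default: sb.Append(e); break; // \( \) \\ ; octal escapes are not decoded
                }
            }
            else if (c == '(') { depth++; sb.Append(c); }
            else if (c == ')')
            {
                if (--depth == 0) break;
                sb.Append(c);
            }
            else sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling ContentTextSketch.ExtractText("BT /F1 12 Tf (Hello) Tj ET") returns a list with the single entry "Hello". Keep in mind that many real PDFs use font-specific encodings, so the strings recovered this way are not always readable text without also decoding them through the font.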