PDFsharp - moved to http://forum.pdfsharp.net/
Please visit the new PDFsharp forum at http://forum.pdfsharp.net/
Reading PDF contents?

Megidolaon



Joined: 19 Aug 2008
Posts: 1

Posted: Tue Aug 19, 2008 10:54 am    Post subject: Reading PDF contents?

Hello, I've just started using PDFsharp and I was wondering how you can read the content of a PDF.

I tried looping through the Pages.Elements property of the PdfDocument class, but I get an error that DictionaryEntry cannot be converted to type DictionaryElements.

Alternatively, I tried the PdfContent object returned by the CreateSingleContent method of a PdfPage, but all I get is a handful of cryptic values (something like "7 0 R", "120 B" or such) as the entire content of a PDF that contains text and a table with at least 50 values.
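
For context: in PDF syntax, a value such as "7 0 R" is an indirect reference (object number 7, generation 0) that points at another object in the file. It is not page text, which is why dumping raw dictionary values produces tokens like this. Below is a minimal, self-contained sketch of recognizing such a token. The names IndirectRefDemo and TryParseReference are made up for illustration and are not part of PDFsharp.

```csharp
using System;
using System.Text.RegularExpressions;

static class IndirectRefDemo
{
    // Matches "<object number> <generation number> R", e.g. "7 0 R".
    static readonly Regex RefPattern = new Regex(@"^\s*(\d+)\s+(\d+)\s+R\s*$");

    // Hypothetical helper (not a PDFsharp API): returns true and fills in the
    // two numbers if the token is an indirect reference, false otherwise.
    public static bool TryParseReference(string token, out int objectNumber, out int generation)
    {
        objectNumber = 0;
        generation = 0;
        Match m = RefPattern.Match(token ?? "");
        if (!m.Success)
            return false;
        return int.TryParse(m.Groups[1].Value, out objectNumber)
            && int.TryParse(m.Groups[2].Value, out generation);
    }
}
```

To get at the real data, such a reference has to be resolved to the object it points to rather than printed.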

Also, is there a difference between reading normal text and the contents of a table?

Thanks in advance.
gkataria



Joined: 20 Aug 2008
Posts: 3

Posted: Wed Aug 20, 2008 11:26 am

I was able to get the images of a page with the code below, but I am still unable to find the text.

Put the code below in any click event handler:

// Needs: using PdfSharp.Pdf; using PdfSharp.Pdf.Advanced; using PdfSharp.Pdf.IO;
PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

int imageCount = 0;
// Iterate pages
foreach (PdfPage page in document.Pages)
{
    // Get resources dictionary
    PdfDictionary resources = page.Elements.GetDictionary("/Resources");
    if (resources != null)
    {
        // Get external objects dictionary
        PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
        if (xObjects != null)
        {
            // Iterate references to external objects
            // (enumerating Values directly avoids depending on its exact collection type)
            foreach (PdfItem item in xObjects.Elements.Values)
            {
                PdfReference reference = item as PdfReference;
                if (reference != null)
                {
                    // Resolve the indirect reference to the object it points to
                    PdfDictionary xObject = reference.Value as PdfDictionary;
                    // Is external object an image?
                    if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                    {
                        imageCount++;
                        ExportImage(xObject, imageCount);
                    }
                }
            }
        }
    }
}


The following functions are used:

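/// <summary>
/// Currently extracts only JPEG images.
/// </summary>
static void ExportImage(PdfDictionary image, int count)
{
    // The /Filter entry names the encoding of the image stream.
    string filter = image.Elements.GetName("/Filter");
    switch (filter)
    {
        case "/DCTDecode":
            // DCT-encoded data is JPEG data.
            ExportJpegImage(image, count);
            break;

        case "/FlateDecode":
            // ExportAsPngImage is not included in this post, so the call is
            // commented out and Flate-encoded images are skipped.
            // ExportAsPngImage(image, count);
            break;
    }
}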

/// <summary>
/// Exports a JPEG image.
/// </summary>
static void ExportJpegImage(PdfDictionary image, int count)
{
    // Fortunately JPEG has native support in PDF and exporting an image is just writing the stream to a file.
    // Needs: using System.IO;
    byte[] stream = image.Stream.Value;
    File.WriteAllBytes("C:\\poc_image_" + count.ToString() + ".jpeg", stream);
}
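
Since the JPEG export simply writes the stream bytes to disk, a quick sanity check before writing can catch streams that are not plain JPEG data after all. The sketch below only tests for the JPEG Start Of Image marker (bytes 0xFF 0xD8). JpegCheck and LooksLikeJpeg are hypothetical names for illustration, not part of PDFsharp.

```csharp
using System;

static class JpegCheck
{
    // Hypothetical helper (not a PDFsharp API).
    // A JPEG file begins with the Start Of Image marker: 0xFF 0xD8.
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null
            && data.Length >= 2
            && data[0] == 0xFF
            && data[1] == 0xD8;
    }
}
```

One possible use, under the same assumption, is to call LooksLikeJpeg(stream) in ExportJpegImage and skip the file write when it returns false.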
blackjack2150



Joined: 21 Aug 2008
Posts: 5

Posted: Thu Aug 21, 2008 7:41 am

Hi. For text extraction you can use the PDFBox library. To use it from .NET you also have to add a reference to IKVM to your project. An easy solution is Text Mining Tool (which uses PDFBox). Just google it.
gkataria



Joined: 20 Aug 2008
Posts: 3

Posted: Tue Aug 26, 2008 12:56 pm

But I actually need to find the position of each text and image object as well.
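
Positions are not stored next to the text as simple fields. In a page's content stream, text is shown with operators such as Tj, and its position comes from earlier operators such as Td (move by an offset) and Tm (set the text matrix). The following is only a rough sketch of that idea, not PDFsharp code and not a complete parser. It assumes the content stream has already been decoded to plain text, and it ignores the current transformation matrix, font metrics, TJ arrays, hex strings, nested parentheses and escape decoding. Image placement, which is set by a cm operator before Do, is not handled either. TextRun, ContentStreamSketch and FindTextRuns are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// One shown string and the text-space origin in effect when it was shown.
sealed class TextRun
{
    public string Text;
    public double X;
    public double Y;
}

static class ContentStreamSketch
{
    // Parses a number operand; returns 0 if the operand is not numeric.
    static double Num(string s)
    {
        double v;
        return double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out v) ? v : 0;
    }

    // Hypothetical helper: scans a decoded content stream and records each
    // "(string) Tj" together with the position accumulated from Td / Tm.
    public static List<TextRun> FindTextRuns(string content)
    {
        var runs = new List<TextRun>();
        var operands = new List<string>(); // operands seen since the last operator
        double x = 0, y = 0;
        int i = 0;
        while (i < content.Length)
        {
            char c = content[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            if (c == '(')
            {
                // Literal string: read up to the next unescaped ')'.
                var sb = new StringBuilder();
                i++;
                while (i < content.Length && content[i] != ')')
                {
                    // Skip the backslash and keep the escaped character as-is.
                    if (content[i] == '\\' && i + 1 < content.Length) i++;
                    sb.Append(content[i]);
                    i++;
                }
                i++; // skip the closing ')'
                operands.Add(sb.ToString());
                continue;
            }

            // Any other token runs until whitespace or the start of a string.
            int start = i;
            while (i < content.Length && !char.IsWhiteSpace(content[i]) && content[i] != '(') i++;
            string token = content.Substring(start, i - start);

            double ignored;
            bool isOperand = token.StartsWith("/", StringComparison.Ordinal)
                || double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out ignored);
            if (isOperand) { operands.Add(token); continue; }

            int n = operands.Count;
            switch (token)
            {
                case "BT": // begin text object: position starts at the origin
                    x = 0; y = 0;
                    break;
                case "Td": // tx ty Td: move by an offset
                    if (n >= 2) { x += Num(operands[n - 2]); y += Num(operands[n - 1]); }
                    break;
                case "Tm": // a b c d e f Tm: e and f give the new origin
                    if (n >= 6) { x = Num(operands[n - 2]); y = Num(operands[n - 1]); }
                    break;
                case "Tj": // (string) Tj: show text at the current position
                    if (n >= 1) runs.Add(new TextRun { Text = operands[n - 1], X = x, Y = y });
                    break;
            }
            operands.Clear();
        }
        return runs;
    }
}
```

For the input "BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td (World) Tj ET" this yields two runs, "Hello" at (72, 700) and "World" at (72, 686). These are text-space values and match page coordinates only when no other transformation is in effect.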