PDFsharp - moved to http://forum.pdfsharp.net/

luizpapa · Joined: 08 Nov 2007 Posts: 2

Hi,

Is it possible to extract text from a PDF file?

Even better, I would like to extract the text from a specific area of a page rather than from the entire file.

I am trying to do this with PDFsharp, but I have not found a way to do it.

TIA,
Luiz Papa

dgalloway · Joined: 09 Nov 2007 Posts: 1

I have been trying to do that, too. I have been able to use the ContentReader to read a page and loop through all of the CObjects in it, but I couldn't figure out how to display the content of an object or how to determine whether it contains any text.

Dave Galloway
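
For anyone stuck at the same point, here is a minimal sketch of one way to get at the text inside those content objects. It assumes the text is drawn with the Tj and TJ operators and simply collects their string operands. The class name PdfTextSketch, the helper CollectStrings and the file name "input.pdf" are made up for illustration. The strings come back as stored in the content stream, so with fonts that use a custom encoding the result may not be readable, and no position information is kept, so this does not solve extraction by area.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

public static class PdfTextSketch   // hypothetical helper class
{
    // Parses the page's content stream and returns the string operands
    // of the text-showing operators Tj and TJ, in stream order.
    public static IEnumerable<string> ExtractText(PdfPage page)
    {
        CSequence content = ContentReader.ReadContent(page);
        return CollectStrings(content);
    }

    // Walks the content object tree recursively.
    private static IEnumerable<string> CollectStrings(CObject obj)
    {
        if (obj is COperator)
        {
            COperator op = (COperator)obj;
            // Only Tj and TJ are handled here; other operators are skipped.
            if (op.OpCode.Name == OpCodeName.Tj.ToString() ||
                op.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in CollectStrings(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence)
        {
            // Covers the page-level sequence and the arrays passed to TJ.
            foreach (CObject element in (CSequence)obj)
                foreach (string s in CollectStrings(element))
                    yield return s;
        }
        else if (obj is CString)
        {
            // Raw string value as stored in the content stream.
            yield return ((CString)obj).Value;
        }
    }

    public static void Main()
    {
        // "input.pdf" is a placeholder path.
        PdfDocument doc = PdfReader.Open("input.pdf", PdfDocumentOpenMode.ReadOnly);
        foreach (string s in ExtractText(doc.Pages[0]))
            Console.WriteLine(s);
    }
}
```

Each yielded string is one fragment, not necessarily a whole word or line, because a PDF may split text into many small pieces for kerning.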

luizpapa · Joined: 08 Nov 2007 Posts: 2

I think I will use PDFBox to do that.

The code below does exactly what I want. The only problem is that I have to add IKVM to my project references...

// Load the PDF (PDFBox is a Java library, called from C# through IKVM).
org.pdfbox.pdmodel.PDDocument doc = org.pdfbox.pdmodel.PDDocument.load(txtFile.Text);
try
{
    org.pdfbox.util.PDFTextStripperByArea stripper = new org.pdfbox.util.PDFTextStripperByArea();
    // Region to extract; x, y, width and height are defined elsewhere.
    java.awt.geom.Rectangle2D rect = new java.awt.geom.Rectangle2D.Double(x, y, width, height);
    stripper.addRegion("regiao1", rect);
    stripper.setSortByPosition(true);
    // Take the first page of the document.
    org.pdfbox.pdmodel.PDDocumentCatalog cat = doc.getDocumentCatalog();
    org.pdfbox.pdmodel.PDPageNode pn = cat.getPages();
    org.pdfbox.pdmodel.PDPage pag = pn.getKids().toArray()[0] as org.pdfbox.pdmodel.PDPage;
    stripper.extractRegions(pag);
    return stripper.getTextForRegion("regiao1");
}
finally
{
    doc.close(); // release the file even if extraction fails
}