luizpapa
Joined: 08 Nov 2007 Posts: 2
Posted: Thu Nov 08, 2007 8:04 pm Post subject: Extracting text from pdf
Hi,
Is it possible to extract text from a PDF file?
Better still, I would like to extract the text from one area of a page rather than from the entire file.
I am trying to do this with PDFsharp, but I have not found a way.
TIA,
Luiz Papa
dgalloway
Joined: 09 Nov 2007 Posts: 1
Posted: Fri Nov 09, 2007 2:33 pm Post subject: Extracting Text from PDF
I have been trying to do that, too. I was able to use the ContentReader to read a page and loop through all of the CObjects in it, but I couldn't figure out how to display an object's content or how to tell whether it contains any text.
Dave Galloway
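For what it's worth, text on a PDF page is drawn by the text-showing operators (Tj, TJ, ' and "), and their operands are the strings. So telling whether a parsed object carries text comes down to looking for those operators and reading the string operands in front of them. Below is a minimal sketch of that idea in plain Java, not PDFsharp code. It assumes an already-decompressed content stream that only uses simple literal strings with Tj; the class and method names are made up for illustration, and real files also need hex strings, TJ arrays, escapes, and font encodings handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjScanner {
    // Matches a literal string "(...)" with no nested parentheses or
    // backslash escapes, followed by the Tj (show text) operator.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Returns the string operands of every simple Tj in the stream, in order.
    static List<String> shownStrings(String contentStream) {
        List<String> out = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written sample stream, assumed to be already decoded.
        String stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (world) Tj ET";
        System.out.println(shownStrings(stream));
    }
}
```

The strings found this way are raw bytes as far as the PDF is concerned; mapping them to readable characters depends on the font in effect, which is why a ready-made text stripper is usually the easier route.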
luizpapa
Joined: 08 Nov 2007 Posts: 2
Posted: Fri Nov 09, 2007 4:39 pm Post subject: Pdfbox
I think I will use PDFBox for this.
The code below does exactly what I want. The only drawback is that I have to add IKVM to my project references.
// Load the PDF (PDFBox is a Java library, used here from .NET through IKVM).
org.pdfbox.pdmodel.PDDocument doc = org.pdfbox.pdmodel.PDDocument.load(txtFile.Text);
// Register the rectangle to extract; x, y, width and height are supplied by the caller.
org.pdfbox.util.PDFTextStripperByArea stripper = new org.pdfbox.util.PDFTextStripperByArea();
java.awt.geom.Rectangle2D rect = new java.awt.geom.Rectangle2D.Double(x, y, width, height);
stripper.addRegion("regiao1", rect);
stripper.setSortByPosition(true);
// Take the first page of the document.
org.pdfbox.pdmodel.PDDocumentCatalog cat = doc.getDocumentCatalog();
org.pdfbox.pdmodel.PDPageNode pn = cat.getPages();
org.pdfbox.pdmodel.PDPage pag = pn.getKids().toArray()[0] as org.pdfbox.pdmodel.PDPage;
// Extract the region's text, then close the document before returning.
stripper.extractRegions(pag);
string text = stripper.getTextForRegion("regiao1");
doc.close();
return text;
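To make clearer what extracting by area with position sorting amounts to, here is a rough sketch in plain Java. It is not PDFBox's implementation: the Fragment class and textInRegion method are hypothetical, the positions are hand-written sample data, and the sketch assumes y grows downward from the top of the page. It keeps the fragments whose position falls inside the rectangle and joins them top to bottom, then left to right.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RegionTextSketch {
    // Hypothetical stand-in for one positioned piece of text on a page.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps fragments located inside the region and joins them in reading order.
    static String textInRegion(List<Fragment> fragments, Rectangle2D region) {
        List<Fragment> hits = new ArrayList<>();
        for (Fragment f : fragments) {
            if (region.contains(f.x, f.y)) {
                hits.add(f);
            }
        }
        // Sort by position: smaller y first (top), then smaller x (left).
        hits.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : hits) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = Arrays.asList(
                new Fragment("world", 60, 20),
                new Fragment("Hello", 20, 20),
                new Fragment("footer", 20, 700),
                new Fragment("next", 20, 40));
        Rectangle2D region = new Rectangle2D.Double(0, 0, 100, 50);
        System.out.println(textInRegion(page, region)); // prints "Hello world next"
    }
}
```

Without the sort, the fragments would come out in whatever order they were stored, which for a PDF is the drawing order and not necessarily the reading order.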