antesima
Joined: 02 Jul 2008 Posts: 5
|
Posted: Wed Jul 02, 2008 9:46 am Post subject: Problem retrieving raw text content |
|
|
Hello,
thanks for this excellent library, I use it almost every day.
Now I'm having problems retrieving raw text from an existing
PDF file.
I use the following code, and the strings are then parsed to find some info.
Code: |
PdfDocument pddDoc = PdfReader.Open(strPath_, PdfDocumentOpenMode.ReadOnly);
foreach (PdfPage ppgPage in pddDoc.Pages)
{
    // Append the raw content stream of each page (use the loop variable ppgPage).
    strReturn += ppgPage.Contents.CreateSingleContent().Stream.ToString();
}
|
All is OK with PDF files generated by MS Reporting Services,
but with some PDF files generated with MigraDoc I retrieve no text,
just codes like these:
Code: |
15 0 Td <005000440076005700550048> Tj
36.979 0 Td <003C005900480056> Tj
27.662 0 Td <002700480045004800570048005100460052005800550057> Tj
|
They look like dictionary keys, but how can I extract the text content from them?
Regards,
Antesima |
|
|
|
antesima
Joined: 02 Jul 2008 Posts: 5
|
Posted: Wed Jul 16, 2008 2:48 pm Post subject: |
|
|
Does anybody have a clue?
Do you need sample code that generates the PDF?
(It is generated with PDFsharp using the Times New Roman font.) |
|
|
|
Thomas Hoevel
Joined: 16 Oct 2006 Posts: 387 Location: Cologne, Germany
|
Posted: Wed Jul 16, 2008 3:12 pm Post subject: |
|
|
Between the angle brackets you see Unicode characters in hex format.
You can convert them to Unicode strings using .NET (take 4 hex digits, convert them to an int, cast it to a char, append it to the string).
Since the high byte is always 00 (in the samples shown), these are ordinary ANSI chars.
I may be wrong: maybe these are not Unicode chars, but indices into the font subset.
It should also be possible to create ANSI PDF files with MigraDoc (it's a parameter of PdfDocumentRenderer).
OTOH, for compatibility of your application with unknown PDF files you should support Unicode, too.
_________________
Regards
Thomas Hoevel
PDFsharp Team |
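The conversion described in this reply can be sketched roughly as follows. This is only an illustration, not PDFsharp API: the class and method names (`HexStringSketch`, `ExtractHexOperands`, `DecodeAsUtf16`) are hypothetical, and the sketch assumes the content stream is already available as a string and that text is shown only with hex string operands of the `Tj` operator (no `TJ` arrays, no literal strings). As the reply warns, the result is readable only if the codes really are Unicode values.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class HexStringSketch
{
    // Matches a hex string operand followed by the Tj (show text) operator,
    // e.g. "<00480069> Tj".
    private static readonly Regex HexTj = new Regex(@"<([0-9A-Fa-f\s]*)>\s*Tj");

    // Returns the hex operand of every Tj operator found in the content stream text.
    public static List<string> ExtractHexOperands(string content)
    {
        var result = new List<string>();
        foreach (Match m in HexTj.Matches(content))
        {
            // Whitespace inside a PDF hex string is insignificant, so strip it.
            result.Add(Regex.Replace(m.Groups[1].Value, @"\s+", ""));
        }
        return result;
    }

    // Treats every group of 4 hex digits as one 16-bit code and casts it to a char.
    // Trailing digits that do not form a full group of 4 are ignored.
    public static string DecodeAsUtf16(string hex)
    {
        var sb = new StringBuilder(hex.Length / 4);
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            int code = int.Parse(hex.Substring(i, 4), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            sb.Append((char)code);
        }
        return sb.ToString();
    }
}
```

For example, `HexStringSketch.DecodeAsUtf16("00480069")` yields `"Hi"`. If the decoded strings look like gibberish, the codes are probably glyph indices rather than Unicode values.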
|
|
|
antesima
Joined: 02 Jul 2008 Posts: 5
|
Posted: Thu Jul 17, 2008 6:27 am Post subject: |
|
|
OK, thank you. I will give it a try and report back.
Regards,
Antesima |
|
|
|
antesima
Joined: 02 Jul 2008 Posts: 5
|
Posted: Thu Jul 24, 2008 8:56 am Post subject: |
|
|
It doesn't seem to fit...
Here is the string I am trying to convert:
"00280057005800470048"
and the code I use:
Code: |
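private static string ConvertNumericUnicode(string strArgument_)
{
    // Decodes a hex string by treating every 4 hex digits as one UTF-16 code unit.
    // The result is readable only if the codes really are Unicode values.
    string strResult = string.Empty;
    string strCurrent = strArgument_;
    while (strCurrent.Length >= 4)
    {
        // Parse the next 4 hex digits (AllowHexSpecifier accepts no "0x" prefix).
        int iChar = Int32.Parse(strCurrent.Substring(0, 4), System.Globalization.NumberStyles.AllowHexSpecifier);
        char cTemp = (char)iChar;
        strResult += cTemp;
        strCurrent = strCurrent.Substring(4);
    }
    return strResult;
}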
|
Am I missing something?
Regards,
Antesima |
|
|
|
Thomas Hoevel
Joined: 16 Oct 2006 Posts: 387 Location: Cologne, Germany
|
Posted: Thu Jul 24, 2008 3:12 pm Post subject: |
|
|
So it seems these are indices into the font subsets, not Unicode character codes (that would be too simple); don't blame me, I warned you about it.
So you have to add another level of indirection by looking into the font table. That's not my area of expertise, so I can't give you any clue.
The other solution: create ANSI PDF files ...
_________________
Regards
Thomas Hoevel
PDFsharp Team |
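One common form of the extra level of indirection mentioned above is a ToUnicode CMap attached to the font, which maps the codes used in the content stream to Unicode text. The following is a minimal sketch under several assumptions: the font actually has a ToUnicode CMap, its decoded text has already been obtained as a string (how to fetch it with PDFsharp is not shown here), codes are 2 bytes wide, and only `bfchar` entries and the simple `<lo> <hi> <start>` form of `bfrange` occur, with destinations in the Basic Multilingual Plane. All names (`ToUnicodeSketch`, `ParseCMap`, `Decode`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeSketch
{
    private static int Hex(string s)
    {
        return int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Converts a UTF-16BE hex string (4 hex digits per code unit) to a .NET string.
    private static string Utf16FromHex(string hex)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
            sb.Append((char)Hex(hex.Substring(i, 4)));
        return sb.ToString();
    }

    // Builds a code -> text table from the bfchar and simple bfrange sections of a CMap.
    public static Dictionary<int, string> ParseCMap(string cmap)
    {
        var map = new Dictionary<int, string>();

        // bfchar entries: <code> <unicode>
        foreach (Match section in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"))
                map[Hex(m.Groups[1].Value)] = Utf16FromHex(m.Groups[2].Value);

        // bfrange entries: <lo> <hi> <start>; the array form is not handled in this sketch.
        foreach (Match section in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value,
                     @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"))
            {
                int lo = Hex(m.Groups[1].Value), hi = Hex(m.Groups[2].Value), start = Hex(m.Groups[3].Value);
                for (int code = lo; code <= hi; code++)
                    map[code] = ((char)(start + (code - lo))).ToString();
            }
        return map;
    }

    // Looks up every 4-digit code of a hex string; unmapped codes become U+FFFD.
    public static string Decode(string hex, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            string text;
            sb.Append(map.TryGetValue(Hex(hex.Substring(i, 4)), out text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }
}
```

A hex operand taken from a `Tj` operator would then be passed to `Decode` together with the table built from the CMap of the font that is current at that point in the content stream. Tracking which font is current (the `Tf` operator) is left out of this sketch.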
|
|
|
antesima
Joined: 02 Jul 2008 Posts: 5
|
Posted: Tue Jul 29, 2008 12:38 pm Post subject: |
|
|
OK, thank you. I will try to get the fonts and extract the text.
If I manage to get some code that works, I will publish it here. |
|