PDFsharp - moved to http://forum.pdfsharp.net/
Please visit the new PDFsharp forum at http://forum.pdfsharp.net/
 
Problem retrieving raw text content

 
antesima



Joined: 02 Jul 2008
Posts: 5

Posted: Wed Jul 02, 2008 9:46 am    Post subject: Problem retrieving raw text content

Hello,

thanks for this excellent library, I use it almost every day.

Now I'm having problems retrieving raw text from an existing
PDF file.

I use the following code; the resulting string is then parsed to find some info.

Code:

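// Open the existing PDF read-only.
PdfDocument pddDoc = PdfReader.Open(strPath_, PdfDocumentOpenMode.ReadOnly);
string strReturn = "";

// Append the raw content stream of every page to one string.
foreach (PdfPage ppgPage in pddDoc.Pages)
{
    strReturn += ppgPage.Contents.CreateSingleContent().Stream.ToString();
}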


This works fine with PDF files generated by MS Reporting Services,
but with some PDF files generated with MigraDoc I get no text,
only codes like these:

Code:

15 0 Td <005000440076005700550048> Tj
36.979 0 Td <003C005900480056> Tj
27.662 0 Td <002700480045004800570048005100460052005800550057> Tj


They look like dictionary keys, but how can I extract the text content from them?

Regards,
Antesima
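
To get at the hex strings in a content stream like the one quoted above, one option is to pull out every operand that directly precedes a Tj operator. The sketch below is only an illustration and not part of the PDFsharp API: the names ContentStreamHex and ExtractHexOperands are hypothetical, and it ignores TJ arrays, literal (...) strings, and escaping, so it only suits simple streams like the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamHex
{
    // Matches a hex string operand that is directly followed by the Tj (show text) operator.
    private static readonly Regex HexTj =
        new Regex(@"<([0-9A-Fa-f\s]*)>\s*Tj", RegexOptions.Compiled);

    // Returns the hex digits of every <...> Tj operand, in the order they appear.
    public static List<string> ExtractHexOperands(string content)
    {
        var result = new List<string>();
        foreach (Match m in HexTj.Matches(content))
        {
            // White space inside a PDF hex string is insignificant, so drop it.
            result.Add(Regex.Replace(m.Groups[1].Value, @"\s+", ""));
        }
        return result;
    }
}
```

Each returned string is still just a run of hex digits; what those digits mean depends on the font's encoding.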
antesima



Joined: 02 Jul 2008
Posts: 5

Posted: Wed Jul 16, 2008 2:48 pm

Does anybody have a clue?

Do you need sample code that generates the PDF?
(It is generated with PDFsharp using the Times New Roman font.)
Thomas Hoevel



Joined: 16 Oct 2006
Posts: 387
Location: Cologne, Germany

Posted: Wed Jul 16, 2008 3:12 pm

Between the brackets you see Unicode characters in hex format.
You can convert them to Unicode strings using .NET (take 4 chars, convert to int, convert to char, add to string).

Since the high byte is always 00 (in the samples shown), these are ordinary ANSI chars.

I may be wrong: maybe these are not Unicode chars, but indices into the font subset.

It should also be possible to create ANSI PDF files with MigraDoc (it's a parameter of PdfDocumentRenderer).
OTOH for compatibility of your application with unknown PDF files you should support Unicode, too.
_________________
Regards
Thomas Hoevel
PDFsharp Team
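
The recipe described above (take 4 hex digits, convert to an integer, convert to a char, append to the string) can be sketched as follows. HexCodes and DecodeAsUtf16 are hypothetical names, and the sketch assumes every character is written as exactly 4 hex digits.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HexCodes
{
    // Treats each group of 4 hex digits as one UTF-16 code unit.
    public static string DecodeAsUtf16(string hex)
    {
        if (hex == null) throw new ArgumentNullException(nameof(hex));
        if (hex.Length % 4 != 0)
            throw new FormatException("Expected a multiple of 4 hex digits.");

        var sb = new StringBuilder(hex.Length / 4);
        for (int i = 0; i < hex.Length; i += 4)
        {
            // Parse one 16-bit value and append it as a char.
            ushort code = ushort.Parse(hex.Substring(i, 4),
                NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            sb.Append((char)code);
        }
        return sb.ToString();
    }
}
```

Applied to the first sample from this thread, 005000440076005700550048, this yields "PDvWUH". That is not readable text, which is consistent with the caveat that the values may be indices into the font subset rather than Unicode characters.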
antesima



Joined: 02 Jul 2008
Posts: 5

Posted: Thu Jul 17, 2008 6:27 am

OK, thank you. I will give it a try and report back.

Regards,
Antesima
antesima



Joined: 02 Jul 2008
Posts: 5

Posted: Thu Jul 24, 2008 8:56 am

It doesn't seem to fit...

Here is the string I am trying to convert:

"00280057005800470048"

and the code I use:

Code:

// Interprets each group of 4 hex digits as one UTF-16 code unit.
private static string ConvertNumericUnicode(string strArgument_)
{
    var sbResult = new System.Text.StringBuilder();
    // Walk the input in steps of 4 hex digits; a trailing partial group is ignored.
    for (int i = 0; i + 4 <= strArgument_.Length; i += 4)
    {
        int iChar = Int32.Parse(strArgument_.Substring(i, 4), System.Globalization.NumberStyles.AllowHexSpecifier);
        sbResult.Append((char)iChar);
    }
    return sbResult.ToString();
}


Am I missing something?

Regards,
Antesima
Thomas Hoevel



Joined: 16 Oct 2006
Posts: 387
Location: Cologne, Germany

Posted: Thu Jul 24, 2008 3:12 pm

So it seems these are indices into the font subsets, not Unicode character codes (that would be too simple :( ); don't blame me, I warned you about it.

So you have to add another level of indirection by looking into the font table. That's not my area of expertise, so I can't give you any pointers.

The other solution: create ANSI PDF files ...
_________________
Regards
Thomas Hoevel
PDFsharp Team
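
In PDF, the extra level of indirection mentioned above is usually the font's ToUnicode CMap, a small text stream that maps the codes used in the content stream back to Unicode. What follows is a minimal sketch under several assumptions: the font in the PDF actually carries a ToUnicode stream, its text has already been extracted into a string, and all codes and targets are 2 bytes (4 hex digits). The names ToUnicodeSketch, ParseCMap, and Decode are hypothetical, and the array form of bfrange is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeSketch
{
    private static int Hex(string s) =>
        int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

    // Builds a code -> character table from the text of a ToUnicode CMap.
    public static Dictionary<int, char> ParseCMap(string cmap)
    {
        var map = new Dictionary<int, char>();

        // bfchar sections hold pairs: <code> <unicode>
        foreach (Match section in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value,
                         @"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>"))
                map[Hex(m.Groups[1].Value)] = (char)Hex(m.Groups[2].Value);

        // bfrange sections hold triples: <firstCode> <lastCode> <firstUnicode>
        foreach (Match section in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value,
                         @"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>"))
            {
                int first = Hex(m.Groups[1].Value);
                int last = Hex(m.Groups[2].Value);
                int start = Hex(m.Groups[3].Value);
                for (int code = first; code <= last; code++)
                    map[code] = (char)(start + (code - first));
            }

        return map;
    }

    // Decodes a hex string of 4-digit codes; unmapped codes become U+FFFD.
    public static string Decode(string hex, Dictionary<int, char> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            int code = Hex(hex.Substring(i, 4));
            sb.Append(map.TryGetValue(code, out char c) ? c : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

For example, given a CMap containing the made-up range <0044> <0058> <0061>, the code 0048 decodes to "e". The real table has to be read from the font dictionary of the PDF in question.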
antesima



Joined: 02 Jul 2008
Posts: 5

Posted: Tue Jul 29, 2008 12:38 pm

OK, thank you. I will try to get the fonts and extract the text.

If I manage to get some code that works, I will post it here.
All times are GMT