PDFsharp Forum Index - moved to http://forum.pdfsharp.net/
Please visit the new PDFsharp forum at http://forum.pdfsharp.net/
 

Migradoc: encoding ö, ä, ß, ü, etc from html text

 
Mpasc



Joined: 03 Dec 2008
Posts: 6

Posted: Mon Dec 08, 2008 9:43 am

Hello all,

I have a German text from a textarea that contains ö, ä, ß, ü, etc., but these characters appear in the PDF as squares.

I create a Migradoc document and then I render it with PdfDocumentRenderer in the following way:


Code:

//First I parse the HTML text

htmlText = htmlText.Replace("&#196 ;", "Ä");
htmlText= htmlText.Replace("&#203 ;", "Ë");
htmlText= htmlText.Replace("&#207 ;", "Ï");
htmlText= htmlText.Replace("&#214 ;", "Ö");
htmlText= htmlText.Replace("&#220 ;", "Ü");

htmlText = htmlText.Replace("&#228 ;", "ä");
htmlText = htmlText.Replace("&#235 ;", "ë");
htmlText = htmlText.Replace("&#239 ;", "ï");
htmlText = htmlParagraphs.Replace("&#246 ;", "o");
htmlText = htmlParagraphs.Replace("&#252 ;", "ü");
htmlText = htmlParagraphs.Replace("&#223 ;", "ß");
       


Document document = new Document();

//Then I create the sections and paragraphs with the text
[...]

//Finally I create the PdfDocumentRenderer object like this:

PdfDocumentRenderer renderer = new PdfDocumentRenderer(true, PdfSharp.Pdf.PdfFontEmbedding.Always);
renderer.Document = document;
renderer.RenderDocument();

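//Save the rendered PDF into a memory stream
//(assumed step: the original post does not show how "stream" is created)
MemoryStream stream = new MemoryStream();
renderer.PdfDocument.Save(stream, false);

//And send it to the browser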
Response.Clear();
Response.ClearContent();
Response.ClearHeaders();
Response.Buffer = true;
Response.ContentType = "application/pdf";
Response.AddHeader("content-length", stream.Length.ToString());
Response.BinaryWrite(stream.ToArray());
Response.Flush();
stream.Close();
Response.End();



[NOTE: in the original code there is no space between the code and the semicolon (e.g. "&#196" followed directly by ";"), but I wrote it like this so that the browser does not decode it.]

But as I said before, the diaeresis and other special characters are displayed as empty squares.

Thank you!
_________________
MPasc
Thomas Hoevel



Joined: 16 Oct 2006
Posts: 387
Location: Cologne, Germany

Posted: Mon Dec 08, 2008 1:38 pm

Hi!

Could this be the error:
Code:
htmlText = htmlText.Replace("&#239 ;", "ï");
htmlText = htmlParagraphs.Replace("&#246 ;", "o");


All previous replacements made to htmlText are discarded, because the new assignment starts again from htmlParagraphs.

All ANSI characters should work (be sure to activate Unicode if you want to include non-ANSI characters).
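
The pitfall described above can be reproduced in isolation. The following is only a minimal sketch, not the poster's actual code: the class and method names are hypothetical, and the entities are written without the extra space that was inserted in the post for display purposes.

```csharp
using System;

public static class ReplaceChainDemo
{
    // Buggy variant: the second statement starts again from the untouched
    // input, so the result of the first replacement is thrown away.
    public static string BrokenChain(string htmlParagraphs)
    {
        string htmlText = htmlParagraphs.Replace("&#228;", "\u00E4"); // ä
        htmlText = htmlParagraphs.Replace("&#246;", "\u00F6");        // ö, overwrites htmlText
        return htmlText;
    }

    // Fixed variant: every replacement works on the result of the previous one.
    public static string FixedChain(string htmlParagraphs)
    {
        string htmlText = htmlParagraphs;
        htmlText = htmlText.Replace("&#228;", "\u00E4"); // ä
        htmlText = htmlText.Replace("&#246;", "\u00F6"); // ö
        return htmlText;
    }

    public static void Main()
    {
        string input = "K&#228;se und Br&#246;tchen";
        // The broken variant still contains the first entity.
        Console.WriteLine(BrokenChain(input).Contains("&#228;"));
        // The fixed variant contains no entity at all.
        Console.WriteLine(FixedChain(input).Contains("&#"));
    }
}
```

The results are checked with Contains rather than printed directly, so the outcome does not depend on the console encoding.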
_________________
Regards
Thomas Hoevel
PDFsharp Team
Mpasc

Posted: Mon Dec 08, 2008 1:44 pm

Thomas Hoevel wrote:
Hi!

Could this be the error:
Code:
htmlText = htmlText.Replace("&#239 ;", "ï");
htmlText = htmlParagraphs.Replace("&#246 ;", "o");


All previous replacements at htmlText are overwritten with the new assignment from htmlParagraphs.

All ANSI characters should work (be sure to activate Unicode if you want to include non-ANSI characters).


Sorry, htmlParagraphs was the original name of the variable; I changed it here to make my explanation clearer. So the original code is:

Code:

        htmlParagraphs = htmlParagraphs.Replace("&#196 ;", "Ä");
        htmlParagraphs = htmlParagraphs.Replace("&#203 ;", "Ë");
        htmlParagraphs = htmlParagraphs.Replace("&#207 ;", "Ï");
        htmlParagraphs = htmlParagraphs.Replace("&#214 ;", "Ö");
        htmlParagraphs = htmlParagraphs.Replace("&#220 ;", "Ü");

        htmlParagraphs = htmlParagraphs.Replace("&#228 ;", "ä");
        htmlParagraphs = htmlParagraphs.Replace("&#235 ;", "ë");
        htmlParagraphs = htmlParagraphs.Replace("&#239 ;", "ï");
        htmlParagraphs = htmlParagraphs.Replace("&#246 ;", "o");
        htmlParagraphs = htmlParagraphs.Replace("&#252 ;", "ü");
        htmlParagraphs = htmlParagraphs.Replace("&#223 ;", "ß");

htmlParagraphs is just a String containing the HTML-encoded text.

However, you mentioned that I should make sure to activate Unicode. When I create the PdfDocumentRenderer I set it like:

Code:

PdfDocumentRenderer renderer = new PdfDocumentRenderer(true, PdfSharp.Pdf.PdfFontEmbedding.Always);


Is there anything else I need to do to activate Unicode?

Thank you!!
_________________
MPasc
Thomas Hoevel

Posted: Mon Dec 08, 2008 3:24 pm

Mpasc wrote:
But as I said before the diaeresis and other special characters are displayed as empty squares.

I guess I was on the wrong track.

Which font do you use?
The empty square is normally the default character for anything that's not implemented in a font.
The default font for MigraDoc is "Verdana".
_________________
Regards
Thomas Hoevel
PDFsharp Team
Mpasc

Posted: Tue Dec 09, 2008 8:58 am

Hello,

I use Arial.

Following the example HelloMigradoc I define the style in a method like this:

Code:

public static void DefineStyles(Document document)
{
            MigraDoc.DocumentObjectModel.Style style;

            // Get the predefined style Normal.
            style = document.Styles["Normal"];

            // Modify the style
            style.Font.Name = "Arial";
            style.Font.Size = 10;
            style.Font.Bold = false;
            style.ParagraphFormat.Alignment = ParagraphAlignment.Justify;
            style.ParagraphFormat.SpaceBefore = 12;
            style.ParagraphFormat.SpaceAfter = 12;

            //Style for Heading1

            style = document.Styles["Heading1"];
            style.Font.Name = "Arial";
            style.Font.Size = 14;
            style.Font.Bold = true;
            style.Font.Color = Colors.DarkBlue;
            style.ParagraphFormat.PageBreakBefore = true;
            style.ParagraphFormat.SpaceAfter = 6;

            // Create a new style called TextBox based on style Normal
            style = document.Styles.AddStyle("TextBox", "Normal");
            style.Font.Bold = true;
            style.Font.Size = 40;
           
            style.ParagraphFormat.Borders.Width = 2.5;
            style.ParagraphFormat.Borders.Distance = 3;
}


And then, in another method, I create the paragraphs and set the style:

Code:

            public static Paragraph CreateParagraph(Document document, String text, String style)
        {
            //the style parameter is Normal or TextBox
           
            Paragraph paragraph = document.LastSection.AddParagraph();
            paragraph.Style = style;
            paragraph.AddFormattedText(HTMLParser.getUntaggedText(text), style);

            return paragraph;
           
        }


The results are:
    - The TextBox style does not work (all text has Normal style then)
    - The vowels with diaeresis are still replaced by blank squares

Any clue?

Thank you very much.
_________________
MPasc
Thomas Hoevel

Posted: Tue Dec 09, 2008 10:00 am

Mpasc wrote:
And, besides, I still see the blank squares instead of the diaeresis.

Umlaute do work with PDFsharp.
Are Umlaute handled correctly in your source code? Visual Studio I presume? Did you set file encoding to UTF-8?
Do you see correct strings in the Debugger?

Did you try to save a PDF file on the server? Check the Umlaute there.
Maybe they get lost while transferring the file from the server to the client.

Have you tried using HtmlDecode instead of replacing the characters?

BTW: "Normal" is the default style that is used if the Style of a paragraph is null.
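
As an illustration of the HtmlDecode suggestion, here is a minimal sketch. It assumes a current .NET runtime, where System.Net.WebUtility.HtmlDecode is available; in an ASP.NET project from 2008 the equivalent call would be System.Web.HttpUtility.HtmlDecode. The class name and the sample string are hypothetical.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Decodes numeric (&#246;) and named (&ouml;) character references in one
    // call, instead of one Replace per character.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        string decoded = Decode("Gr&#252;&#223;e aus K&ouml;ln");

        // Print the code points of all non-ASCII characters, so the check
        // does not depend on the console font or encoding.
        foreach (char c in decoded)
        {
            if (c > 127)
            {
                Console.WriteLine("U+{0:X4}", (int)c);
            }
        }
    }
}
```

A single decode call also covers characters that a hand-written replacement list might miss.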
_________________
Regards
Thomas Hoevel
PDFsharp Team
Mpasc

Posted: Wed Dec 10, 2008 2:29 pm

Hello,

I already run the text from the textarea through HtmlDecode. However, the characters with diaeresis still arrive as numeric codes. For example, I get "&#239 ;" (with no blank space between the characters) for ï, which is why I tried to replace them as I did.
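
If a numeric reference survives a call to HtmlDecode, one possible cause is that the text was encoded twice, so that the ampersand itself arrives as "&amp;". This is an assumption and is not confirmed anywhere in this thread. The sketch below uses System.Net.WebUtility and a hypothetical helper, DecodeFully, that decodes repeatedly until the string stops changing.

```csharp
using System;
using System.Net;

public static class DoubleDecodeDemo
{
    // Hypothetical helper: decode until the text no longer changes.
    // The loop is capped so that unexpected input cannot spin forever.
    public static string DecodeFully(string html)
    {
        string current = html;
        for (int i = 0; i < 5; i++)
        {
            string next = WebUtility.HtmlDecode(current);
            if (next == current)
            {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void Main()
    {
        // "ï" encoded twice: first as a numeric reference, then the "&" again.
        string encodedTwice = "na&amp;#239;ve";

        string once = WebUtility.HtmlDecode(encodedTwice);
        string full = DecodeFully(encodedTwice);

        Console.WriteLine(once.Contains("&#239;"));     // reference still present after one pass
        Console.WriteLine(full.IndexOf('\u00EF') >= 0); // real character present after full decoding
    }
}
```

Repeated decoding should be used with care: text that is legitimately meant to show an entity literally would also be decoded.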

I have debugged the application and I can see that the string has the blank squares already in the server, so they are not being lost during the transfer to the client.

Regarding your other suggestions:

    - What are "Umlaute"?
    - I work with Microsoft Visual Studio 2005.
    - How can I encode to UTF-8? Should I set it somehow in the Migradoc document?


Thank you very much for all your help!
_________________
MPasc
Thomas Hoevel

Posted: Thu Dec 11, 2008 8:56 am

Hello!
Mpasc wrote:
I have debugged the application and I can see that the string has the blank squares already in the server

That leaves me rather clueless.

So I'd say there are three possible explanations:
  • The special characters are already lost while replacing
  • The special characters do not exist in the fonts on the server
  • With remote debugging: maybe characters get lost between server and debugger client


AFAIK a "string" in C# is always Unicode. Special characters are no problem for C#.

There's no diaeresis in German. We have ÄÖÜäöü and call them "Umlaute". The ligature ß is a different story, but PDFsharp handles all these characters correctly (Unicode mode or not).

As long as you see blank squares in the debugger try to cure the problem in the C# code on the server. It can't be a problem of the HTML response settings or the MigraDoc settings.
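
Because the squares shown in a debugger can also come from the display font, it may help to look at the code points the string actually contains. The following is a minimal sketch with hypothetical names: if the output shows U+00FC for ü, the string is intact and only the display is at fault, whereas U+003F (a question mark) or U+FFFD (the Unicode replacement character) would indicate that the character was already lost in the C# code.

```csharp
using System;
using System.Text;

public static class CodePointDump
{
    // Returns the UTF-16 code units of the text, e.g. "U+0066 U+00FC U+0072".
    public static string Describe(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "für", with the umlaut written as an escape so that the source
        // file encoding cannot interfere with the test.
        Console.WriteLine(Describe("f\u00FCr"));
    }
}
```

Calling Describe on the string just before it is passed to MigraDoc would show whether the replacement step or an earlier stage damaged the text.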
_________________
Regards
Thomas Hoevel
PDFsharp Team
All times are GMT